Avesha Score
Avesha Blogs

3 July, 2023

15 min read

Editor’s note: this transcript has been lightly edited for clarity.

Dheeraj Ravula:           
- Good afternoon, and welcome to today's webinar, on the topic - the Cloud Problem Nobody Is Talking About. I'd like to welcome two distinguished panelists who are making significant contributions to innovations in the cloud, especially on the Kubernetes front. Our first panelist is Shlomo Bielak, VP of Engineering at theScore. He's a seasoned exec and a thought leader, very focused on the human element in delivering effective and efficient solutions in the cloud. Our second panelist is Raj Nair, CEO and co-founder at Avesha. He's a serial entrepreneur and an innovator credited with delivering the first internet load balancer. This was back in 1997, called ArrowPoint. Avesha products help run Kubernetes workloads efficiently and cost effectively in any cloud, leveraging the power of AI and reinforcement learning. They bring the power of intelligent remediation to your Kubernetes deployments. I'm your host, Dheeraj Ravula, head of Customer Success at Avesha. Today's topic for the webinar is the challenges of delivering SLOs and SLAs in the cloud, and the challenges of doing intelligent remediation in the cloud, despite having a host of tools, data sets, and infrastructure available to you. Welcome, Shlomo. Welcome, Raj.
- Thank you.           
- Thank you.

Dheeraj Ravula:           
- That brings us to the first question, right? There's a whole host of APM products out there, access to metrics and data across every layer in the infrastructure for application deployed in the cloud. Shlomo, this question's for you. Are you able to do something to help your SREs with intelligent remediation with respect to maintaining an SLA in the cloud?

Shlomo Bielak           
- So, I think most organizations have a preventative measure that they put in place. So for example, many organizations know of events that are coming. They prepare for them. They put a task force and a plan together, and teams work around the clock, additional to their current workload, and build the resilience for that event. It has some disadvantages, which is that it draws on people more than it should, especially if you have lots of events, and it basically increases your costs because you’re doing it prior to the event. And then the work effort after the event also exists. The reason being, that there are technology gaps when it comes to scaling specific application requirements and the downstream effects of that, which is why we’re working with Avesha right now.

Dheeraj Ravula           
- Awesome. So there's a deluge of tools and data, but you still need to kind of make sense out of it, I guess.

Shlomo Bielak           
- Look, a lot of the engineers on the front lines, which I do consider to be front lines, you have healthcare front lines, and you have technology front lines. It’s quite stressful for the events that are predicted, it’s wait, for the events that are unpredicted, it’s stress. So I think having a solution that covers both of them would be a benefit in both circumstances.

Dheeraj Ravula           
- Awesome. So, what is the most pressing need that you see, right, with how do you use this data besides just for planning for things?

Shlomo Bielak           
- I think it changes the mindset that most organizations use. In the past, people used VMware to prevent applications from failing, using DRS and vMotion. With cloud native, you have scaling that goes vertically as nodes, and horizontally as pods. And what essentially happens is you can scale without actually creating resilience per CPU. You should be evaluating how effective your systems are, and until a solution is made, pretend you have constraints, because there are financial constraints, but most organizations don't handle those very well. So there's a financial side to this. If you look at how you're using the system, just because I can do something doesn't mean it's the right way to do it. I think having things scaled correctly for SLOs and then reacting to meet those SLOs in a predictive manner is really what we should be after. The challenge is we're scaling bigger and bigger to solve resilience challenges. And I think we've been doing our utmost to get the best out of every CPU, and make sure that we're heading towards a very unimpactful human event when there are significant events for our business.

Dheeraj Ravula        
- Right. It's almost like the problem with the plentiful. Right? It seems like there's an unlimited supply, so you kind of fall into this trap of naturally kind of taking this more carefully. With that, I mean, do you see any tools out there that look promising? That help you with some of these decisions.

Shlomo Bielak        
- Well, I mean we've looked together with Avesha at your Smart Scaler capabilities where you can predict and create a service map of your service, and how it needs to scale together. So I didn't touch on it, but essentially there are services that are dependent on each other. There's no microservices system that isn't related to another microservices system. So, cloud native, while it simplifies monoliths failing, it complicates the interdependencies of ratios between systems needed to scale, and that is where you need help. Humans aren't good at managing that level of complexity in a stressful scenario, even without the stressful scenario it's really complicated. So that is where technology does a really good job. It can crunch numbers for you, and take your business requirements, and implement them without you having to be there under pressure.
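
Editor's illustration: the idea of scaling dependent services "together" by ratio can be sketched in a few lines. This is a minimal sketch, not Smart Scaler's method. The service names, the ratio values, and the `plan_replicas` helper are all hypothetical, and the sketch assumes the service map has no cycles.

```python
import math

# Hypothetical service map: how many replicas of a downstream service are
# needed per replica of the upstream service that calls it.
RATIOS = {
    "frontend": {"orders": 0.5, "catalog": 1.5},
    "orders": {"payments": 0.5},
}

def plan_replicas(root, root_replicas, ratios):
    """Propagate a replica count from the root service down the service map.

    Assumes the map is acyclic. If a service has several callers, it keeps
    the largest count that any caller asks for.
    """
    plan = {root: root_replicas}
    queue = [root]
    while queue:
        svc = queue.pop(0)
        for child, ratio in ratios.get(svc, {}).items():
            needed = math.ceil(plan[svc] * ratio)
            if needed > plan.get(child, 0):
                plan[child] = needed
                queue.append(child)
    return plan

plan = plan_replicas("frontend", 8, RATIOS)
print(plan)
```

The point of the sketch is that one decision (how many frontends) implies the others, which is the bookkeeping that is hard for a person to do under pressure.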

Dheeraj Ravula        
- Right. Raj? I mean, this is probably an area where I think that's been a focus for us, right? Is kind of solving human element. Because as we move from monoliths to microservices, and kind of harness the power of being able to scale more efficiently, we've also introduced more knobs, right? More data, more knobs, more things to kind of fine-tune, and the problem of a human trying to do this, right?

Raj Nair        
- Yeah, I mean that's exactly right. I mean the approach we have taken is different than what others have. We are looking at the data, we are looking at historical data for application behavior. And then from that data, using the tools that we have now, you know, the ML, and in particular reinforcement learning, which is, you know, the heart of generative AI, right? And so it's a semi-supervised learning where you actually take this data, look at the behavior of each microservice, and then as Shlomo pointed out, you know, the interlinking of them, which is all deduced from the data. That's the beauty of it. And then from that, you create this model that knows essentially how to scale for different loads to the application, how to scale individual workloads to the right amount, to achieve the objective for the entire application, the SLO for the entire application. And this is an innovative approach, which I've not seen anybody else do. And we are very happy to bring this to the market and help solve real problems. You know, as you pointed out, Dheeraj, people don't need to worry about what knobs to set and how much to set them to and so on, you know, to achieve a particular objective. And this is exactly what ML is suited for, you know, to be able to take a lot of behavioral data and then to actually crunch through that, and then do something about it. So that's really what we try to do.

Dheeraj Ravula        
- Shlomo, working with you, you always talk about the human element, right? The human element of taking production data, right? And there's lots of it, and lots of moving parts, lots of logs, lots of metrics, and then having to make sense out of it, right? I mean, using that to simulate the same thing, just so you can performance-tune, and make sure that you meet an SLA. That's a lot of work. And it's usually work for your most talented individuals, right? Usually your smartest developers, your smartest SREs, and the DevOps people that you put on this task, which, I mean go just kind of digging into my own, my past when I was an engineer, doing performance tuning for production. It's exciting, but it's very hard to do it again and again, right?

Shlomo Bielak        
- I generally go towards the idea that people who are extremely talented shouldn't face consequences for being a hero. You shouldn't need a hero. There should be a hero infrequently. And that's how we should be using talent. I think it's a mindset change that's needed within the market. So I take it as a simple analogy, when you rent a car, you use it, and you forecast how much you're gonna need, and then you return the car when you're done. But when you run this in the cloud, what essentially, we do is, well, I'll scale it up so I don't have issues, so I don't have to deal with it, because I don't really know how the downstream effect will be. So I'll make it bigger. Okay. I'll make it bigger. So the car's sitting in your driveway, empty, for a long time, not being utilized. You don't know how many you need in your fleet for your team. I just, I don't think it's effective for the financials, for the environment, for the climate, because you're using more resources than you need. I think fundamentally our mindset should be these resources while they are scalable and cloud native does a fantastic job going vertically and horizontally, it actually can allow us to be extremely inefficient and prevent us from becoming efficient. So, that is where I think your models and how it learns, how the behavior should be, it will react much more predictively than the horizontal pod autoscalers that exist in market today, which are highly ineffective from our vantage point, from a vendor vantage point, it's fantastic. You, you use more resources. So in this circumstance, I think this is more human element, environment element, business element. It's just all around fantastic.

Dheeraj Ravula        
- Awesome. Actually, I was about to go sell my seven-seater car that I only used 20 times a year with all seven people sitting in it. I think that's a pretty good way to compare with what's actually happening in the cloud, right? We are putting out resources that we barely use most of the time. And autoscaling, and meeting SLAs, is a real problem out there. And maybe just to put a little bit of real context to it, why is this important to your business, meeting an SLA, having enough uptime?

Shlomo Bielak        
- I mean, there's two sides to the importance, the importance of protecting the staff that make it happen, and the customer's expectations. We all are aware that we don't tolerate much anymore because life's so complicated and stressful. You really don't want things that you use on a regular basis to be inconsistent. It's agitating. You can't change that. That's the society today. The part of the society that's missing is protecting the resources too, which is an oversight. Just because something's seamless to the customer doesn't mean it's seamless to the people that are making it happen. It takes them away from being able to participate, one in the economy, and in their lives. Why it matters to me, and why I'm so focused on this is, I have had frontline members that this caused significant issues, either weight loss because we didn't have good methods, and they were under stress and didn't manage their own mental and physical health, or it can go dark enough where you don't have an ecosystem, and you don't have a caring leader, and you do something significantly bad to yourself. Which I'm aware of members that this has happened, and you know what I mean? You don't want an in honor or in memory of someone because we just didn't have the right viewpoint on technology, or viewpoint on how we manage this system. Seamless is not seamless to the people who make it happen. That is the part, the customer isn't the only vantage point that you should be looking at.

Dheeraj Ravula        
- That's actually a really important way to look at it. I will personally acknowledge, I haven't really looked at it from that angle, the human element of it. I mean, I've been kind of so focused on elements of running this more efficiently, wastage, right? I think we also actually invested in understanding what the consequence is of how we operate in the cloud for carbon footprint. But I think it's actually equally or maybe even more important, the aspect you bring in about the stress it brings to people, right? About dealing with so much data, and it's an increasing amount of data, an increasing amount of scale at which applications are being deployed. Raj, I was kind of curious how this translates, right? I mean, we are taking really what normal, intelligent SREs do to tune the system, and using AI, especially reinforcement learning, to kind of do the same thing to alleviate the stress, and the need to do repetitive tasks, right? Because that repetition needs to happen every time you put out a new release, every time you change infrastructure, every time any layer in the stack is changed, right? Because what's expected, like Shlomo, you talked about this, there's an expectation that everything will be up all the time. And that seems normal from an expectation perspective, but delivering it, on the other hand, can be very stressful, right? So, Raj, do you want to comment on how we are using reinforcement learning to-
- Absolutely.        
- Alleviate the human element, right?

Raj Nair        
- Yeah. So, at the outset, obviously, you know, what we are doing is sort of what we've trained this model to do. And that requires data. So it requires all of this application behavioral observation data that we are getting from sources like Datadog or Dynatrace or New Relic or whatever be the APM. But then we are combining that with infrastructure metrics, and then we are training, so it's a supervised training. We are training it, and we have designed a reward function, you know, for that bot, and then we train it within the environment. By the way, there are some assumptions. We are hoping that the data that we got is good because again, that's a key element for RL, or any AI for that matter. And that within that data, you have seen the spikes, and the model is able to learn how each of these microservices' pods operates within that environment we are in. Okay? In other words, how it scales up and down for different loads. And then from this data, this model is then able to be trained to follow a path where it will get the maximum reward. So the reward can be, again, whatever you choose it to be. I mean, it can be a specific SLO in terms of failure percentage, or an SLO in terms of latency, or it can be a combination. And that's the beauty of AI, that you can, you know, add whatever be the objectives you want. You then train it, and once that has converged, then that model can essentially, you know, do what a human would've done, but of course removing the stress. So the idea is that the human doesn't have to be right, but you want it to be like what a human would normally have done. It is mimicking that, right? It's trying to do that. Now, it's not like it's able to do root cause analysis and all of that. I mean, that's not what it's meant to do. 
And there are other tools for that kind of thing, but at least it'll relieve the stress of otherwise, you know, which you'll be subjecting a human to, and to juggle all these things without the proper tools. So we wanted to bring a tool to the market that'll actually help somebody, you know, that can help, in this case, the SREs, and it's an augmentation, think of it as augmentation for their, for human, you know? So to help them relieve the stress and do something easily. And that's really the goal. And I believe that should be the goal of all AI. It's not to replace, it'll never replace a human, but it'll help relieve their stress, and give them, you know, a much better way of dealing with the problems. And as you pointed out data, you know, a complicated and even more complicated landscape that we are seeing day by day, how to deal with that, you know? So that is really the main purpose of this. And then to make sure that this is useful for, you know, somebody like Shlomo, with his problem set that he's dealing with.
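
Editor's illustration: a reward that combines a failure-percentage SLO, a latency SLO, and resource use could look like the toy function below. It is a sketch under stated assumptions, not Avesha's actual reward function. The function name, the SLO targets, the weights, and the per-replica cost are all hypothetical values chosen for illustration.

```python
def reward(p99_latency_ms, error_rate, replicas,
           latency_slo_ms=200.0, error_slo=0.01, cost_per_replica=0.02):
    """Toy reward: penalize SLO violations heavily and resource use lightly.

    All default values are hypothetical. A penalty is zero while the metric
    is within its SLO and grows in proportion to how far it overshoots.
    """
    latency_penalty = max(0.0, p99_latency_ms / latency_slo_ms - 1.0)
    error_penalty = max(0.0, error_rate / error_slo - 1.0)
    cost_penalty = cost_per_replica * replicas
    return -(10.0 * latency_penalty + 10.0 * error_penalty + cost_penalty)

# Meeting both SLOs with fewer replicas scores higher than meeting them
# with more, and either scores higher than missing the latency SLO.
lean = reward(p99_latency_ms=150, error_rate=0.005, replicas=10)
padded = reward(p99_latency_ms=150, error_rate=0.005, replicas=20)
violating = reward(p99_latency_ms=400, error_rate=0.005, replicas=5)
print(lean, padded, violating)
```

With a shape like this, an agent that maximizes reward is pushed to meet the SLOs first and to trim replicas second, which matches the "combination of objectives" described above.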

Shlomo Bielak       
- I would interpret those SLOs. I hope you don't mind me interrupting a little bit. The SLOs are developer happiness, or SRE happiness. I know there's a business measurement, but I think with, like, DORA, and the ability to measure DevOps in organizations, they're actually heading towards burden and pain, and then correcting processes. I think you're natively doing that. So like, if I send chills when I say min and max replicas, and if that causes stress to you, you are running an organization right now that needs to work on how can I make that not a scary thing to change. I need that to be a machine that knows this is what the ratios need to be, here's where it's optimal for this amount of events, this amount of customers. And then from my side, my SLO as a technology leader is focused on the staff. Yes, the customers. But that's the byproduct. I'm also worried about the staff because they make it happen, they sustain it. Technology doesn't sustain our applications and the innovation. People do. So I like setting the SLOs as how impactful is it on the team? Like we have a long way to go, don't get me wrong. It's nice to be very positive when you're facing a challenge, but also to be realistic, and recognize where you are today. So it's a lot of work for a lot of organizations. I'm not aware of many that can scale and not fall over when they have Black Friday or something like that. It's very challenging. This is where it'd be very helpful.

Dheeraj Ravula       
- Yeah, actually I love the concept of SLOs as human outcomes as well, right? Not just business and technology outcomes, usually measured as uptime, error percentage rates. You're talking more about how often did you have to wake up an SRO, an SRE, right?

Shlomo Bielak       
- Devs too, don't get me wrong. Like organizations today will use any talent that can bring them back. Like when an emergency occurs, it doesn't matter who's on call, you will come together and make it happen. Now the draw on the team aggregately, is quite high when you do things like that. So there is a side to it, which is, when events aren't managed, where SMEs are assigned, it's ping pong, and you wake lots of people. So I think the issue is compounding itself where you have an issue, I have to get people to do things for us to scale at the same time when issues occur, I also don't always have the right person engaged. So you have to tackle both of these things if you really want to solve this issue, scale using technology that can predict, but also make sure that when issues do occur, regardless of technology, you get the right person engaged so the minimal amount of people are impacted by unexpected pressures.       

Dheeraj Ravula       
- Right. No, that's a very, very good point. Without using reinforcement learning or AI tools, what does that do to people, right? In terms of having to deliver an SLO or an SLA? Not from the human perspective, but from a business outcome perspective. I think people often talk about the compromises we are having to make, right? Without having tools that help you make those decisions quicker and faster, right? And having to deal with these data deluges.

Shlomo Bielak       
- I generally say we're conservative in nature when it comes to customer risk. You'll go big. So if you're in an issue and you want to recover, I'm not gonna go with what I think it's gonna work. I'm gonna go five times what I think is gonna work so I can recover and make sure I'm okay, because nobody wants the issue to continue. That's for two reasons. One, lack of confidence, and two, it sometimes makes sense when you have lack of confidence to go big, because it's not savings you're after, it's the customer experience. Like the customer comes first. And I think most people understand that circumstance, but it makes us react inefficiently. Sometimes overreacting causes a cascading event. Let's say I go big because I don't know, then a downstream issue, you hit connection limits, or you hit your project limits. Oh, I can't scale that big now the project's full, I can't scale more nodes. So you inadvertently cause more issues by trying to be conservative when you lack confidence. So I think as long as technology is giving you confidence as well as a response, it becomes a trust model there. And then you don't overreact and cause a cascading event. So this, this happens when the most experienced come in, you find new issues because of your response to be protective of your organization, and protective of your customers. It's an overreaction, which is justified, but it can cascade to more problems for yourself and the customer itself.
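
Editor's illustration: the cascade described above (going "five times" bigger and then hitting connection limits or project limits) can be checked for before scaling. The sketch below is only an illustration. The limit names, the numbers, and both helper functions are hypothetical, and real limits would come from your database and cloud project quotas.

```python
def max_safe_replicas(conns_per_replica, db_max_connections,
                      node_quota, pods_per_node):
    """Largest replica count that fits every downstream limit we know about."""
    by_connections = db_max_connections // conns_per_replica
    by_capacity = node_quota * pods_per_node
    return min(by_connections, by_capacity)

def clamp_scale_up(current, multiplier, **limits):
    """Return (replicas to request, whether the request was clamped)."""
    wanted = current * multiplier
    ceiling = max_safe_replicas(**limits)
    return min(wanted, ceiling), wanted > ceiling

# An incident response that goes "five times" from 20 replicas wants 100,
# but the hypothetical database only has room for 50 replicas' connections.
replicas, clamped = clamp_scale_up(
    current=20, multiplier=5,
    conns_per_replica=10, db_max_connections=500,
    node_quota=8, pods_per_node=10,
)
print(replicas, clamped)
```

Knowing the ceiling in advance is one way a tool can supply the confidence discussed here, so the protective overreaction does not open a second incident.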

Dheeraj Ravula       
- I'm smiling just because it kind of resonates. I've seen it happen more than once, where people either over provision, or also hit the problem you talk about, right? When you over-provision in one place, you open up another front. It's an inadvertent side effect, but it's also making this problem, the complexity more difficult to handle.

Shlomo Bielak       
- Just one last piece is for a person that needs to react and do this, life can be shit, to be very frank, and I've used that term before. And the context behind it is, technology should be facilitating tasks that we shouldn't need to do to deliver value. And the things you're working on does specifically that, they just, you have to understand that the hero's job, while it is respected and rewarded, which technically you should try not to do, because you're reinforcing behaviors that are hurting a person you care about. You have to realize that when you teach in the market, log in, run these commands, you're wrong. What's the process I need so that I don't need to do that is really what we should be doing. And using technology as similar to your approach actually facilitates, again, changing the market's approach. Teaching everybody to do everything themselves isn't what we need. They need to understand it, but they shouldn't be doing it themselves. They need to separate or create one degree of separation from delivering value, and their personal time, which should be somewhat, you know, a reward to doing good work.

Dheeraj Ravula       
- Yeah. It's almost like it's fear that's driving us instead of confidence, right? And I think there's an important element when we talk about Smart Scaler and reinforcement learning, which is how it also accelerates our features to market as a side effect, right? Because it makes you more confident, it makes you surer about the changes you'll deliver, because you know that instead of using that human capital to kind of debug issues, and kind of do the performance-tuning every time there's some change in any place, you kind of leverage RL to do that work. Raj, can you talk about how that's another aspect, right? How it accelerates feature development instead of people can-

Raj Nair       
- That's a very good point. I mean, there are different angles to that. One particularly is that, you know, if you were to do by hand what this is doing, you literally have to have a performance tuning team that is dedicated to looking at this in a test environment, running all these things, and doing this repeatedly. Like every time there's a change in the software, again, this whole team has to get together, and then start testing this, and then making sure that they know how it behaves. But actually what's even, you know, more difficult is that after they do all that, and let's say they learnt it, how do they implement it? And this is the thing where I think Shlomo also pointed out, like when an event happens, or when there is a need to deal with this, they have to keep changing all these things, you know? And it can be harrowing, you know, to do it in real time, hoping that you're doing the right thing, because it's a human element, right? You can make a mistake, and you know, there's so much anxiety also at that time. So the point is, you really need a tool that can do this for you. And one of the things that we also do as part of our validation, to build the trust because, you know, with AI, everybody has a fear that, you know, what if it goes off the rails and it's not doing right? So we have a feature where you can actually tell it to, you know, not do anything, but just show the recommendations that it's going to make. You feel comfortable like, okay, this seems to be working, you know, so let me go and implement it. So that trust is built. And then back to your point, Dheeraj, you know, when you go from one release to another, if things have changed, well, you know, you don't need to go and further, you know, retune it, because it actually knows how to cope with it, you know? So that's the other beauty of it. And there's always the element of retraining. 
But you know, when let's say you reach a certain threshold, it'll let you know that, look, I am ready to get retuned, retrained. You hit a button and it goes off and does the retraining, and then you have the new model. So it's an iterative thing, which is great. And that's part of what AI also brings to the equation, right? The ability to continuously adapt.
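
Editor's illustration: a recommend-only mode like the one described above can be sketched as a reconcile step that reports proposed changes and applies them only when asked. This is a hypothetical sketch, not the product's interface. `Recommendation`, `reconcile`, and the `scale_fn` callback are names invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    service: str
    current: int
    proposed: int

def reconcile(current, proposed, scale_fn, recommend_only=True):
    """List the replica changes a model proposes; apply them only if allowed.

    `scale_fn(service, replicas)` is a hypothetical callback that would
    perform the real scaling. It is never called in recommend-only mode.
    """
    recs = [
        Recommendation(svc, current.get(svc, 0), n)
        for svc, n in sorted(proposed.items())
        if current.get(svc, 0) != n
    ]
    if not recommend_only:
        for rec in recs:
            scale_fn(rec.service, rec.proposed)
    return recs

applied = []
current = {"frontend": 4, "orders": 2}
proposed = {"frontend": 6, "orders": 2}
recs = reconcile(current, proposed,
                 scale_fn=lambda svc, n: applied.append((svc, n)))
print(recs, applied)
```

Running in recommend-only mode first lets a team compare the proposals with what they would have done by hand before turning on `recommend_only=False`.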

Shlomo Bielak       
- One second, I want to make a contradiction, not in the sense of not correct, but for performance and new releases, I generally follow the principle of know when you degrade, know when you break. So I would turn it off, I would use it to match prod. The challenge most teams have is that environments drift because they persist, which they shouldn't. We don't need to get into that today, but because they persist, they drift. So I would say, I'm ready to do a performance test. Hey, AI, what's in prod? Match it to non-prod. Okay, now let's slam that system with that new release and run the performance test until I degrade and the APM tells me, okay, you're degraded. And then on top of that you have the second part: instead of just degrading, I want to crush each pod distinctly. So this service crashes at this amount of users, that service crashes at that amount of users. I don't really want to teach it anything about that. I just want to know the limits so I can be aware of them. It'll obviously scale in prod, but those limits are hard limits of the app. It's not the scaling, it's that the app has an architectural limit. You're doing your best until the architecture fails. Architectures will fail, there will be limitations to the cloud provider, the SaaS services, everything that you're using has a limit. You'll do the best until you hit those limits. But finding out those limits is also really important, so you can alert on them. Avesha can't help you in those scenarios. You're going to hit an architectural limit. Don't hit it, alert before it with an SLO for that, and then re-architect your app for that scale, and innovation will scale you again.

Raj Nair       
- Absolutely. So that's part of the process we have. We have this thing called the pod capacity estimator, and it's specifically built to do that. That is, to tell you what the limits are for each of these microservices. So, fantastic. But again, understanding why you hit that limit is beyond our capacity. I mean, that is the job of the architect to actually figure it out, you know.
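
Editor's illustration: "know when you degrade" can be read off load-test results with a few lines of code. This sketch is not the pod capacity estimator mentioned above. The `degradation_point` helper, the sample data, and the 200 ms SLO are hypothetical.

```python
def degradation_point(samples, latency_slo_ms):
    """Return the highest tested load that still met the latency SLO.

    `samples` is a list of (concurrent_users, p99_latency_ms) pairs from a
    load test against one service. Returns None if even the lowest load
    violated the SLO.
    """
    last_ok = None
    for load, p99 in sorted(samples):
        if p99 > latency_slo_ms:
            return last_ok
        last_ok = load
    return last_ok

# Hypothetical load-test results for a single pod.
samples = [(100, 80), (200, 95), (400, 140), (800, 320), (1600, 2500)]
limit = degradation_point(samples, latency_slo_ms=200)
print(limit)
```

A number like this gives you something concrete to alert on before the architectural limit is reached.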

Shlomo Bielak       
- And you would see that within your APM as well, where you just, you're bounded by something, either connection limit, instance limit, whatever it is that you have, you'll see that occur.

Dheeraj Ravula       
- Yeah, right. Maybe just digging a little deeper into this one. I love how you explain how Avesha does its function. So when we use Smart Scaler, right, we are using reinforcement learning, which is kind of a shift in paradigm, right? Typically, at least from an engineer's perspective, the pattern is you kind of test, and then tweak the system with an expected SLA in your mind, right? I want to have this much uptime, this much response time, and so many errors per second. And then typically that involves doing a lot of, like, testing, right? You put up scenarios, you kind of run a suite of tests, and then start tweaking them. And then as you tweak, you need to keep running those tests, right? It's a laborious task. But when you use reinforcement learning to kind of do the same thing, and learn the model, it's not only doing that, but it also understands the traffic that's coming for this specific service application, or that whole service map, and it's also understanding how that application behaves as the load changes, and it's understanding the pattern of traffic coming in. So I was having this conversation with a customer yesterday, like, hey, will you be able to tell if I'm having a denial of service attack? Or is this normal traffic, right? This is an interesting topic where I think reinforcement learning can also kind of do some of the things that humans can do, but you have to be almost like 24/7 looking for it, right? So do you want to talk a little bit about when things start deviating from what it's learned?

Raj Nair       
- Yeah, so a couple of things. I mean, that is an interesting comment about when we detect that you are actually operating in a zone that's beyond what it originally learned. We certainly can detect it. At the moment, you know, we are not doing anything with that other than reporting it or logging it. The reason being that what we've found is, you know, regular HPA is not that efficient at all. So, with our model, even when you're off what you trained it with, it still performs way better than HPA, and that is what we have seen in practice, mostly because it knows the dependencies, just like Shlomo had pointed out, you know, what he calls the ratios of how much to scale. It inherently knows that. That's part of what it learned from the data that it saw. So it can do a much better job of scaling than what HPA can do, because HPA is operating in a dumb way, you know, it's just a hard threshold and it doesn't know what to do beyond that. But back to your point about the denial of service. I mean, that's a really interesting angle where, you know, down the road we can, I mean we are just scratching the surface with this reinforcement learning approach, but I think there's a lot more that can be done and will be done, you know, in this area in the future.
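
Editor's illustration: the "hard threshold" behavior of the Kubernetes Horizontal Pod Autoscaler comes from its documented rule, desired replicas = ceil(current replicas × current metric ÷ target metric), applied to each workload on its own. The sketch below is a simplified version of that rule. It omits min and max replica bounds, stabilization windows, and the handling of unready pods, and the function name is ours.

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.1):
    """Simplified form of the documented HPA scaling rule.

    If the metric is within `tolerance` of the target, nothing changes.
    Otherwise replicas scale in proportion to how far the metric is from
    the target. Each workload is evaluated in isolation.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# CPU at 90% against a 60% target scales 4 replicas up to 6.
scaled_up = hpa_desired_replicas(4, current_metric=90, target_metric=60)
# CPU at 63% is within the tolerance band, so nothing changes.
unchanged = hpa_desired_replicas(4, current_metric=63, target_metric=60)
print(scaled_up, unchanged)
```

Note that nothing in this rule looks at the services a workload depends on, which is the gap the "ratios" discussion above is pointing at.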

Dheeraj Ravula       
- Shlomo, just back to how painful is the decision about cost versus availability?

Shlomo Bielak       
- I mean, most organizations that are using Kubernetes have already recognized that technology's the core to delivering value to customers. So they know the spend is occurring and they're doing it, right? So I think organizations, when economies change, will focus on efficiencies as needed, and in growth periods adopt as much technology as possible to become more resilient when new business pressures come. It's just a logical path to follow. And I think most organizations do it. We're trying to target the best customer or patron experience, period. Do we want to do efficiencies? Sure. Are we doing efficiencies? Yes. The challenge is the seamlessness that we're doing also needs to advance more and more towards helping our resources maintain the best life they can possibly have. Meaning fewer and fewer incidents, less and less work that is menial, fewer and fewer repetitive tasks. Nobody wants to be hoodwinked into doing something that is just meaningless for a talent. So the Terrible Trivium, I don't know if everybody's seen that, but if you look up the Terrible Trivium, his superpower is to convince people to do repetitive tasks that have no value. I'm not sure it's a superpower, it's pretty much evil, but we have that today. We have that mentality today that I maintain my value by doing more and more work. No, you maintain your value by being creative and innovative, and organizations gain value when it persists without you. None of us persist. So why are we following a model where everything's tied to the person who isn't going to persist? And the way we operate, you're gonna persist for very short periods of time because you're burning yourself out, and the market is utilizing your talent. So try to become much more resilient within that mindset. But yeah, the Terrible Trivium actually does happen today. It's our ego. Our ego convinces us that we need to do this to be valued, but it's just tricking us like the Terrible Trivium.

Dheeraj Ravula       
- I'm trying to visualize a cage that I build for myself, right? The cage called, I'm so good at it, I'm gonna continue to do it. And it kind of binds you down instead of freeing you up to do more meaningful, more impactful things. Do you see this problem only in the space you are operating in, or does it apply to other verticals? Is it a problem that's germane to other industries as well?

Shlomo Bielak       
- People who know me know I'm quite frank, and honest, and transparent. In my previous role I've worked for and with many of the top organizations in North America. That's nothing to do with me; I either provided a service, or I chose them because I thought it would be a good career choice, and it was. I can count on my hands how many organizations are operating in a way that protects their talent pool. So I think the way it's operating today has been exacerbated by cloud, because we had constraints before. The constraints were: this is what I bought, and I have lead time to buy it, so make it work within here. And then build more resilience, because there's no easy solution out. So a principle I always follow where I work is, I'm going to create fake constraints, pretend constraints, and you need to live within those, so that we change our practices even though the pressures don't exist. Many organizations, even the best, have the same problems, and I mean the best; I'm not gonna list them off. I mean the top tech that you think is phenomenal has the same issues with their talent pool, the same silos that occur, the same challenges for the people who are on the front lines. What I find really ironic, and I'm not trying to diminish one or the other, is that frontline workers in healthcare understood burnout. Technology, who talks about it? Technology burnout? The people who are making all the apps you use to run your day and just make things happen, bring food to your table, essentially, are frontline workers, too. They may be a dev, they may be an SRE, they may be an operations person, they may be app ops, whatever you want to call them. They are frontline and they take a lot of heat, and technology should be catering to them just as we do for other types of burnout.

Dheeraj Ravula       
- Yeah, we expect our banking systems to work all the time. We expect our medical services to work all the time. There's technology behind every one of them, and there are engineers and SREs and DevOps people making sure that happens. But you're right, nobody really talks about that kind of burnout, which is, I guess, hidden away, right? There's no face that the public looks at when it comes to that stuff.

Raj Nair       
- Oh, I think we have a question.

Dheeraj Ravula       
- Of course. Oh, this is an excellent question, actually. How much data does AI need to do the training?

Raj Nair       
- Yes. So, I'll answer that. It depends, essentially, on how much variability you see in your traffic patterns. Typically what we have seen is that a week's worth of data is enough to get the model trained, and then you can go from there. As it needs retraining over time, you continuously keep getting more data along the way, and that is enough to keep it going. So it's just a simple matter of you deciding what variability and pattern you're trying to learn.

Shlomo Bielak       
- I'll comment on that as well. We see that within APM learning too, and the intent is to make sure you have enough data to cover your repeat patterns, so you know what that variability is. So for example, if you're indexing logs for seven days, but you have an event that happens on day nine, you need to go and collect more logs, because it won't be able to capture that anomaly. I would think the same would apply to any machine learning. You need to make sure your dataset is long enough to cover a time span in which the pattern is consistent. So a week could be nine days; you'd figure that out based on knowing what your traffic patterns are today. You could look at Cloudflare, CloudFront, things like that, to see if you have any common distribution that occurs. And then as long as you capture that for Avesha, I think it would do a fantastic job.
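Editor's note: the point about collecting enough data to cover the repeat pattern can be illustrated with a small Python sketch. This is not how Avesha or any APM product does it; it is a minimal illustration that assumes hourly request counts and uses a plain autocorrelation to guess the cycle length. The function names (`autocorrelation`, `dominant_period`, `window_covers_pattern`) and the synthetic traffic are hypothetical.

```python
import math

def autocorrelation(series, lag):
    """Autocorrelation of the series with itself shifted by `lag` samples."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

def dominant_period(series, min_lag, max_lag):
    """Lag in [min_lag, max_lag] where the series best matches itself."""
    max_lag = min(max_lag, len(series) - 1)
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(series, lag))

def window_covers_pattern(num_samples, period, cycles=1):
    """True if the collected window spans at least `cycles` full periods."""
    return num_samples >= cycles * period

# Synthetic example: 14 days of hourly traffic with a daily cycle.
hourly = [100 + 50 * math.sin(2 * math.pi * h / 24) for h in range(24 * 14)]

# Skip very short lags, where neighbouring hours trivially resemble each other.
period = dominant_period(hourly, min_lag=12, max_lag=24 * 7)
print("estimated period (hours):", period)

# Seven days of data cannot capture something that repeats every nine days.
print(window_covers_pattern(num_samples=7 * 24, period=9 * 24))
```

On real traffic the cycle is noisier, so the estimate is only a starting point for deciding how long a window to collect.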

Dheeraj Ravula       
- Yeah, perfect. That's kind of how we look at it when customers ask that question: what does your pattern look like? Right? A lot of customers actually have a daily pattern, where the traffic peaks at lunchtime, or on the way home people might call in for the service, or maybe right after they have dinner, right? So there are a lot of customers that have a daily pattern, but then there are other customers that have a weekly one. It's important for us to have at least one pattern set, right? So it could be daily, weekly, in some instances monthly. And that's sufficient, right, Raj, to do the training?

Raj Nair       
- Yeah. And then there's a, I think a couple of other questions.

Shlomo Bielak       
- So there's an AIOps question: is this an AI or AIOps remediation tool? I was looking into Avesha in the past as well, and I follow the practice that remediation, from an AI perspective, means the AIOps did the wrong job. You shouldn't have had that incident in the first place. If you're predictive towards the issue, you would've scaled correctly and prevented the issue. So, proactive. Is it a proactive AIOps tool? Yes. Remediation, you can speak to that, Raj, but I kind of feel like you're a proactive AIOps.

Raj Nair       
- It's predictive, yeah, it's proactive or predictive intelligent remediation. You can think of it that way. And that's exactly right. And I think there was another question about what we observe from application behavior. So essentially, we are looking at anything that is important for this model to learn from. So whether it's failure rates, whether it's latency, and obviously requests per second as the load. These are all things that we look at from the APM tools, and the errors that you're seeing, right? The HTTP return codes, all these matter, in addition to, of course, the infrastructure metrics. We are taking all of these into account, and then we are looking at the number of pods that were scaled, and it is all per microservice within the environment. And then we apply filtering to take into account the service graph, and all those things. So it's a series of filters and training that we have to do before we can build these models. And all of this is automated, of course; it's a SaaS. This data is sent up to our SaaS, where it is cleaned and used to train the models, and then the results are sent back down. It actually uses KEDA as the tool that does the scaling, and it informs KEDA what it has to scale to. So it's a very simple SaaS setup, and it's very easy to do.

Shlomo Bielak       
- I think the service map is critical. I don't think people realize how difficult it is to deal with a service that fails first. I always find that the incident is one service failing first based on load. Others will fail too when you scale up the one that failed first. Without that service map, it's extremely difficult to manage inbound load, because teams have learned over time that if I scale this one, I know this one's gonna go next. And then you have all of these sync things within Argo CD or whatever you're using, where you're trying to make sure you get all of them at the same time, so you don't cause another service to be the next point of failure based on load. So I think that service map is very critical to help customers manage this. And I think that value in itself, irrespective of all the others, makes things a lot, lot easier to manage when it comes to which one's gonna fail next.
- Yeah.
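Editor's note: the idea that scaling one service shifts the failure to the next one can be sketched in Python by pushing load through a service map and sizing every service together. This is an illustration under stated assumptions, not Avesha's implementation: the service names, call ratios, and per-pod capacities are made up, and the service map is assumed to have no cycles.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def propagate_load(entry_rps, call_ratios):
    """Push request rates through the service map.

    entry_rps:   {service: external requests per second}
    call_ratios: {upstream: {downstream: calls made per upstream request}}
    """
    # Map each service to the set of services that call it, for ordering.
    callers = {}
    for upstream, downstreams in call_ratios.items():
        callers.setdefault(upstream, set())
        for downstream in downstreams:
            callers.setdefault(downstream, set()).add(upstream)
    for service in entry_rps:
        callers.setdefault(service, set())

    load = {service: float(entry_rps.get(service, 0)) for service in callers}
    # Visit callers before callees so each load is final before it is passed on.
    for service in TopologicalSorter(callers).static_order():
        for downstream, ratio in call_ratios.get(service, {}).items():
            load[downstream] += load[service] * ratio
    return load

def replicas_needed(load, capacity_per_pod):
    """Pods required per service, given the requests per second one pod handles."""
    return {s: max(1, math.ceil(rps / capacity_per_pod[s])) for s, rps in load.items()}

# Hypothetical service map and capacities.
call_ratios = {
    "frontend": {"cart": 0.5, "catalog": 2.0},
    "cart": {"db": 1.0},
    "catalog": {"db": 0.5},
}
capacity = {"frontend": 200, "cart": 100, "catalog": 250, "db": 500}

load = propagate_load({"frontend": 1000}, call_ratios)
plan = replicas_needed(load, capacity)
print(plan)
```

Sizing from the propagated load, rather than from the entry point alone, is what keeps `db` from becoming the next point of failure when `frontend` is scaled up.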

Dheeraj Ravula       
- Yeah, that's an excellent point. Especially in the context of microservices, right? Because there are a lot of services that you're managing, and they have interdependencies, and scaling one doesn't necessarily mean you've solved-
- Yeah.

Raj Nair       
- I wonder, oh, sorry, go ahead.     
- No, no, no, go ahead.

Shlomo Bielak       
- I was gonna say, I wonder if Best Buy is watching, they should reach out.     
- I hope they-     
- I won't comment why, but I know they should reach out.     
- So, there's another question that came in. Is there a reliability breaking load to worry about?     
- I think Raj, you spoke to it where you had those pod sizing service.     
- Yeah.

Shlomo Bielak       
- I think that would be in reference to: at this point, this is what you should be at, and this is where you'll break. Obviously, it's gonna try to react. And the way I interpret that question is, what happens if it continues to react to keep things safe beyond the thresholds the system can manage, meaning you'll actually hit your architectural limit. I think that will happen, but you can speak to it. I don't know how you would prevent this. You're obviously trying to stay up, but if it goes beyond the physical limitations, then you might actually go into that scenario unless you hard code something to prevent it.

Raj Nair       
- Yeah, I mean, obviously, if it's completely broken beyond a certain point, then you are not going to be able to achieve whatever SLO and-     
- We have mechanisms to send alerts, through Slack and all the other mechanisms that people normally expect, to let them know that they literally hit a limit. And this you will know even beforehand. You'll know it as a report from us: these are the limits of your current microservices, and we are trying to go beyond this. I think Dheeraj has touched on that in the past, in one of the questions. You could take note of that from the report, and you can actually do something about it, even ahead of time. Like when you look at that and say, oh God, this is bottlenecked, there's nothing we can do beyond this level of loading, because there's an architectural limit, and-     
- I need to go back to my engineers and tell them, look, we gotta fix this. So that is actually a very useful output for you guys, I would imagine, Shlomo, right?
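Editor's note: the behaviour discussed here, hitting a hard limit and telling people about it instead of scaling past it, can be sketched in a few lines of Python. The names (`ScaleDecision`, `clamp_to_limit`) and the numbers are hypothetical, and the alert is just a string; delivering it through Slack or any other channel is left out.

```python
from dataclasses import dataclass

@dataclass
class ScaleDecision:
    replicas: int    # replica count to actually apply
    limit_hit: bool  # True when the request exceeded the known limit
    message: str     # human-readable alert text, empty when no limit was hit

def clamp_to_limit(service: str, desired: int, hard_limit: int) -> ScaleDecision:
    """Cap a scaling request at the architectural limit and flag the overflow."""
    if desired <= hard_limit:
        return ScaleDecision(desired, False, "")
    message = (
        f"{service}: wanted {desired} replicas, "
        f"capped at architectural limit of {hard_limit}"
    )
    return ScaleDecision(hard_limit, True, message)

# Hypothetical example: the model asks for more than the service can use.
decision = clamp_to_limit("checkout", desired=64, hard_limit=40)
if decision.limit_hit:
    print("ALERT:", decision.message)
```

The useful output is the flag and the message as much as the capped number: they are what an engineering team would take into planning.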

Shlomo Bielak       
- I mean, from our side, bad is bad. You need to know what's bad. So if the issue I'm dealing with is finding my architectural limit so that I can redesign and scale to more customers than imagined, I'd love that. That's much better than asking how do I manage basic load when it changes and fluctuates, and be efficient with it. I would prefer to get to the point where customers are loading our system 200 per-, look, theScore's growing fast. We are an organization that's a hundred percent plus per year in growth. And in that context, that's great. Bring it, and we will change our architecture, and know where our architectural limits are, because we're growing, and growth is a good thing. Growth affecting you on a daily basis is silliness. Like, that's just silliness. You should not be affecting your teams so frequently. But when you hit your architectural limits, okay, I learned my lesson, guys; let's take that Jira or whatever ticket you're using to track your needs for planning, and build that product priority. I have a physical limit on my application, irrespective of everything else. Let's fix that, versus: what's wrong, I don't know which service is failing because of the other, and the ratios are wrong. It's removing all the confounding variables and saying your architecture needs to be fixed.     
- I have no problem with failure when it comes to limitations of architecture. We will fix the architecture, then.

Raj Nair       
- No, no, I was just saying that in the past Shlomo had said there are some events that cause spikes that can go way beyond what is expected. So you need to know, right, that beyond this point I have a problem. This thing will not scale, because there's a hard bottleneck. And so then, you know.

Dheeraj Ravula       
- You see it as an additional feature set, right? That reinforcement learning is not only helping you do the scaling in a very efficient way, but it's also highlighting where to focus as the load grows, right? It's pointing to places that might need a re-architecture. It can't solve the architecture problem, but it can definitely put a spotlight on it, so the teams can focus on solving the problem in the next release.

Shlomo Bielak       
- Insights. Insights are always good, regardless of them being bad. Like, the nature of it could be bad: you could have a failure, you may not recover, it may take time to come back, but that insight's key. It's the confounding issues that ruin your ability to adapt. Having this say, we scaled it how it's supposed to scale, this is how it's supposed to be optimized, we hit an upper limit, this is it, this is where your problem is; that's very helpful to remove the inability of teams to figure out where the root cause is. I think this actually helps predetermine it.

Dheeraj Ravula       
- Kind of takes the noise out and helps you focus on the areas that need focus. Just to follow up on how much data do I need, here's a question for both of you. The question is, how easy should it be to try Smart Scaler on an application that you already have running in production? And the counter question to Raj is, how easy is it? So, Shlomo, to you, like what's your-

Shlomo Bielak       
- Yeah, so generally, if you have a Kubernetes cluster, be it EKS, AKS, GKE, Anthos, whatever it is, you'll run on any cloud native platform. The approach is: try it and see. Does it optimize your clusters? Does it give you insights that you currently don't have? I mean, a deployment in Kubernetes, once you're using it frequently, is quite fast and simple, right? It's not like you need a VM to come up and wait for an OS to boot. The nodes are running, you deploy the image into the container platform, and you're running. So I see it as an easy task. What you find may not be the easy part, which is articulating to the business why it's running so much smaller than before. Why is your operational budget dropping? Why were we so off on our ratios? Why were we not reacting correctly before? I think those are all self-awareness and self-reflection that will follow after trying it out.

Dheeraj Ravula       
- Cool. And Raj, a question for you: how easy is it for a customer, a new company that has a Kubernetes workload, to try Smart Scaler?

Raj Nair       
- Yeah, I mean, logistically, it's very, very simple. Essentially, it's a call out from your cluster to our SaaS, sending in the data it needs, and then getting back the recommendations. But then again, the data should be good. And if you don't really have a need for this, like, let's say you have very low loading and it's not really a problem for you, this is not the right tool. I mean, it's just a waste of time. But if you have a situation where you have a varying load, you really don't know how to deal with it, and it's a complicated matter, you really have to use such tools to help deal with it, rather than just running around with this data when you don't know what to do with it. That's not a great situation. So I think that's all it is. And as I think Shlomo has already said, there are a lot of organizations in different verticals, I'm sure in his own vertical, in the finance vertical we've seen, and then also in the retail space. Anything that has a B2C component, if it's a hot property, you're gonna see this unpredictable loading, and you don't know what to do about it. And that's the situation where this would work the best. In fact, what we've seen is the heavier the loading and the more unpredictable, the more useful this tool is. And from a cost standpoint, we've seen a lot of savings, because people just overprovision and think that the problem goes away, but it doesn't. Even there, we've been able to reduce a lot, by almost 70% in some cases, and sometimes 40%. But the biggest benefit is really getting a little bit of control over the situation, and giving your guys a more balanced and mature way of dealing with this sort of complexity.

Shlomo Bielak       
- I think any organization that's affected by events or news would benefit, because those cause unpredictable scenarios. So for example, a player is traded, Kevin Durant is traded, or Kobe Bryant is traded. That can cause a huge fluctuation because it's chatter. People want to know, people want to get involved, they want to be a part of that community. And that community can be huge. It can be millions of people. So I think you could also distinguish the verticals based on whether you're affected from outside by environmental changes such as news, or events that are known to the community. If so, it would be advantageous for you to use a service like this, because unpredictable becomes predictable in your circumstance, and it's trained, so it's advantageous. And I think those are the ones most in need, because we keep perceiving unpredictable as, oh, I didn't know that would happen. That's not true at all. I didn't know it was gonna be so big, but if I had a learning model to handle my scaling properly, even though it went big, it managed it, because it knows how to handle big. All I'm suggesting is that the unplanned events become a natural response, and unplanned becomes managed, which organizations should be striving for. Because business pressure is not usually in your control. It's usually unplanned. And you should be resilient to it.

Dheeraj Ravula       
- So, that kind of touches on the topic of outliers, right? How outliers kind of decide our behavior when it comes to autoscaling or provisioning. Raj, maybe you want to touch on how outliers are dealt with in reinforcement learning models.

Raj Nair       
- Well, I think what you mean by outliers is things in your data set that are really out of the ordinary. So obviously, when we take in the data set for the training, we do a process of cleaning and all that, and there's a set of well-established AI procedures for dealing with that. At the end of the day, there's no magic to this. It's just a statistical problem, right? We are trying to use different methods to accurately describe what we are seeing statistically. And so outliers are an issue that we have to deal with. Sometimes, as in the case of latency, there's a P95, for example, that you can use to get to the range of latency that you care about. In other words, there are established procedures for handling this. So we do that, and that's how we get a good model. And then of course, during inferencing, if you see something strange, again, it depends on how you want to deal with it. If it is legitimate traffic, and you want to actually scale to that, as the case might be, then you would respond to it, because that's what you want to do. But I think it varies across organizations, and so we-
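Editor's note: the P95 mentioned here can be shown with a short Python sketch. It uses the nearest-rank definition of a percentile, which is one common convention among several, and the latency samples are invented; nothing here is taken from Avesha's data cleaning.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 98 ordinary requests and 2 extreme ones.
latencies_ms = [100 + i for i in range(98)] + [5000, 9000]

p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
print("p95:", p95, "mean:", mean, "max:", max(latencies_ms))
```

The two extreme samples pull the mean and the maximum far from typical behaviour, while the P95 stays within the range most requests actually see, which is why it is a convenient summary when outliers are present.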

Dheeraj Ravula       
- Gotcha. So just final two questions. Shlomo, I mean, you see this as an industry-wide problem in terms of provisioning for autoscaling, and I have a really important question for Raj after that one.

Shlomo Bielak       
- Oh, sorry, I didn't hear a question. I do think it's a big problem for most industries. Not so much over-provisioning; I think it's the weight that teams bear to make things seamless for customers, and the overlooking of their needs is my driver. We can't overlook talent. We need to protect them. And whatever's available in the market, you should work with those organizations to help your organization, which is why we've made that connection. I recognize the value within the technology and how it could help the team I care about, or the teams I care about. So it was a natural partnering.    
- Question for Raj. What is on the board behind you? I'm suspecting it must be your next invention.

Raj Nair       
- Actually I'm not in my room, I'm in my colleague's room, and I can only think that maybe it's something that he is thinking about. So it's something else, unrelated to this topic.

Dheeraj Ravula       
- Awesome. Any takeaway thoughts from you Raj, before we-

Raj Nair       
- No, I think it's been awesome. Thanks for sharing your insights, Shlomo. I mean, we believe we are on a great journey to help the teams, your teams and others, bring some sanity to the issues they're dealing with, using AI and some of the new techniques, to help ease their workloads and make it a more pleasant experience for them. And we are very happy to be part of that, and to partner with companies like yours to make this real.

Dheeraj Ravula       
- Thanks. Awesome. This was an exceptionally insightful session. Thank you, Shlomo. Thank you, Raj. And thank you everyone for joining and listening to us. Please reach out to us if you have similar challenges that AI can help with, and we can help alleviate some of the problems that people face in delivering SLOs in the real world.

Raj Nair       
- And there's a QR code if you want to get in touch with us, so feel free to go scan that.

Shlomo Bielak       
- Thank you very much for having me. Thank you, guys.

Raj Nair       
- Thank you. Thank you everyone.    
- Take care.

Copyright © Avesha 2025. All rights reserved.
