Avesha Blogs

18 October, 2023,

15 min read

Video link: https://www.youtube.com/watch?v=O7qSHSQWayg

Kunal Kushwaha: 
Thanks a lot for joining. I really appreciate it, and I'm really excited for today's session. We have Prasad here with us, and Matt. Prasad, Matt, how are you doing? Let's start with Prasad. How are you?

Prasad Dorbala:   
Yeah, doing good, thanks. Thanks for organizing this. We're all excited to talk about what we are bringing and our experiences in disaster recovery, so it's interesting.

Kunal Kushwaha: 
Nice, looking forward to chatting with you. How's it going, Matt?

Matt LeBlanc:   
I'm great. Looking forward to having this conversation; it's been a big part of my career in the last half decade or so.

Kunal Kushwaha:  
For those of you who don't know these two folks, they have an extensive amount of experience managing data, especially when working with K8s. So today we're going to talk about Kubernetes disaster recovery, KubeSlice and some amazing projects, and preventing major data losses when working with large amounts of data. You can also share your questions in the chat. I see close to 200 people are already live, which is good, so I'm expecting nice questions. If you're not live, this is being recorded, so you can watch it again as many times as you like. I'll leave all the links in the description as well, so you can join the Slack channels and connect with the guests. So let's keep the momentum going; keep asking your questions and we're happy to answer those live. And yes, this is live. Someone is saying it's not live, but it's live right now. Alright, I appreciate it, thank you. Before we get started, let's do some introductions. Matt, would you like to introduce yourself, and then Prasad, you can do the same, and then you can share your screen, as I believe you have a presentation.

Matt LeBlanc:   
Sure. Matt LeBlanc, I'm a senior engineer here at Avesha. I live in the Boston area, where I've spent a large part of my career working in data center technology, from storage at EMC to, more recently, Kubernetes data protection over the last three-plus years. Most recently I moved over to the Avesha team.

Prasad Dorbala:   
Thanks, Matt. Hi everybody, my name is Prasad Dorbala. I'm a co-founder and chief product officer at Avesha. My experience has been operating large-scale infrastructures, be it for a SaaS provider or managed services for disaster recovery and others. As we see in the industry trend, data on Kubernetes is increasing; it is growing about 40 percent year over year, so a lot of people are bringing data onto Kubernetes. Previously it used to be more stateless workloads, but now stateful workloads are coming over, and one of the key factors is how you protect the infrastructure. There will be lots and lots of failures for various reasons, not necessarily internal but also external, so how do you make sure your business is up and running? Any loss of availability of infrastructure or services would impact top-line revenue from an enterprise standpoint; not only that, it would also impact customer experience. My CEO used to say, "Don't let me end up on the front page of any newspaper because we lost service." That's the reason why it is very important to understand the behaviors with respect to recovery, as well as how you protect your infrastructure. That's what we're going to talk about.

Kunal Kushwaha:   
Amazing. Well, thanks for sharing. Matt and Prasad, I believe you can share your screen.

Prasad Dorbala:   
I'm going to start sharing the screen. Let me pull this up, and I'll let Matt start.

Matt LeBlanc:   
Sure. So today we're going to talk about zero RTO. What is an RTO? It's really about time. There are two different objectives: the recovery point objective (RPO) and the recovery time objective (RTO). Your recovery point objective is the point of your last backup, and your recovery time objective is how long you can tolerate it taking to get back up. So they're really two different measures: the first is when your last backup was, and the second is how long it takes to get your application back on its feet.
So why are we talking about this? There are a few different reasons why we need to make sure we have data protection. First of all, there is customer experience: if your application is down, your customers might not have a positive user experience. For example, this is a recent story: a large American auto manufacturer has an application that lets you start your car, turn it off, lock it, that type of thing, and apparently their application goes down on a regular basis and they need to get back to a last good point. So it does impact customer experience. As an owner of one of those vehicles, I have experienced what happens when it doesn't work, it's wintertime, and I can't start my car.
Another user experience which maybe some of you can relate to: a few months ago there was a problem with the traffic control system, which is ancient, and that's another story. The traffic control system was down, and I believe the reason was a bad restore from a backup. That comes down to another point: make sure you practice your backups, and make sure you know you're able to get back on your feet. And obviously there's also the idea that you could lose data if you had a bad actor.
I'll give you another example of that one. A few years ago there was an employee of Twitter who was quitting, and just before he left he went and deleted the @realDonaldTrump account. Obviously a backup allowed them to get back up on their feet. So there's user experience, there are bad actors, and then there's what I'm going to talk about right now, which is compliance. We also need data protection for regulation purposes, and SOC 2 is a really good example. There are five basic criteria here (you can hit the next animation there). Security: making sure that data is behind a firewall, you have intrusion protection, and it is secure. Confidentiality: making sure it is encrypted, with proper access controls and firewalls. Availability: that comes down to making sure you have a DR capability, the ability to go back in time to recover your application. As well as processing integrity and privacy: making sure all that data is secure and available. So that is the reason why you need it. And for SOC 2, if you are not in compliance and you are audited, as Prasad was talking about earlier, it also means that in addition to revenue loss or bad customer experience from being down, regulators may also impose fines on you. So not only are you losing users and customers, you are losing your reputation, potentially losing revenue, and now you're going to incur fines. So there are a lot of good reasons for data protection. Next slide.

Prasad Dorbala:    
Just going to add here: for most SaaS providers, when you want to impress upon customers that you are a SaaS solution, the very first thing enterprises ask is, "Show me your service organization control." What controls do you have in place? As Matt described: what are your access privileges, who among your employees is logging in and who is logging out, what is the audit process? Not only that, but confidentiality, integrity, and availability. These are the auditable items, which are audited by a third party, the auditor, who gives you an attestation that you actually have controls in place, that you're able to mitigate the risks and remain available. That is the certificate every enterprise asks for before they sign up for a service. So it is very important for the internal engineering team to be geared up, whenever we put up an infrastructure, to be SOC 2 compliant. Matt, go ahead.

Kunal Kushwaha:    
There's a comment asking if you can explain RPO and RTO with an example.

Matt LeBlanc:    
Okay, recovery point objective and recovery time objective. Can you go back to that other slide? Thank you. So our recovery point objective is how far back you need to go. Say you need to go back 15 minutes to an hour; that will be your last known good point. But just because you're able to go back 15 minutes doesn't mean you're going to recover immediately; it's going to take some time to recover. If you're moving data to recover your PVC, the chunk of disk your application is using, it will take time to move that data, and it may take time to redeploy that application on the other side. So your recovery point will be your last known good backup, let's say 11 a.m., but your recovery time may be much later than that. You did your backup at 11 o'clock, but it takes time to get the bits moved across the wire and time to redeploy that application. So your RTO may be an hour while your RPO may be 15 minutes. It all depends on data movement and how long it takes to deploy that application. Prasad, do you have anything more to add?
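To make the distinction concrete, here is a minimal Python sketch of the arithmetic Matt describes. The timestamps are hypothetical (only the 11 a.m. backup, the 15-minute RPO, and the one-hour RTO come from the discussion), and the function names are invented for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure_time: datetime) -> timedelta:
    """Data written after the last good backup is lost; the RPO bounds this window."""
    return failure_time - last_good_backup


def downtime(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Time the application was unavailable; the RTO bounds this duration."""
    return service_restored - failure_time


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """An objective is met when the measured value does not exceed it."""
    return actual <= objective


# Hypothetical incident: backup at 11:00, failure at 11:10, service back at 11:55.
backup = datetime(2023, 10, 18, 11, 0)
failure = datetime(2023, 10, 18, 11, 10)
restored = datetime(2023, 10, 18, 11, 55)

rpo = timedelta(minutes=15)  # tolerated data loss
rto = timedelta(hours=1)     # tolerated downtime

print("data loss window:", data_loss_window(backup, failure))
print("downtime:", downtime(failure, restored))
print("RPO met:", meets_objective(data_loss_window(backup, failure), rpo))
print("RTO met:", meets_objective(downtime(failure, restored), rto))
```

The two measurements are independent: a short data loss window says nothing about how long the data movement and redeployment will take.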

Prasad Dorbala:    
I think, as you rightly said, if you have a database and you're doing a checkpoint on your data, and the data is resident on your disk and then gets copied over to a different location, the moment you do that checkpoint is your recovery point: you are able to recover up to that known good state from an application consistency standpoint. For the recovery time objective, let's assume you're not active-active; you're active-passive or active-standby. If you want to bring up your services, how fast can you bring them up so that the services can reach your application data and be able to serve? That is your RTO, the recovery time objective.
Just digging deeper into that: enterprises by and large have hundreds and hundreds of applications. So the typical question is, what are the customer-facing or critical applications? You first need to get an inventory of what applications you have, because not all applications have the same criticality of being available. Beyond the technology, there is also a process for when you declare a disaster. Typically some organizations have a restriction with respect to the geographic location.
SOC 2 also describes how many miles away your recovery site is from your primary site. Why is that important? Let's say you're in a data center and you have a dual feed for your power, but something happens in that geographic area and you lose both feeds. Obviously you'll have a backup by diesel or something like that, but you drain that out, and then you want to go to a different region, not the same region, so that you don't have that impact. But when you are moving from one region to another, do all applications need to move, or which critical applications need to be active-active to serve the customers? So there is a whole inventory which people do of what is necessary for real-time and near-real-time applications. There could be some, like back-office applications, which can have some tolerance with respect to when they come back up, or there could be a cold start, saying, hey, these things can come up an hour later, but I want to make sure the real-time or customer-facing applications are there. That's the strategy you need in order to get your DR plan in place. Not only this, there is also incident response management for telling customers what is happening, so there is a lot going on, but we are trying to focus on technology at this point.
If you were to have active-active, there is what we call CDP, continuous data protection. With continuous data protection the RPO is almost near real time: you want synchronous replication of data from site A to site B. And the world of Kubernetes and containers has helped us a lot to scale applications quickly. Previously, before containers, you would have to bring the servers up and the applications up, and that used to take a longer time.
Now in this container world of Kubernetes, people are asking, can I really have a zero RTO, as in not miss a heartbeat from a customer standpoint, and then be able to scale? That is what you would need to do with respect to active-active, but you don't want to do it for the entire infrastructure of 300 or 400 applications; you want to do it just for the critical applications. Some of them may be in a hot standby, where they have the tolerance to be available within 30 minutes. Say, for instance, you're metering certain services and the billing is monthly: although you have telemetry of the utilization, you don't have to bill it immediately. Those kinds of applications, what we call back-office applications, can come back a little later. Similarly, if there are analytics you need to run on a monthly or quarterly basis, those need not be as real time, so they may come up with a cold start. So each one would have a different strategy.
Remember, data is the new oil. Every application relies on data, and data is the key factor for recovery, so how you replicate that data is the most important part of a DR strategy. In the world of Kubernetes you have lots and lots of objects. When you bring services and data onto Kubernetes, you store the data in persistent volumes, and then you make claims so that your application gets the persistent volume. One important factor in the Kubernetes world is that persistent volumes can be virtualized when you are inside a cluster. But when you want to go across clusters, because you have this dependency across regions, how do you replicate the persistent volumes so that you have data consistency in a different location?
That is one important factor you need to consider. When an application comes up, whether it uses mTLS for service-to-service communication or a different type of certificate, those are all things you have to carry over. We used to use a statement when recovering: before you do any good, don't do any harm. Internal threats and external threats are very important for any operation from a security perspective. So RBAC rules, RBAC policies, and network policies are all important even on the recovery side. How do you propagate those, and how do you make sure that when you go to a different site the RBAC settings are maintained? Those are all important. Kubernetes also gives you the freedom to create your own custom resources: you take third-party controllers and deploy them. When you go to a new site, how do you make sure that your CRDs, the custom resources and the state associated with them, are maintained? Those are all very important. Similarly, Ingress rules, services, and tooling: you've got to have insight into what is happening, so those are also important to consider. Matt, you want to take this, or should I? So, as I was saying, on the question of snapshotting your PVs: based on your recovery time objective and recovery point objective, you could do synchronous replication if the latency between the production site and the recovery site is less than 10 milliseconds.
If the sites are farther apart than that, and your application is sensitive and cannot tolerate 60 or 80 milliseconds of delay, then you would have to do snapshotting and synchronization. That is one aspect, and it defines an RPO for you. But RTO is the user experience: how do you make sure the service is there? The infrastructure and its scaling, the deployment of your applications, RBAC policies, all of those things are important for making that recovery smoother and easier. That is what is important for any kind of framework.
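The latency rule of thumb in this answer can be written down as a small decision function. This is a sketch only: the 10-millisecond limit is the figure quoted in the discussion, while the function name, the mode labels, and the sample latencies are assumptions made for illustration.

```python
# Threshold quoted in the discussion; treat it as a tunable assumption.
SYNC_LATENCY_LIMIT_MS = 10.0


def choose_replication_mode(round_trip_ms: float,
                            limit_ms: float = SYNC_LATENCY_LIMIT_MS) -> str:
    """Pick a replication approach from the measured inter-site latency.

    Below the limit, synchronous replication is considered feasible;
    at or above it, fall back to periodic snapshots plus synchronization.
    """
    if round_trip_ms < 0:
        raise ValueError("latency cannot be negative")
    if round_trip_ms < limit_ms:
        return "synchronous"
    return "snapshot-and-sync"


# Hypothetical measurements between a production site and two recovery sites.
for site, latency in {"nearby-region": 2.0, "cross-country": 60.0}.items():
    print(site, "->", choose_replication_mode(latency))
```

The mode chosen here only shapes the RPO; the RTO still depends on how quickly the infrastructure, deployments, and RBAC policies come up at the recovery site.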

Matt LeBlanc:    
We're going to take a little step back. I did see in the comments that one person had asked what Kubernetes is, and I'll simply state that Kubernetes is the next-generation platform for your workloads. If we go back one or two steps to virtualization, everyone moved their operating systems and applications into a VM. Then we moved into the world of containers; Docker was probably the most well known, and Kubernetes is ultimately the winner of the container wars, shall we say. When Kubernetes first started, everything was believed to be, or was mostly, ephemeral, in that there was no standard way of accessing your storage. So there are a few misconceptions here.
Misconception number one: all Kubernetes workloads are stateless. That's not correct, because the CSI driver, which I'll talk about in a minute, allows for saving state to your disks, whatever disk you're using.
Misconception number two: data protection is not my responsibility. We are all responsible for that data, so if you deploy an application that has state, you need to be aware that you have to protect it. We talked about why: customer experience, recoverability, regulation, and revenue loss.
Misconception number three: I don't have any stateful apps. Are you sure? I've had this conversation with many, many people, and maybe it's just the terminology of stateful versus stateless and the history of believing that Kubernetes did not have the ability to carry state, but things have changed. I will point out that the CSI drivers, that's the Container Storage Interface, were introduced in late 2019, and about a year later they added the CSI snapshot functionality. Once that was available to the public, it allowed stateful apps to be deployed. So rather than asking my customers about their stateful apps, I'll rephrase it: are you running a database in your cluster? If you do have a database, you're running stateful data. Some examples: [ __ ], Postgres, Cassandra, MySQL, Elasticsearch. Those have an active database, and they reside on the PVC, the persistent volume claim; that's the chunk of disk that is reserved for that application. So we now understand that not all Kubernetes applications are stateless, because of all of those applications or databases. They're all available on the CNCF website; there are many projects out there, and many of them are databases. Next slide.
So how does KubeSlice help? We can help with that recoverability. We talked about those five components earlier; it needs to be secure. What KubeSlice does is allow you to connect securely and complete your replication across your clusters in a timely and reliable way.

Prasad Dorbala:    
Matt, hold on a second, there is a question: what does stateful mean here?

Matt LeBlanc:    
Oh, stateful means that you are saving the state. Let me talk about stateless first. If you have an application that is stateless, basically it's processing something. I'll give you an oversimplified example: your app is going to grab some data, say from an S3 bucket, process that data, and hand you the answer. The answer is 42. It's on your screen; you turn your computer off and it's gone. It is not saved; it is stateless. You got your answer and that's it, end of story. A stateful application is one that keeps that data. Whether you're talking about a spreadsheet on your own computer or a database, you are saving that data, because the intention of a stateful application is that you're going to write to it, read from it, and update it. So that is pretty much the difference between stateful and stateless. Thanks, Prasad.

Prasad Dorbala:   
And to embellish what Matt said, there is a distinction in how Kubernetes deals with stateful and stateless. If you look at a pod name, stateless pods have a pod name made of the service name and so on, plus some hash key associated with it, so the identity is ephemeral: they come and go, and they can be anything. Whereas stateful workloads have a certain sequence in which they need to be brought up. When you are working with large-scale databases, or any kind of application which actually brings up databases, they're all stateful. In fact, their identity is maintained: when you shut one down and bring it back up, its identity is also maintained. So that is important for understanding the difference between stateless workloads and stateful workloads.
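The naming difference Prasad describes can be illustrated with a short Python sketch. It mimics the general shape of Kubernetes pod names (StatefulSet pods get a stable ordinal suffix starting at 0, while Deployment pods get a template hash plus a random suffix); the helper names are hypothetical, and the random suffix uses a simplified alphabet, not the exact one Kubernetes uses.

```python
import random
import string
from typing import List


def statefulset_pod_names(name: str, replicas: int) -> List[str]:
    """StatefulSet pods are named <name>-<ordinal>; the ordinal survives restarts."""
    return [f"{name}-{ordinal}" for ordinal in range(replicas)]


def deployment_pod_name(name: str, template_hash: str, rng: random.Random) -> str:
    """Deployment pods are named <name>-<template-hash>-<random suffix>.

    The suffix changes every time a pod is recreated, so the identity is ephemeral.
    """
    alphabet = string.ascii_lowercase + string.digits  # simplified for illustration
    suffix = "".join(rng.choice(alphabet) for _ in range(5))
    return f"{name}-{template_hash}-{suffix}"


rng = random.Random()
print(statefulset_pod_names("mongodb", 3))          # stable, ordered identities
print(deployment_pod_name("web", "7d9c5b4f", rng))  # differs on every recreation
print(deployment_pod_name("web", "7d9c5b4f", rng))
```

The stable ordinal is what lets a database replica come back with the same identity, and the same storage claim, after a restart.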

Matt LeBlanc:   
There is a question up on the screen I would like to address. It says: are there any solutions for efficient data backup in K8s (Kubernetes) that is hosting Postgres on a persistent volume? Currently we are using an external VM for hosting the DB. There are many data protection players out there. One of the challenges with all of those solutions is that they're really focused on each individual cluster, so all backups are done on a per-workload, per-cluster basis. If you want to back up your PostgreSQL on cluster one, you're going to log into the backup console for cluster one and say, let's back up PostgreSQL, and then you have a different management console over on the other side. That applies to all of the solutions out there. There's one free open source project called Velero, which is funded by VMware, and that will allow you to do backups from the command line, but there are many other backup products out there. The challenge is that you don't have a single place to manage all of that, and again, that is where KubeSlice can come in to help and provide a means of securing your data through replication very quickly. Prasad, do you have anything else to add?

Prasad Dorbala:    
Yeah. In fact, one of the important factors for replication is that Postgres by itself gives you the ability to do WAL file movement and different things. As you all know, Kubernetes is always centered around the cluster boundary. When you want to do disaster recovery and go multi-cluster, that is where the challenges are going to be. KubeSlice is a platform which enables all that replication, so that you can turn on the active Postgres and the backup Postgres and then move the WAL files over through KubeSlice. If somebody is interested, reach out to us and we'll show you how easy it is to do Postgres replication across multiple clusters using KubeSlice. That's something we have done on this platform quite easily. A little later I will show you a demonstration of how we do replication with MongoDB. It's a similar concept, and I'll show it in our demonstration.

Matt LeBlanc:    
We have another question, Prasad: don't clouds like GCP and AWS have their own DR solutions for GKE and EKS clusters? Actually, most of the cloud providers out there don't really have a solution. They may have something for their EC2 instances, but for EKS, AKS, and GKE the responsibility for the data is on the end user. I'm going to repeat that: the responsibility for the data is put on the shoulders of the owner of the data, the user. They'll help you with that managed Kubernetes cluster, but they're not there to protect your data. This is very similar to, say, Office 365, where yes, Microsoft will replicate that data across your laptops and from the cloud, but the end user is responsible for that data. So if I delete my data, I'm responsible for getting it back, or my employer is; hopefully they have some solution in place. So that is why there really are no built-in solutions, and, aside from the one we're going to talk about today, there are no solutions for going across clouds. If you wanted to have an application running across AKS, EKS, and GKE, that is possible, which I believe Prasad is going to show you later on today.

Prasad Dorbala:   
Yeah, another trend in the industry is that people don't want vendor lock-in. While it is useful to jump-start a lot of different things with the cloud providers' solutions, the whole motion of the CNCF is to make things cloud native and Kubernetes native. If you were to do something in a Kubernetes-native way, then you have to build your own tools or use tools which are out there in the market, so that you don't get locked in to the solutions built by the cloud providers.

Matt LeBlanc:    
Alright, so back to our deck here. We were talking about security, and KubeSlice creates a non-routable connection between your namespaces in each of your clusters. I'm going to repeat that: it creates a flat network over a secure OpenVPN tunnel, and it allows access only to the information within namespace A in cluster A and namespace B in cluster B. That allows for the replication Prasad was talking about. From a confidentiality standpoint, it is non-routable, and you have very granular role-based access control, so you can limit access to those namespaces. It allows for performance monitoring, so you know whether you are within the response time that is necessary if you have a synchronous requirement; I believe 10 milliseconds was the maximum Prasad mentioned earlier, and otherwise it's really going to go into an async type of scenario. For processing integrity, KubeSlice is able to do monitoring and logging, so you know exactly what is going on and who has access to what. And finally, on the privacy side: without KubeSlice, the only way to connect two clusters is really to connect them fully, so if cluster A is connected to cluster B, everyone and everything inside cluster A can see all the objects within cluster B. What KubeSlice does, instead of going the north-south route, basically going out of your cluster, over the internet, and back down into your other cluster, is take the east-west passage, which allows for a point-to-point tunnel between those namespaces. Prasad, anything else to add?

Prasad Dorbala:   
I think Nimish was saying it would be super useful to have a multi-tenancy architecture in which each client has their own databases. In fact, as you'll see further down when we go into the demo, the concept of a slice is exactly that: we are building networking within a single cluster or across multiple clusters and grouping namespaces to create a tenancy. A tenancy can be a team or a set of applications. If I have time, I will show you that network solution, which is a flat network; it's like a virtual cluster across multiple physical clusters, and you can have multiple virtual clusters inside the same cluster or across a fleet of clusters. That's the construct we have built with KubeSlice. This is an open source project; you can go to github.com/kubeslice and try it out. It is quite easy to put certain namespaces on KubeSlice and then create a network service on that. I hope you're interested.

Matt LeBlanc:    
I think we go to the next slide. That's cool, and that is it, so that leads us to the demo Prasad was talking about. Why don't you explain how we have this set up here?

Prasad Dorbala:   
Yeah, so let me just... hopefully my token did not expire. There we go. So don't worry about the trial license; I'm just using this. This is KubeSlice in action. This is the enterprise version I'm showing you, but there is an open source version; you would not see the pretty UI, but most of the functionality is there. The way KubeSlice works, it has a controller, which can be a cluster by itself or part of a cluster, and then you register a bunch of clusters into your framework. You can register very simply with the manual option: name the cluster, put in the kubeconfig, and hit import, and the cluster is registered automatically. What do I mean by registering? It actually deploys an operator. We use an operator model so that you don't have any kind of drift and all that. The operator gets installed, and then you can use that operator to create a bunch of slices.
A slice is nothing other than a group of namespaces that you manage. So take the mongodb slice: I created it saying that there are mongodb namespaces in worker 1, worker 2, and worker 3. When you look at it, there is a cluster in GCP, a cluster in AWS, and a cluster in Azure. In this case I used a multi-cloud scenario, but you could use it all on EKS, or for your on-prem solutions. We support edge, data center, and all clouds. The distribution doesn't matter to us: whether you have a Rancher distribution or an OpenShift distribution, you can use that. You can see here, when you go to manual mode, the different types of clusters you can select: Azure, edge, GCP, kind, OpenShift, Oracle, Rancher. All these clusters can be onboarded onto your controller.
Then what you need to do is create a slice by adding a bunch of things under Add Slice; you can create different sets of namespaces and so on. I'm not going to do that now for demonstration purposes, but from a single pane of glass you can manage the clusters from a topological view. Say, for instance, Azure to AWS: it can be two milliseconds because both of them are in US East. But if it is 60 milliseconds from US East to US West, that is something you want to see, to decide whether the replication strategy is continuous, synchronous, or asynchronous. Some applications can survive that 60 milliseconds. This is the speed of light; you can't change much from the standpoint of physics. So you can see how clusters are communicating. As you can see, worker 2 and worker 1 are very close to each other, so that's the reason the latency is low, but the other one is different. You can see the health of all the clusters, slices, and each component inside them, and you can also manage resource quotas.

Prasad Dorbala:   
So Alex asks: what does the milliseconds figure mean? It is the latency between two endpoints talking to each other, the amount of time it takes for the round trip. It is the delay between the two clusters; that is what milliseconds means in this view. You can also do resource management across multiple clusters.

So why am I showing you all this? I created a slice between three clusters, discovered services across all three clusters, and created a network service, which is a flat network with an address scheme of 192.168.16.0.0. That way each cluster can have its own pod CIDR and there is no conflict; we create an overlay network on top of it. Then you connect them so that they all see each other. You can add more clusters as you go along, or remove clusters easily, through a single pane of glass. So I did this and then deployed MongoDB.

If people are familiar with Lens, I'm going to show you Lens. This is the AWS cluster. If you look at the mongodb namespace and its pods, there is a MongoDB set, 001, in the cluster in AWS. If you go to Azure and look at mongodb, there is a set 1. If you remember, we were talking about what is stateful and what is stateless: any database is a StatefulSet, and that's the reason the pod name is the name you defined. When you go to GCP and look at the mongodb namespace, there is a replica set 2. MongoDB has an operator which coexists with the databases and handles the lifecycle management of the database. That's the reason mongodb 1 is primary here, mongodb 2, which is running in the Azure cluster, is secondary, and mongodb 3 is in GCP. So if one of the sites goes down, one of the others will become primary, and when the site comes back up all the replication happens.

You can see the processes in real time. Let it generate data every two seconds, and you can see all the transactions going across these heterogeneous Kubernetes clusters across clouds and being replicated. You can measure the metrics: what network latency it is seeing, how much throughput, all that stuff. So this became the DR scenario where, if any site is down, another site immediately picks up, and you have three different sites. This was a bank's solution, where they wanted to have not just a single recovery site but multiple sites. The multiple sites were read-only, whereas one site took the writes. That's the environment we have demonstrated, and they are testing it out on KubeSlice.

All of these things are very simple because across clusters we are able to export services. I don't know how many of you are familiar with the ServiceExport and ServiceImport constructs that SIG Multicluster has defined in the MCS API; MCS is the Multi-Cluster Services API. That API is what is implemented in KubeSlice, which allows you to export services from different locations and make sure that your MongoDB is DR-proof. One site goes down, and another takes over. For instance, some time back I made a change, so this one became primary and 0 became secondary. That kind of replication is easily possible on KubeSlice.
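The three-site failover described above relies on a general property of MongoDB replica sets: a new primary can only be elected while a strict majority of voting members can still reach each other. The sketch below illustrates that rule only; the function name and the site labels are hypothetical and are not taken from the demo.

```python
def can_elect_primary(voting_members: int, reachable_members: int) -> bool:
    """Return True if the reachable members form a strict majority.

    Illustrative only: models the majority rule for replica-set
    elections, not MongoDB's full election protocol.
    """
    if not 0 <= reachable_members <= voting_members:
        raise ValueError("reachable_members must be between 0 and voting_members")
    return reachable_members > voting_members // 2


if __name__ == "__main__":
    # Hypothetical layout: one voting member per cloud, as in the demo.
    sites = ["aws", "azure", "gcp"]
    for failed in sites:
        survivors = [s for s in sites if s != failed]
        ok = can_elect_primary(len(sites), len(survivors))
        print(f"{failed} down, survivors {survivors}: primary possible = {ok}")
```

With three sites, losing any one still leaves two of three members, so a primary can be elected. With only two sites, losing one leaves no majority, which is one reason a third location is useful.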

Prasad Dorbala:   
Will the DR environment cluster be directly connected with the live environment cluster all the time, or only during a disaster? So Sajal, the question is whether you connect all the time or not. I think it goes back to the presentation I shared. If you want near-real-time RPO and near-real-time RTO, then it is synchronous replication, and it is always connected. If it is a hot standby, then it is asynchronous replication: you're taking snapshots and shipping those. But if it is a cold start, then it is a restore from backup; at that point you don't need any active-active scenario, you redeploy and bring things up. Did I answer that question?
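The three options in this answer can be summarized as a small lookup. This is a sketch under stated assumptions: the tier keys and field names are hypothetical labels chosen for illustration, and the `link` value for the hot standby tier is an assumption, since the answer only says that snapshots are taken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    replication: str   # how data reaches the DR site
    link: str          # when the DR cluster is connected to the live one
    recovery: str      # what happens when a disaster is declared


# Hypothetical tier names; the contents paraphrase the answer above.
TIERS = {
    "near-real-time": DrTier("synchronous", "always connected", "traffic shifts to the live replica"),
    "hot-standby": DrTier("asynchronous (snapshots)", "connected when snapshots ship (assumed)", "promote the standby"),
    "cold-start": DrTier("backup only", "not connected until restore", "redeploy and restore from backup"),
}


def describe(tier_name: str) -> str:
    tier = TIERS[tier_name]
    return f"{tier_name}: {tier.replication}; {tier.link}; {tier.recovery}"


if __name__ == "__main__":
    for name in TIERS:
        print(describe(name))
```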

Matt LeBlanc:    
Yeah, there was another question: please address the zero RTO bit. How do we enable zero RTO?

Prasad Dorbala:    
So what we are enabling with KubeSlice is a slice across a multi-cluster environment, for instance cluster one to cluster two to cluster three. If cluster one goes down and you already have the deployment spec in cluster two, you autoscale using HPA. There is an initial hiccup, yes. You set the HPA minimum to, say, five, or whatever suits your service, and the max is how far you actually grow, and then you enable the cluster autoscaler, so as the traffic comes in, the nodes get deployed. A very important thing to remember in the move to containers: previously we used to have servers per application, but now, in the Kubernetes way, the application can run on any server in the pool. So you do need to take node affinity and those things into consideration, and make sure those labels are there. But once the node pool brings up more nodes, HPA kicks in and the deployment grows. So as long as you have minimum capacity in place from an HPA standpoint, it is literally zero RTO.

Matt LeBlanc:    
Prasad, we do have a few folks that are probably newer to Kubernetes. Could you explain the concept of HPA?

Prasad Dorbala:    
So HPA is Horizontal Pod Autoscaling. With replicas you can have the same service deployed multiple times; it's about how many replicas you want to have. Replicas might be three in number; if you don't want any replicas, it is zero. Kubernetes spins up those pods on demand based on various metrics. The typical HPA metrics are CPU utilization and memory. In fact, we at Avesha have a reinforcement-learning-based HPA which looks at the service availability SLA as well as the service latency, and we grow based on that. Kubernetes will automatically spin up pods if the service is starved of CPU or memory, so that you have more copies of the service and can serve more traffic. That is HPA.
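As a rough illustration of the scaling rule behind the standard Kubernetes HPA, the documented core calculation is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, clamped between the configured minimum and maximum. The sketch below implements only that core formula; the function and parameter names are hypothetical, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Core HPA ratio calculation, clamped to [min_replicas, max_replicas].

    Simplified: ignores the controller's tolerance and stabilization logic.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Standby cluster holds a floor of 5 pods; failover traffic pushes CPU to 90%
    # against a 50% target, so the deployment should grow.
    print(desired_replicas(5, 90.0, 50.0, min_replicas=5, max_replicas=20))
    # Traffic drains away; the floor keeps 5 pods warm for the next failover.
    print(desired_replicas(5, 10.0, 50.0, min_replicas=5, max_replicas=20))
```

The minimum acts as the warm floor mentioned in the zero RTO answer: the standby never scales below it, so there is always some capacity ready when traffic arrives.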

Matt LeBlanc:    
So that's Kubernetes' ability to scale up as well as scale back.

Prasad Dorbala:    
scale down right yeah 

Kunal Kushwaha:    
yeah we have compiled some of the questions I'll put them on the screen 

Prasad Dorbala:   
Alright, so I'm going to stop sharing. What level of technical expertise or knowledge is required to implement and manage KubeSlice efficiently? Our goal is to make it as simple as possible. Matt keeps asking me, "Can my grandma do it?" That's our mantra: make sure that people who are not as proficient in Kubernetes are able to do it. So we have automated a whole slew of things, and everything is managed through a controller. And by the way, there is a rich UI, thanks to our marketing team, who have spent a lot of time thinking about the usability aspects. Enterprise is point and click, and that's how we do it.

Prasad Dorbala:    
Should be aware when implementing KubeSlice for disaster recovery? So I come from regulated industries; in fact I did FedRAMP and other things. All of these need disaster recovery, whether it is SOC 2, FedRAMP or HIPAA compliance. As Matt described, availability, security and protection are the most important. Availability: when the service goes down, how fast is it available again? Security: who is protecting it, is there any intrusion, what are the access privileges? Protection: encryption at rest as well as encryption in transit.

What KubeSlice does is reduce the surface area inside a cluster, which gives you the security needed to protect it. KubeSlice also has a controller-driven model, with the controller as well as the worker clusters, so it eliminates the config drift problems. If somebody goes and makes changes in some cluster, the operator automatically sees that the source of truth says one thing and the cluster says another, and it changes it back. So from a regulation standpoint it helps with SOC 2 compliance, and it helps make sure that you don't have collateral damage. It also helps with noisy-neighbor problems, so that you don't have starvation issues. And it helps with RBAC across multiple clusters. The question you always ask is: Prasad has access to US East; tomorrow I spin up a cluster in US West. Does Prasad have the same privileges there, and do I have the same Prasad there? That is a bigger challenge, because when you are in a rush you forget a lot of things. You have to automate all of that, whether through Terraform or GitOps, but KubeSlice maintains all of that so you don't have to.

Matt LeBlanc:    
When I was in my previous role, I did a lot of work with the Canadian market, which is actually a pretty hot Kubernetes market. In some of my conversations up there I learned that many of my customers were looking to have a secondary region to replicate to. This would be an easy way to get that done, because quite simply it creates a point-to-point network between those two namespaces. As Prasad pointed out, it's a non-routable address, a 192 network, it has its own DNS, and it's encrypted, so there's no way for anyone outside of those namespaces to get access to that data. So it is a secure and easier way for you to replicate your data to another region.

Matt LeBlanc:   
Let's see: what kind of ongoing maintenance or management is required when using KubeSlice for DR? Are there automated processes or self-healing mechanisms in place?

Prasad Dorbala:   
Yeah, look, that's the reason we used the operator model. There are a whole bunch of reconcilers maintained inside KubeSlice to make sure that if anything happens, it puts things back into the state they were supposed to be in. That's also the reason we prescribe that the controller be a different cluster from the fleet of worker clusters. If you look at the Kubernetes architecture, etcd, the scheduler and all that are separate from your fleet of workers. The same paradigm is maintained in KubeSlice: separate out the control plane for all the slice lifecycle management and the semantics associated with it, and the reconcilers inside the worker clusters maintain the state and self-heal if there are any problems. That's the pattern we have used, and thankfully Kubernetes gives you the ability to do that. Previously it used to be a lot of work for us, but Kubernetes has made it simple to put the operator model in, and that operator model is what we use.
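The reconciler idea mentioned here, comparing a declared state with what is actually observed and correcting any drift, can be sketched generically. This is not KubeSlice's implementation; the function name, the dictionary keys and the example values are all hypothetical.

```python
def reconcile(desired: dict, observed: dict) -> dict:
    """Compute the corrections that bring `observed` back to `desired`.

    Generic illustration of a reconcile step: anything missing or changed
    is re-applied from the source of truth, anything unexpected is deleted.
    """
    to_apply = {key: want for key, want in desired.items() if observed.get(key) != want}
    to_delete = sorted(key for key in observed if key not in desired)
    return {"apply": to_apply, "delete": to_delete}


if __name__ == "__main__":
    # Hypothetical source of truth held by the controller.
    desired = {"namespace/mongodb": "slice-a", "quota/cpu": "4"}
    # Someone edited the worker cluster by hand.
    observed = {"namespace/mongodb": "slice-a", "quota/cpu": "8", "rolebinding/temp": "admin"}
    print(reconcile(desired, observed))
```

A real operator runs a step like this continuously, triggered by watch events, so manual edits on a worker cluster are reverted without anyone noticing the drift first.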

Matt LeBlanc:    
There was another question on there that asked whether we are a CNCF project: is KubeSlice a CNCF project?

Prasad Dorbala:    
Yes, we are going to put it into the sandbox. We haven't done that yet, but we are ready to get there.

Matt LeBlanc:    
but we are an existing member 

Prasad Dorbala:   
And I believe we probably should have that submitted by the next KubeCon in Chicago.

Matt LeBlanc:    
Yeah, that's our goal. Are there any specific deployment models or architectures that organizations should consider when implementing KubeSlice for Kubernetes disaster recovery?

On DR models: people tend to say that an availability zone is a disaster recovery zone, that if you spread across multiple availability zones it is good enough. It is not. It does protect against certain problems. But just recently, was it June 13th, 10 or 15 days ago, AWS had problems in US East. At my previous company, the whole Kubernetes cluster was affected. If you look at their status page, we were down. We were trying to make sure we could go across regions, and the AWS West region didn't have any problem. If I had had a cluster in there and declared a disaster, I would have been happier not answering questions from my end customers asking why the hell the service was down.

Matt LeBlanc:   
So this means that if you're in multiple regions, you can be assured that if one of those regions goes down, you should be able to keep your data integrity and your application up. What is also interesting: I remember, I think it was Q4 of 2021, which I know is a while ago in the world of Kubernetes, AWS went down three times that quarter, and Azure went down at least once that quarter I think, as well as Google. So there is this idea that clouds don't ever break; that's not entirely true. With KubeSlice you could run a multi-cloud solution in that heterogeneous fashion, so that if AWS goes down, because we know that 80 percent of people are in AWS, you'll be able to recover in Azure or in Google or on-prem in your data center. So there are a lot of different options there.

Prasad Dorbala:    
I think that people should not forget BGP [Laughter]. In networking, BGP is the protocol underneath everything, and it can route things out to different locations, and then people can go into a black hole for a long time.

So, what kind of monitoring and alerting capabilities does KubeSlice have to help? Look, our pedigree is operations. We have built in all the metrics which are needed for visibility. I have built a lot of startups before where "lights blinking, bits flowing" was good enough, but if you don't have the insight you need for the RCA, that is a problem; it is very critical. We have integrations into Slack, we have integrations into different places, we have all the events, and a whole bunch of logs which you can pull in. So there are lots of ways to look into everything, proactively and reactively.

Matt LeBlanc:
As well as integration with Prometheus, so you'll be able to pull all that data, and if you're running your own Prometheus server, you'd be able to set up alerts based on the data being pulled in.

Prasad Dorbala:    
Yeah. Does KubeSlice work on a bare-metal Kubernetes cluster? Absolutely. It has no bearing on which distribution you use. In fact, one of our customers is a gaming customer who is running KubeSlice to bridge a bare-metal environment to a VM environment. On spikes they go to the VM environment, but in steady state they are actually using bare metal. So KubeSlice gives them that connectivity across different clusters and is able to support all the functionality we have.

Matt LeBlanc:    
Can you explain the underlying principles or technologies behind KubeSlice that make it effective in minimizing downtime and data loss?

Prasad Dorbala:   
So KubeSlice in itself gives you a platform to create a virtual cluster across multiple regions. Think about it this way: if you are in a cluster and one of the nodes goes down, the scheduler figures out how to redeploy those pods onto capacity on different nodes. Similarly, if one cluster goes down, because it's a virtual cluster you can move the workloads into another cluster, because the other cluster may have the capacity. So that's number one, the philosophy of spreading out as a virtual cluster. Number two is the networking. The definition of networking in Kubernetes is that every pod should be able to talk to every other pod without NAT; that's the foundation of Kubernetes. So we built an overlay network across clusters, so that if one cluster has a problem, the other cluster is available, and services don't have to go through any NATs or anything. It literally gives you the same feel as though you have one node going down in a cluster and the scheduler automatically schedules the pods on another set of nodes. That's the philosophy we have used across multi-cluster, multi-region setups. That way you can do PVC replication or PV replication, or even workload deployment.

Prasad Dorbala:    
So, very important, how frequently: that's a very, very important question. By the way, SOC 2 mandates, and the auditors ask you, that DR also be exercised every year. The test has to happen to get an attestation, so it is necessary to do that. Now, backups must be performed based on the business need. The more critical the business, the more important it is to take snapshots as frequently as possible. And by the way, after you perform a backup, don't just be happy: try to recover and see if it actually works. Many times I have run into the problem where the backup is there, but when we try to recover, it doesn't work because the data is corrupted. This is like regular hygiene: you wake up in the morning and brush your teeth, and you don't think twice. Backups should be just like that. Sorry Matt, you're on mute.
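The advice to restore after every backup can be turned into a routine check. The sketch below is a minimal file-level illustration, assuming the backup is a plain file copy; the function names are hypothetical, and a real database backup would be restored into a scratch instance and queried instead of only being compared by checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_and_verify(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up `source`, restore it elsewhere, and confirm the contents match."""
    backup = backup_dir / (source.name + ".bak")
    shutil.copy2(source, backup)      # take the backup
    restored = restore_dir / source.name
    shutil.copy2(backup, restored)    # exercise the restore path
    return sha256_of(restored) == sha256_of(source)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "backups").mkdir()
        (root / "restore").mkdir()
        data = root / "orders.db"
        data.write_bytes(b"example records")
        print("restore verified:", backup_and_verify(data, root / "backups", root / "restore"))
```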

Matt LeBlanc:    
Back to that original example I spoke about: when the air traffic control system was down, they hadn't practiced recovery, and as a result I was stuck in Toronto for another half day, and millions of other U.S. and North American travelers were not able to get home. They stopped flying. It was the first time we'd seen clear skies since 9/11. So how important is it that you practice your recovery? It is imperative. If you don't practice it, you're not going to know what you're doing at that time.

Matt LeBlanc:
Another great question I saw was: does VPC come into this picture? Do you need to make those changes? What about other infrastructure changes that need to be provisioned outside of Kubernetes?

Prasad Dorbala:    
That's a very, very important question; thanks for asking that, Sadio. Our philosophy is to make sure that it is Kubernetes native. I wish I had a slide to show, but we don't have time. We bring software-defined networking inside the Kubernetes cluster, so you don't have to do VPC peering. All you need to do when you define your infrastructure is define your CIDRs, that is, the service CIDR and the pod CIDR, and your networking; that is there by default when you bring up your clusters. That is it. The overlay network takes care of traffic across clusters; you don't have to do any VPC peering. We establish the tunnels, and the tunnels can go either through a public interface or, if you want, through a private interface across the different VPCs. We enable this from inside one cluster to the other cluster, but not outside.

That gives you two benefits. Number one, from a security perimeter standpoint, you're inside the same security perimeter. You are not changing infrastructure outside the cluster, going from Kubernetes into cloud-provided services, where you would need a different set of keys to encrypt the traffic, or where you would have to go to your CISOs and say, "Do I have permission to see all the data flowing through these different places?" All of that is taken away because we are natively inside the Kubernetes cluster.
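One reason plain routing or VPC peering gets awkward across clusters is that independently created clusters often reuse the same pod CIDR. The sketch below only detects such overlaps using Python's standard `ipaddress` module; the cluster names and CIDR values are made-up examples, and it does not model how the overlay network itself assigns addresses.

```python
import ipaddress
from itertools import combinations


def overlapping_pod_cidrs(cluster_cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of clusters whose pod CIDRs overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cluster_cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    # Hypothetical pod CIDRs; two clusters were created with the same default range.
    clusters = {"aws": "10.244.0.0/16", "azure": "10.244.0.0/16", "gcp": "10.8.0.0/14"}
    print(overlapping_pod_cidrs(clusters))
```

Clusters flagged by a check like this cannot simply be routed to each other, whereas an overlay with its own flat address range sidesteps the conflict, which is the point made earlier in the demo.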

Matt LeBlanc:    
What are the future plans and roadmap for KubeSlice? Are there any upcoming features or enhancements that users can look forward to? 

Prasad Dorbala:    
Oh, there are many things we are planning on our roadmap. Currently what we have is the construct of a slice. We are also thinking of different things, like prioritization inside it. Did I lose everybody?

Kunal Kushwaha:   
no I just highlighted you

Prasad Dorbala:   
Oh yeah, sorry. So there are many use cases: there is a secure slice, there is a data slice, there are different personas of slices that we are going to improve. Think about it this way: you have an iPhone, and then you have a set of apps and an app store. The iPhone is the foundation for you to build apps on. KubeSlice is the foundation we have built, and you can build different apps on it for these multi-cluster scenarios. The apps are infrastructure-centric apps, be it a resiliency app or a secure app. Now that PSP is deprecated in 1.26, PSS is built in, and Pod Security admission control is built in. So how do you make sure that the set of namespaces which are critical from a compliance standpoint are secure? That will be the secure app, the secure slice, which you build on top. Likewise, there are many functionalities we are thinking of.

Matt LeBlanc:   
Kunal, you have a fantastic community here, I have to tell you. The questions that we've got today are very well put and very well thought out, so thank you very much for having us today.

Kunal Kushwaha:    
Some nice live questions. Folks who registered for the webinar were able to share some nice questions, which you just answered, but there are plenty more that we did not get through. For those, I've updated the description with the Slack channel, so you can join the Slack channel and talk to Prasad and Matt over there, join the community, and keep the conversation going. I know some people reached out to me, so I'll connect them with Prasad via email, maybe on Monday. But yeah, we answered all the questions and are getting some nice feedback. Thanks Alex, really appreciate it. This is being recorded, so you can watch it again if you want, and we'll be doing more such sessions.

Prasad Dorbala:   
I'm not embarrassed to say this: I have been so impressed with Kunal. For the last two years I've been following him. He is dedicated, he is a guy who is helping the community, and I really thank you for what you're doing, Kunal.

Kunal Kushwaha:    
Yeah, I really appreciate it. It's all a team effort. It wouldn't be possible without having folks like you on the stream sharing your knowledge; you have so much experience. Data recovery, data on Kubernetes, is definitely a very important topic. I don't know if you know the Data on Kubernetes meetups; I did a session there on stateful versus stateless, and you definitely gave me new points to think about when talking about recovery of data, so I learned a lot as well. It was really fun, and this was the first one that we did live. We're trying to do many more, and hopefully some in-person meetups soon. I'm not traveling till September, so we'll plan some nice things.

Matt LeBlanc:    
So I do know that the Avesha team will be sponsoring a Kubernetes meetup that will be at the Google building in Cambridge, Massachusetts, which is basically part of the Boston area. And of course we will be at KubeCon in Chicago. If you need more information, please go to our website, avesha.io or aveshasystems.com; either one will get you there. And please reach out. I know there was a joke about Joey from Friends, but you can find me on LinkedIn. I happen to be from the same town; I went to school in the neighborhood where the actor grew up. We're not related, which is just a funny coincidence.

Kunal Kushwaha:    
oh I definitely spoke to you in North America 

Matt LeBlanc:   
ah you remember that story

Prasad Dorbala:   
Also, we have a KubeSlice channel in the Kubernetes Slack organization. We are there too, so just ping us anytime.

Kunal Kushwaha:    
And all the links are in the description the website and the slack channels so yeah cool um that's a good one

Prasad Dorbala:   
Joey doesn't share food but shares amazing knowledge

Matt LeBlanc:
Thank you. Yeah, a Friends reference, I thought.

Kunal Kushwaha:   
Cool alright well thanks for joining everyone and see you in the next one okay 

Kunal Kushwaha:   
Thank you bye-bye


Copyright © Avesha 2025. All rights reserved.
