Kubernetes Scaling
Jason Bloomberg

Managing Partner, Intellyx

15 March 2023

Take any list of the benefits of cloud computing, and at the top you’ll likely find massive horizontal scalability.

In many ways, horizontal scalability was the original justification for building public clouds in the first place. Set up virtual machines to scale out automatically, and then roll up massive numbers of such VM instances to take advantage of the economies of scale.

The result: public clouds could deliver massive horizontal scale far less expensively than organizations could on their own.

Such automatic horizontal scaling, or autoscaling, drives the primary IaaS value proposition. Scaling out horizontally in the cloud is a simple matter of setting each instance’s autoscaling parameters properly. The result is essentially infinite horizontal scale, limited only by the budget.

While cloud computing (public clouds in particular) offers many advantages over on-premises alternatives, cloud-based scalability can also be quite expensive. Setting up instances to autoscale at the drop of a hat can run up the cloud bill dramatically.

The rise of Kubernetes and cloud native computing in general has changed the nature of autoscaling, in large part to optimize how dynamic software infrastructure leverages cloud resources.

Cloud native computing with Kubernetes handles horizontal scalability quite differently than IaaS does: Kubernetes scales at the pod level within the cluster, while IaaS scales at the instance level within the cloud’s own environment configurations.

Scaling operations in the cloud are slow, on the order of minutes, because each one provisions and boots a new VM instance. Kubernetes autoscaling at the pod level is far faster: starting a new pod typically takes seconds rather than minutes.

This horizontal pod autoscaling (HPA) is built into Kubernetes, automatically scaling pods up and down to deal with sudden increases and decreases in traffic.

And yet, while HPA provides greater agility and lower resource costs than cloud autoscaling can, it still has its drawbacks. Because HPA is reactive, even the short delay before it responds to a spike in resource demands can lead to slowdowns or even momentary failures.

Don’t let poor autoscaling strategies give your users the dreaded 503 Service Unavailable error. Instead, scale Kubernetes the smart way by taking a proactive approach to sudden changes in demand.

The Problem with Reactive Autoscaling

In Kubernetes, a horizontal pod autoscaler automatically updates a workload resource, such as a Deployment, to scale the workload to match demand.

When load increases, Kubernetes performs this scaling by deploying more pods. Correspondingly, if the load decreases, Kubernetes removes the now-excess pods, down to the minimum number of pods configured for the workload in question.

Each horizontal pod autoscaler, in turn, periodically adjusts the scale of its target deployment to match whatever metrics are relevant, including average CPU utilization, average memory utilization, or any custom metric the operator has configured the autoscaler to consider.
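To make the adjustment concrete, here is a minimal Python sketch of the ratio rule the Kubernetes documentation describes for the autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name, the replica bounds, and the sample numbers are illustrative assumptions, and the sketch leaves out details of the real controller such as missing metrics and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Replica count from the documented HPA ratio rule (simplified sketch)."""
    ratio = current_metric / target_metric
    # If the metric is within the tolerance band around the target,
    # the autoscaler leaves the current scale alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler (assumed values here).
    return max(min_replicas, min(max_replicas, desired))
```

For example, four pods averaging 90% CPU against a 60% target give a ratio of 1.5, so the sketch asks for six pods. Note that the calculation only runs after the metric has already moved, which is the root of the reactivity discussed next.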

While these autoscalers can make their adjustments quickly, such adjustments are nevertheless reactive, or after the fact, because they take action only in response to the metric crossing a fixed threshold.

A sudden spike in load can lead to a momentary resource constraint before the autoscaler can adjust. Such constraints can lead to slowdowns, out-of-memory errors, and in some situations, that dreaded 503 error.
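A toy simulation can illustrate this gap. The Python sketch below assumes a hypothetical workload in which each pod serves 100 requests per tick, the autoscaler sizes for 60% utilization, and a new pod needs two ticks to become ready. All of these numbers, and the simplification that the autoscaler only scales up, are assumptions for illustration rather than measurements of any real cluster.

```python
import math
from collections import deque


def simulate_reactive(traffic, pod_capacity=100, target_utilization=0.6,
                      startup_ticks=2, initial_pods=2):
    """Return the unserved load per tick when scaling only reacts to observed load.

    Assumes startup_ticks >= 1 and models scale-up only.
    """
    ready = initial_pods
    # Batches of pods that are still starting; the oldest batch is on the left.
    pending = deque([0] * startup_ticks)
    shortfall = []
    for load in traffic:
        # Pods requested startup_ticks ago become ready now.
        ready += pending.popleft()
        # Load beyond what the ready pods can serve goes unserved this tick.
        shortfall.append(max(0, load - ready * pod_capacity))
        # React to the load just observed, counting pods already on their way.
        wanted = math.ceil(load / (pod_capacity * target_utilization))
        missing = wanted - (ready + sum(pending))
        pending.append(max(0, missing))
    return shortfall


print(simulate_reactive([100, 100, 500, 500, 500, 500]))
# → [0, 0, 300, 300, 0, 0]
```

In this run the spike arrives at the third tick, and 300 requests per tick go unserved for two ticks until the newly requested pods are ready.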

To avoid such problems, the traditional technique is for the operator to overprovision the Kubernetes clusters in question so that sufficient resources are always available to handle such spikes.

While overprovisioning can decrease the chances of resource constraints, it is expensive to implement. One of the main economic motivations for moving from on-premises servers to the cloud and then from traditional IaaS to Kubernetes is to avoid such overprovisioning.

The last thing an operator wants to do is overprovision.

From Reactive to Proactive, ‘Smart’ Autoscaling

The best way to both optimize costs and avoid the constraints of horizontal pod autoscaling is to predict traffic demand ahead of time and scale both application and infrastructure resources up or down based on these predictions.
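As a rough illustration of scaling ahead of demand, the Python sketch below takes a traffic forecast as given and, at each tick, requests enough pods for the highest load expected within the pod startup lead time. The function names, the per-pod capacity, the 60% utilization target, and the two-tick lead time are all hypothetical, and producing an accurate forecast is the hard part that this sketch does not address.

```python
import math


def pods_needed(load, pod_capacity=100, target_utilization=0.6):
    """Pods required to serve a load at the target utilization."""
    return math.ceil(load / (pod_capacity * target_utilization))


def proactive_schedule(forecast, lead_ticks=2, pod_capacity=100,
                       target_utilization=0.6):
    """Replica count to request at each tick so capacity is ready in time.

    At tick t the schedule covers the peak forecast load between t and
    t + lead_ticks, because pods requested now only become ready later.
    """
    schedule = []
    for t in range(len(forecast)):
        window = forecast[t:t + lead_ticks + 1]
        schedule.append(pods_needed(max(window), pod_capacity, target_utilization))
    return schedule


print(proactive_schedule([100, 100, 100, 100, 500, 500]))
# → [2, 2, 9, 9, 9, 9]
```

Here the schedule jumps to nine pods two ticks before the forecast spike, so the extra capacity is ready when the traffic arrives, without holding nine pods during the quiet ticks before that.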

Reinforcement learning (RL) is an artificial intelligence technique that is well-suited for making such predictions.

The RL machine learning training method differs from both supervised and unsupervised learning, as it depends upon rewarding desired behaviors and/or punishing undesired ones. RL agents interpret data, take actions, and learn through trial and error in a simulator that leverages historical training data.
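The trial-and-error loop can be sketched with tabular Q-learning on a toy simulator. This is not Avesha's implementation, which the article does not describe. The historical loads, the per-pod capacity, the cost and penalty weights, and the hyperparameters below are all invented for illustration. Because the replica choice does not influence the next time slot in this toy, the problem is close to a contextual bandit, but it still shows how rewards and penalties shape the learned behavior.

```python
import random

HISTORY = [120, 480, 250, 90]   # hypothetical requests/sec for each time slot
POD_CAPACITY = 100              # hypothetical requests/sec one pod can serve
ACTIONS = [1, 2, 3, 4, 5, 6]    # candidate replica counts


def reward(load, pods, pod_cost=1.0, drop_penalty=0.1):
    """Negative cost: pay for every pod and pay a penalty for unserved load."""
    unserved = max(0, load - pods * POD_CAPACITY)
    return -(pod_cost * pods + drop_penalty * unserved)


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table by replaying the historical slots with epsilon-greedy actions."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in HISTORY]
    for _ in range(episodes):
        for slot, load in enumerate(HISTORY):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))   # explore a random replica count
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[slot][i])  # exploit
            r = reward(load, ACTIONS[a])
            # Best value of the next slot, or zero after the last slot.
            future = max(q[slot + 1]) if slot + 1 < len(HISTORY) else 0.0
            q[slot][a] += alpha * (r + gamma * future - q[slot][a])
    return q


def greedy_policy(q):
    """Replica count with the highest learned value for each slot."""
    return [ACTIONS[max(range(len(ACTIONS)), key=lambda i: row[i])] for row in q]
```

With these assumed weights, dropping traffic costs more than running one extra pod, so the agent should learn to provision just enough pods to cover each slot's load and no more.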

Smart Scaler from Avesha ingests both application and infrastructure performance data from Prometheus, the open-source monitoring tool widely used with Kubernetes. Based on these data, Smart Scaler uses RL to estimate the number of pods necessary for a given workload, as well as likely traffic patterns that might lead to spikes or other changes.

These estimates then feed back into the Smart Scaler RL engine, which continuously optimizes the number of pods in a cluster in advance of any predicted changes in traffic to the workloads in each pod.

This continuous predictive autoscaling of Kubernetes resources improves upon the reactive HPA built into Kubernetes.

The Intellyx Take

Scaling Kubernetes requires automation, and automation in turn increasingly relies upon AI.

There are many different types of AI, and even within the machine learning arena, there are several learning approaches. Choosing the right approach is an important example of ‘the right tool for the job.’

Reinforcement learning is particularly useful in situations with many rapid cause-and-effect scenarios, and in the case of Kubernetes horizontal pod autoscaling, such scenarios are precisely the problem at hand.

HPA by itself is an important improvement over cloud autoscaling, but without the power of RL, HPA will never meet the needs of organizations that require the full power of Kubernetes scalability. As a result, Avesha Smart Scaler is an essential tool in any Kubernetes toolbox.

 

Copyright © Intellyx LLC. Avesha is an Intellyx customer. No AI was used in the production of this article. Intellyx retains final editorial control of this article.
