SmartScaler + Run:ai: Predictive Scaling That Outpaces KPA in Real-World LLM Inference
This performance report shows how Avesha SmartScaler dramatically improves LLM inference behavior on the Run:ai platform, consistently outperforming Run:ai's Knative Pod Autoscaler (KPA) under identical conditions. Using Nebius H100 8×GPU nodes running Llama-3.1 8B FP8 on NVIDIA NIM, SmartScaler scales earlier, stabilizes throughput faster, and processes significantly more tokens during burst periods. While KPA reacts to concurrency thresholds, SmartScaler predicts load ahead of time using reinforcement learning (RL) models, GPU-aware telemetry, and real-time framework metrics. The result: shorter request queues, faster ramp-up, and more than 3× higher token throughput during instantaneous bursts, even though both systems operate with the same per-pod capacity thresholds. The document provides the full architecture, scaling logic, benchmarking methodology, and a side-by-side comparison of SmartScaler vs. KPA on Run:ai-managed clusters.
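To make the reactive-versus-predictive distinction concrete, the sketch below contrasts a KPA-style replica count, sized from the concurrency observed right now, with a predictive count sized from a short-horizon forecast. It is a minimal, hypothetical illustration: the function names, the per-pod concurrency threshold of 8, and the naive linear extrapolation are assumptions for clarity, not SmartScaler's actual RL policy or Run:ai's KPA implementation.

```python
# Hypothetical sketch: reactive (KPA-style) vs. predictive replica sizing.
# Not Avesha's RL model; a simple linear forecast stands in for it here.
from collections import deque

TARGET_CONCURRENCY_PER_POD = 8  # assumed per-pod capacity threshold (same for both)


def reactive_replicas(current_concurrency: int) -> int:
    """KPA-style: size the deployment from the load observed right now."""
    return max(1, -(-current_concurrency // TARGET_CONCURRENCY_PER_POD))  # ceiling division


def predictive_replicas(recent_concurrency: deque, horizon_steps: int = 3) -> int:
    """Predictive sketch: extrapolate the recent trend a few steps ahead and
    size the deployment for the forecast peak rather than the current sample."""
    samples = list(recent_concurrency)
    if len(samples) < 2:
        return reactive_replicas(samples[-1] if samples else 0)
    trend = (samples[-1] - samples[0]) / (len(samples) - 1)  # average change per step
    forecast = samples[-1] + trend * horizon_steps           # naive linear forecast
    peak = int(max(forecast, samples[-1]))
    return max(1, -(-peak // TARGET_CONCURRENCY_PER_POD))


# Example: a burst ramping from 4 to 20 to 48 concurrent requests.
window = deque([4, 20, 48], maxlen=3)
print("reactive  :", reactive_replicas(window[-1]))  # sizes for the current 48 requests
print("predictive:", predictive_replicas(window))    # sizes for the projected peak
```

Because the predictive path requests capacity before the burst peaks, new pods can finish pulling images and loading model weights while demand is still climbing, which is the behavior the report measures as faster ramp-up and higher burst throughput.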
Author(s):
Avesha Team