
Download Whitepaper

Please fill in the form below to download a free copy of the whitepaper.

SmartScaler + Run:ai: Predictive Scaling That Outpaces KPA in Real-World LLM Inference

This performance report shows how Avesha SmartScaler improves LLM inference behavior on the Run:ai platform, consistently outperforming Run:ai’s Knative Pod Autoscaler (KPA) under identical conditions. Using Nebius H100 8×GPU nodes running Llama-3.1 8B FP8 on NVIDIA NIM, SmartScaler scales earlier, stabilizes throughput faster, and processes significantly more tokens during burst periods. While KPA reacts to concurrency thresholds, SmartScaler predicts load ahead of time using reinforcement learning (RL) models, GPU-aware telemetry, and real-time framework metrics. The result: shorter request queues, faster ramp-up, and more than 3× higher token throughput during sudden bursts, even though both systems operate with the same pod capacity thresholds. The document provides the full architecture, scaling logic, benchmarking methodology, and a side-by-side comparison of SmartScaler vs. KPA on Run:ai-managed clusters.
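To make the reactive-vs-predictive contrast concrete, here is a minimal, hypothetical Python sketch (not Avesha's or Run:ai's actual code) of the two scaling styles the report compares: a KPA-style calculation that sizes replicas from the concurrency already observed, and a predictive one that provisions for a short-horizon load forecast. The function names, the simple linear-trend forecast, and the numbers are illustrative assumptions; SmartScaler itself uses RL models and GPU-aware telemetry rather than this toy extrapolation.

```python
# Hypothetical sketch: reactive (KPA-style) vs. predictive replica sizing.
# All names and numbers are illustrative, not taken from the whitepaper.

from collections import deque


def reactive_replicas(current_concurrency: float, target_per_pod: float) -> int:
    """KPA-style: desired replicas track the load already observed."""
    return max(1, round(current_concurrency / target_per_pod))


def predictive_replicas(history: deque, target_per_pod: float, horizon: int = 3) -> int:
    """Predictive: extrapolate the recent load trend `horizon` steps ahead
    and provision for the forecast, so pods are ready before the burst peaks."""
    if len(history) < 2:
        return reactive_replicas(history[-1], target_per_pod)
    slope = (history[-1] - history[0]) / (len(history) - 1)  # simple linear trend
    forecast = history[-1] + slope * horizon
    return max(1, round(max(forecast, history[-1]) / target_per_pod))


if __name__ == "__main__":
    # A burst ramping from 10 to 90 concurrent requests; target 10 per pod.
    load = [10, 30, 50, 70, 90]
    window: deque = deque(maxlen=4)
    for t, concurrency in enumerate(load):
        window.append(concurrency)
        print(f"t={t} load={concurrency:>3} "
              f"reactive={reactive_replicas(concurrency, 10):>2} "
              f"predictive={predictive_replicas(window, 10):>2}")
```

Run on the ramp above, the predictive path requests capacity several steps before the reactive one catches up, which is the earlier scaling and faster ramp-up behavior the report's burst benchmarks measure.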

Author(s):

Avesha Team