Raj Nair
Founder & CEO
15 July, 2024
1 min read
The #1 myth (or mantra) of spike scaling: “throw more resources at it.” Is there a better way? Imagine scaling up a microservice that sits upstream of a chokepoint without first investigating the bottleneck. Picture simulating various traffic conditions to find it, all while struggling to keep a fast product delivery schedule within budget constraints. How many of you grapple with these challenges every day?
During a recent webinar, the conversation turned to the often-ignored difficulty of delivering SLOs and SLAs in cloud environments. One guest, a CTO who manages weekly events, vividly described the human cost of handling unpredictable traffic spikes. His team spends 40 person-hours preparing a single application for an event, only for SREs to then monitor scaling through the night. The frustration in his voice was palpable as he spoke of SRE burnout: “Humans aren’t particularly good at knowing what to scale in an application with hundreds of microservices. They can’t keep all the dependencies in their heads.”
Could RLHF (Reinforcement Learning from Human Feedback) be the answer? Picture an event co-pilot: an AI that not only calculates the precise scaling needed for each microservice but also adjusts in real time to traffic conditions. This AI, envisioned by the CTO, would relieve the stress on SREs through disciplined, dependency-aware scaling, balancing the workload and promoting sustainability. The alternative—scaling everything uniformly—can inadvertently create new bottlenecks and put SLAs and SLOs at risk.
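To make the contrast concrete, here is a minimal sketch of dependency-aware scaling. All service names, fan-out ratios, and per-replica capacities below are illustrative assumptions, not data from the webinar: the idea is simply that load must be propagated through the call graph before sizing each service, because a request into one service can fan out into several calls downstream.

```python
import math

def plan_replicas(entry_rps, call_graph, capacity, order):
    """Propagate a predicted entry-point load through a DAG of
    microservices and return the replica count each service needs.

    call_graph[a] = {b: calls_per_request} means one request into `a`
    triggers that many calls into `b`. `order` is a topological order
    of the services, entry point first. `capacity[s]` is the
    requests/sec a single replica of service `s` can handle.
    """
    load = {s: 0.0 for s in order}
    load[order[0]] = entry_rps
    for s in order:
        # Each request into s generates downstream calls; accumulate them.
        for downstream, per_req in call_graph.get(s, {}).items():
            load[downstream] += load[s] * per_req
    return {s: math.ceil(load[s] / capacity[s]) for s in order}

# Hypothetical event: 10,000 req/s hits the gateway. Each request makes
# one auth call and two catalog calls; each catalog call makes three
# db-proxy calls.
graph = {"gateway": {"auth": 1.0, "catalog": 2.0},
         "catalog": {"db-proxy": 3.0}}
order = ["gateway", "auth", "catalog", "db-proxy"]
capacity = {"gateway": 500, "auth": 400, "catalog": 250, "db-proxy": 300}

plan = plan_replicas(10_000, graph, capacity, order)
# db-proxy sees 6x the entry load, so uniform scaling would either
# under-provision it or wildly over-provision everything else.
```

The toy numbers show why “scale everything uniformly” fails: `db-proxy` absorbs six calls for every request at the gateway, so it needs far more replicas than a uniform multiplier would give it.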
What are your thoughts on reducing SRE stress from managing infrastructure for events?