Region: United States

Industry: Biotechnology, genome testing

Optimizing genomic workloads & slashing idle spend with Smart Karpenter

GeneDX accelerates precision-medicine research with event-driven AI/ML pipelines that run on Kubernetes across Oracle Kubernetes Engine (OKE) and Azure Kubernetes Service (AKS). Two research teams share a multi-cloud platform that must scale quickly to meet tight turnaround-time (TAT) requirements for genomic analyses. 

Background

Team | Cloud | Pre-Smart Karpenter Workflow
Team A | OKE | Static node group sized for peak.
Team B | AKS | Identical pattern; idle nodes during troughs, slow starts during spikes.

Karpenter was not available on Oracle Cloud Infrastructure (OCI), so the OKE clusters had no concept of just-in-time nodes. Across the estate, scaling was reactive and capacity was over-provisioned.

Challenges

  1. Slow surge response: 5-minute pod queue times during traffic spikes.
  2. 30 % idle cloud spend: node pools padded to avoid cold starts.
  3. Manual threshold tuning: DevOps tweaked HPA/VPA every release.
  4. SLO risk: rising TAT threatened clinical commitments.

Solutions

Smart Karpenter fuses Avesha Smart Scaler with Karpenter to predict pod demand in advance and provision the exact nodes required.

Capability | Impact at GeneDX
Predictive pod scaling: RL models analyze latency, RPS, and service dependencies | Pods launch before a spike; queues disappear.
Dynamic node provisioning: predictions drive Karpenter for right-sized nodes | No idle padding; nodes spin up/down in < 60 s.
Observation → Optimize rollout | Two-week shadow run before full AI control.
Continuous learning | Scaling stays accurate as workloads evolve.
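
To make the link between a demand forecast and "right-sized nodes" concrete, here is a minimal sketch of the sizing arithmetic. It is not Avesha's reinforcement-learning model or Karpenter's bin-packing logic: the forecast is taken as a given number, and the names PodSpec, NodeShape, pods_needed, and nodes_needed, the 10 % headroom, and the single-resource (CPU-only) packing are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSpec:
    """Hypothetical pod profile; values would come from load tests."""
    cpu_millicores: int   # CPU request per pod
    rps_capacity: float   # requests per second one pod can serve


@dataclass(frozen=True)
class NodeShape:
    """Hypothetical node shape; only allocatable CPU is modeled."""
    name: str
    cpu_millicores: int


def pods_needed(forecast_rps: float, pod: PodSpec, headroom: float = 0.1) -> int:
    """Replicas required to serve the forecast plus a safety margin."""
    if forecast_rps <= 0:
        return 0
    raw = forecast_rps * (1 + headroom) / pod.rps_capacity
    # Round first so floating-point noise (22.000000000000004) does not
    # push an exact result up to the next integer.
    return math.ceil(round(raw, 9))


def nodes_needed(pod_count: int, pod: PodSpec, node: NodeShape) -> int:
    """Nodes required to place pod_count pods, packing by CPU request only."""
    pods_per_node = node.cpu_millicores // pod.cpu_millicores
    if pods_per_node == 0:
        raise ValueError("pod CPU request exceeds node allocatable CPU")
    return math.ceil(pod_count / pods_per_node)


if __name__ == "__main__":
    pod = PodSpec(cpu_millicores=2000, rps_capacity=50.0)
    node = NodeShape(name="example-16cpu", cpu_millicores=16000)
    replicas = pods_needed(forecast_rps=900.0, pod=pod)
    print(replicas, nodes_needed(replicas, pod, node))
```

With these illustrative inputs, a forecast of 900 RPS maps to 20 replicas and 3 nodes. A real scheduler also accounts for memory, daemonsets, and topology constraints, which this sketch ignores.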

Implementation Steps

  1. Helm install in “observe” mode; zero YAML changes.
  2. One-sprint traffic replay to train baseline models.
  3. 5 % → 100 % cut-over after three clean deployments.
  4. Cost guardrails: policy caps daily node-hours; Smart Karpenter throttles non-urgent jobs when budget nears limit.
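
The cost guardrail in step 4 can be sketched as a small admission check. This is an illustration of the idea, not Smart Karpenter's policy engine: the NodeHourBudget class, the 90 % throttle threshold, and the rule that urgent jobs are admitted up to the hard cap are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class NodeHourBudget:
    """Hypothetical daily node-hour budget with a soft throttle threshold."""
    daily_cap: float          # maximum node-hours per day (policy value)
    throttle_at: float = 0.9  # assumed: defer non-urgent work past 90 % of cap
    used: float = 0.0         # node-hours consumed so far today

    def record(self, node_hours: float) -> None:
        """Account for node-hours that have been consumed."""
        self.used += node_hours

    def admit(self, node_hours: float, urgent: bool) -> bool:
        """Decide whether a job needing node_hours may start now."""
        projected = self.used + node_hours
        if projected > self.daily_cap:
            return False  # hard cap applies to every job
        if not urgent and projected > self.daily_cap * self.throttle_at:
            return False  # near the limit: defer non-urgent jobs
        return True


if __name__ == "__main__":
    budget = NodeHourBudget(daily_cap=100.0)
    budget.record(85.0)
    print(budget.admit(10.0, urgent=False))  # deferred: 95 exceeds the 90 soft limit
    print(budget.admit(10.0, urgent=True))   # admitted: 95 is within the 100 cap
```

Keeping the soft threshold separate from the hard cap lets clinical, TAT-bound jobs continue while batch work waits for the next budget window.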

Results

KPI | Before | After Smart Karpenter | Delta
Average node CPU utilization | 48 % | 82 % | +71 %
Idle node-hours / month | 1 900 | 520 | −73 % waste
P95 pod queue time | 5 m 10 s | < 45 s | 6.8 × faster
SLO violations (job TAT) | 12 / month | 0 | 100 % compliance
Cloud compute spend | Baseline | −33 % | Savings fund new research lines
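
The Delta figures for the first three KPIs follow from the Before and After values. The short check below reproduces them; the helper name relative_change is introduced here for illustration. Note that the utilization delta is a relative change (82 % is about 71 % higher than 48 %), not a 71-point increase, and that the queue-time ratio uses 45 s as the upper bound for the "after" value.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from before to after, relative to before."""
    return (after - before) / before * 100


cpu_delta = relative_change(48, 82)       # average node CPU utilization
idle_delta = relative_change(1900, 520)   # idle node-hours per month
queue_speedup = (5 * 60 + 10) / 45        # 5 m 10 s versus the 45 s bound

if __name__ == "__main__":
    print(f"CPU utilization: {cpu_delta:+.1f} %")     # +70.8 %
    print(f"Idle node-hours: {idle_delta:+.1f} %")    # -72.6 %
    print(f"Queue time: {queue_speedup:.2f}x faster")  # 6.89x
```

The computed values round to the +71 % and −73 % shown in the table. The queue-time ratio is about 6.89, slightly above the table's 6.8 ×.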

Customer Voice

“Smart Karpenter makes Karpenter proactive. Nodes appear before the load hits and disappear immediately after, cutting a third of our cloud bill while keeping turnaround times rock-solid.” - Director of Genomic ML Platforms, GeneDX

Conclusion

With Smart Karpenter, GeneDX:

  • Achieves predictive, hands-free autoscaling across OKE and AKS.
  • Cuts idle node-hours by 73 % while boosting average node CPU utilization above 80 %.
  • Meets stringent diagnostic SLOs without manual tuning.
  • Gains a reinforcement-learning foundation for future hybrid-cloud growth.