Raj Nair
Founder & CEO
Prabhu Navali
VP of Product & Architecture
Olyvia Rakshit
VP Marketing & Product (UX)

How EGS transforms distributed GPU infrastructure into a unified, policy-driven AI Grid — with intelligent workload placement, DPU-accelerated connectivity, distributed inference, fine-tuning, and hardware-enforced multi-tenancy across every compute tier.
The NVIDIA AI Grid initiative envisions a world where GPU compute is no longer siloed in isolated clusters but flows intelligently across a unified, programmable fabric spanning devices, edges, telco networks, and hyperscaler clouds. Realizing this vision requires an orchestration layer that goes far beyond traditional Kubernetes: one that understands GPU topology, workload latency profiles, multi-tenant isolation requirements, and elastic demand across geographically dispersed sites.
Avesha's Elastic Grid Service (EGS) is purpose-built for exactly this role. Deployed across the Telco Cloud Continuum, spanning Far Edge, Near Edge, Core, and AI Factory tiers, and architected for enterprise edge-to-data-center-to-neocloud scenarios, EGS acts as the AI Application Workload Router: a cross-site, cross-domain orchestration engine that treats GPU resources as a unified, policy-governed elastic pool.
"The Telco Cloud Continuum has moved from theory to operational reality. We are no longer managing servers — we are orchestrating intelligence." — Telenor, MWC 2026
EGS is organized around four foundational pillars: intelligent workload routing and placement, intelligent GPU sharing, hardware-enforced multi-tenancy, and comprehensive FinOps observability. Architecturally, a central EGS Controller cluster governs scheduling, policy, and inventory, while lightweight EGS Worker agents execute on every cluster across the continuum.
Core Components

At the heart of EGS are its workload routing and placement (WP) engine and the workload-associated GPU Provision Request (GPR), the fundamental resource allocation primitive that abstracts physical GPU capacity from application logic. A GPR specifies GPU type, memory, count, cluster/tier affinity, data sovereignty, priority tier (Low / Medium / High), duration, and redistribution policy. GPRs flow through a priority queue managed by the EGS Controller, which applies priority-with-fairness and max-min fairness algorithms to allocate physical GPU capacity across competing workloads and tenants.
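As a mental model, a GPR is a small record feeding a tier-ordered queue. The Python sketch below is illustrative only, not the EGS API: the field names are assumptions, and simple FIFO-within-tier stands in for the actual priority-with-fairness and max-min fairness algorithms.

```python
import heapq
from dataclasses import dataclass

# Tier rank: High-tier requests drain before Medium, Medium before Low.
TIER_RANK = {"High": 0, "Medium": 1, "Low": 2}

@dataclass
class GPR:
    name: str
    gpu_type: str        # e.g. "A100-80GB" (illustrative shape name)
    gpu_count: int
    tier_affinity: str   # "far-edge", "near-edge", "core", or "any"
    priority_tier: str   # "High" | "Medium" | "Low"
    duration_min: int    # requested lease duration in minutes

class GPRQueue:
    """Priority queue the controller drains when allocating GPU capacity.

    An arrival sequence number breaks ties, giving FIFO within a tier."""
    def __init__(self):
        self._heap = []
        self._seq = 0

    def submit(self, gpr: GPR) -> None:
        heapq.heappush(self._heap, (TIER_RANK[gpr.priority_tier], self._seq, gpr))
        self._seq += 1

    def next(self) -> GPR:
        return heapq.heappop(self._heap)[-1]
```

Submitted low-priority batch GPRs wait behind any high-priority request, regardless of arrival order.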
EGS classifies every workload by placement policy. Pre-defined workloads (e.g., latency-sensitive video
transcoding, AI-for-RAN functions) are pinned to a specific tier — typically Far Edge — where sub-10ms proximity
to data sources is non-negotiable. Redistributable workloads (e.g., LLM inference servers, batch IDP jobs) can
run at any available tier and are automatically migrated by EGS when local capacity is exhausted.
When a redistributable workload cannot be scheduled due to GPU saturation, EGS activates capacity chasing. It
scans the unified GPU inventory across all clusters in scope, selects the optimal destination based on priority, wait
time, GPU shape compatibility, policies, and network latency, and provisions the workload there — typically within
30 seconds, with zero manual intervention. The workspace overlay network preserves unified service connectivity
throughout the migration.
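Conceptually, capacity chasing is a filter-then-rank pass over the unified inventory. In this hedged sketch, the cluster fields, the 50 ms default latency budget, and the lowest-latency ranking are illustrative assumptions; the production scorer also weighs priority and wait time.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cluster:
    name: str
    tier: str
    free_gpus: dict       # GPU shape -> free count, e.g. {"A100-80GB": 4}
    latency_ms: float     # network latency from the requesting site
    zone: str             # data-sovereignty zone

def chase_capacity(clusters, shape, count, allowed_zones,
                   max_latency_ms=50.0) -> Optional[Cluster]:
    """Pick the lowest-latency cluster satisfying shape, count, and policy."""
    candidates = [
        c for c in clusters
        if c.free_gpus.get(shape, 0) >= count   # GPU shape compatibility
        and c.zone in allowed_zones             # sovereignty / placement policy
        and c.latency_ms <= max_latency_ms      # latency budget
    ]
    return min(candidates, key=lambda c: c.latency_ms, default=None)
```

When no cluster passes the filters, the request stays queued rather than being placed somewhere policy-violating.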
EGS implements a three-tier priority system (High: 1–300, Medium: 1–200, Low: 1–100). When a high-priority
workload requires capacity occupied by a lower-priority batch job, EGS preempts the lower-priority GPR: it evicts
the batch workload, health-checks and memory-clears the GPU, then reallocates it to the high-priority tenant. For
AI-RAN workloads, where network function AI models must maintain radio network performance, EGS
supports preemptive prioritization, ensuring telco AI services always receive reserved compute.
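The preemption path can be summarized as a guard plus an ordered sequence of steps. This is a schematic sketch; the function names are hypothetical stand-ins for the controller's internal operations.

```python
# Priority rank used only for the preemption guard in this sketch.
PRIORITY_RANK = {"Low": 0, "Medium": 1, "High": 2}

def may_preempt(incoming_tier: str, running_tier: str) -> bool:
    """An incoming GPR may evict only a strictly lower-priority occupant."""
    return PRIORITY_RANK[incoming_tier] > PRIORITY_RANK[running_tier]

def preempt(gpu: str, victim: str, claimant: str) -> list:
    """Ordered preemption steps described in the text: evict the batch
    workload, health-check the GPU, clear its memory, then reallocate."""
    return [
        ("evict", victim),
        ("health_check", gpu),
        ("memory_clear", gpu),
        ("allocate", claimant),
    ]
```

The memory-clear step between tenants is what keeps preemption compatible with strict multi-tenant isolation.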
Metric                                   Result
---------------------------------------  ----------------------
Cross-Tier Placement Accuracy            100%
Capacity Chasing Latency                 < 30 seconds
LLM Burst Scale-Out (vLLM HPA)           < 90 seconds
Manual Intervention Required             Zero (fully automated)
GPU Utilization Increase vs. Baseline    +30–45%
Idle GPU Time Reduction                  > 40%
Distributed inference — running AI models as close to the data source as latency demands while elastically
bursting to higher-tier clusters — is the central value proposition of EGS in an AI Grid. Three complementary
mechanisms enable this.
An EGS Workspace is a secure, isolated AI service tenant. Each workspace receives dedicated Kubernetes
namespaces, workspace-scoped GPU access via GPRs, and its own WireGuard-encrypted, BlueField
DPU-enabled L3 VPN overlay network spanning all clusters associated with that workspace. Inference workloads within
a workspace communicate with each other as if co-located — even when distributed across telco tiers or across
enterprise tiers. EGS continuously monitors for isolation breaches and unauthorized GPU access events across
all concurrent workspaces.
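The breach monitoring can be pictured as a join between GPU-access events and GPR ownership: any access from a workspace that does not own the GPR bound to that GPU is flagged. The event shape below is a hypothetical simplification for illustration.

```python
def isolation_breaches(events, gpr_owner):
    """Flag GPU-access events whose workspace does not own the GPR
    currently bound to that GPU.

    events    -- list of {"gpu": ..., "workspace": ...} access records
    gpr_owner -- mapping of GPU id -> owning workspace (from active GPRs)
    """
    return [
        e for e in events
        if gpr_owner.get(e["gpu"]) != e["workspace"]
    ]
```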
EGS integrates directly with Kubernetes Smart Scaler (HPA) signals. When a scale-out event is triggered and
local Far Edge GPUs are saturated, EGS uses workload placement to burst new LLM/vLLM inference replicas
to clusters in other tiers. This cross-tier burst is fully transparent to the application: the service endpoint remains
consistent, and the workspace overlay routes requests to replicas across all participating clusters. These
cross-tier bursts reduce SLA violations.
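A minimal sketch of the burst decision, assuming a local-first preference and a greedy spill to the remote clusters with the most free GPUs (both are assumptions; EGS's placement also weighs latency and policy):

```python
def place_replicas(desired: int, local_free: int, remote_free: dict):
    """Fill locally first, then burst the overflow to remote clusters
    with the most free capacity. Returns (placement, unplaced)."""
    placement = {"local": min(desired, local_free)}
    overflow = desired - placement["local"]
    # Greedy spill: largest remote free pool first.
    for cluster, free in sorted(remote_free.items(), key=lambda kv: -kv[1]):
        if overflow == 0:
            break
        take = min(overflow, free)
        if take:
            placement[cluster] = take
            overflow -= take
    return placement, overflow
```

A nonzero `unplaced` count signals that the grid itself is saturated and the request must queue.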
Not every workload is latency-critical. For Intelligent Document Processing (IDP) using LLMs (batch invoice
processing, contract analysis, claims review), EGS Time-Slicing provisions one or more GPUs across multiple
workspaces on a round-robin or fair-share schedule. EGS manages eviction, re-queue, and re-provisioning
automatically. In one deployment, m independent IDP workspaces shared n GPUs split between high-priority
workloads (guaranteed access) and time-sliced workloads, driving a 30–45% utilization increase versus
dedicated allocations.
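The round-robin variant can be sketched in a few lines: holders of the shared GPUs are evicted at the end of each slice and re-queued at the back. Workspace names and slice granularity here are illustrative.

```python
from collections import deque
from itertools import islice

def time_slice_schedule(workspaces, n_gpus, n_slices):
    """Return, per slice, which workspaces hold the n shared GPUs.

    Holders are evicted at the end of each slice and re-queued at the
    back of the ring for round-robin fairness."""
    ring = deque(workspaces)
    schedule = []
    for _ in range(n_slices):
        k = min(n_gpus, len(ring))
        schedule.append(list(islice(ring, 0, k)))
        ring.rotate(-k)   # eviction + re-queue
    return schedule
```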
Multi-cluster distributed inference and fine-tuning workflows demand low-latency, high-throughput, and
cryptographically secure east-west connectivity between workload components across tiers. Traditional
software-based VPN gateways running on host CPUs consume significant compute resources that should be
reserved for AI workloads.
EGS is evolving its overlay network architecture to offload the VPN gateway function entirely to NVIDIA
BlueField-3 DPUs. This DPU-native VPN Gateway Service delivers the cryptographically secure, high-throughput
east-west connectivity described above while returning host CPU cycles to AI workloads.
A critical challenge in distributed inference or fine-tuning workflows across a multi-tier AI Grid is data gravity: AI
models and inference datasets must be present at each cluster before workloads can be scheduled there.
On-demand model transfer at burst time introduces unacceptable cold-start latency — particularly for large
models.
EGS integrates with global distributed file systems and object storage — including S3-compatible stores and CSI
enabled cluster-native equivalents — to pre-position model weights, fine-tuned adapters, and inference datasets
across clusters before Capacity Chasing events are triggered. In a typical deployment, pre-compiled inference
models are stored in Persistent Volume Claims (PVCs) at the Far Edge, Near Edge, and other tiers. When
Capacity Chasing migrates the inference server to a different tier, such as Near Edge, the model is already
available, enabling seamless workload migration with zero cold-start latency.
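A back-of-envelope way to see why pre-positioning matters: the cold-start penalty is the model transfer time, which drops to zero when the artifact is already staged in a PVC at the target. The model name, size, and transfer rate below are hypothetical.

```python
def cold_start_minutes(model: str, target: str, staged: dict,
                       size_gb: float, transfer_min_per_gb: float) -> float:
    """0 if the model is already staged in a PVC at the target cluster,
    otherwise the full transfer penalty paid before scheduling.

    staged maps cluster name -> set of model names pre-positioned there.
    """
    if model in staged.get(target, set()):
        return 0.0
    return size_gb * transfer_min_per_gb
```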
EGS enforces placement policies that respect jurisdiction boundaries during data pre-positioning. Model data
pre-positioned within sovereign cluster sets is never routed outside those boundaries during normal operation.
When workspace overlay slices connect storage in one cluster to a GPU workload in another, traffic remains
within the cryptographically isolated workspace overlay — never traversing the public internet.
When a model is updated centrally in the AI Factory, EGS enables worker clusters at other tiers to pull updates
through the secure overlay rather than via external internet paths. Storage-tier isolation per workspace ensures
that tenant A's model data stored in a storage cluster is cryptographically segmented from tenant B's artifacts —
enforced at both the network policy layer and the workspace RBAC layer.
At various tiers, GPU nodes are physically constrained and expensive. EGS enables true GPU infrastructure
sharing across tenants, with hardware-enforced node and network isolation that satisfies
enterprise and telco-grade security mandates.
EGS leverages the NVIDIA BlueField-3 DPU and DOCA Platform Framework to offload OVN-Kubernetes
processing entirely from the host CPU to the DPU's ARM cores. In this architecture, standard Open
vSwitch (OVS) is disabled on the host OS and all switching and routing is handled by the DPU, which frees
host CPU cores for tenant workloads and enforces the multi-tenant isolation boundary in hardware rather
than in host software.
EGS enforces a zero-trust security model across the entire continuum. Every inter-cluster channel is
authenticated via DPU-offloaded PSP Gateway. Network policies prevent pods in one workspace from accessing
pods in another. Air-gap and classified-mode operations are supported for sovereign missions where the
management plane must remain fully offline.
The NVIDIA AI Grid vision requires a control plane that spans tiers, enforces policy, and automates the entire
lifecycle of GPU workloads from request to release — across enterprise edge, data center, and telco cloud
continuum environments. Avesha EGS delivers exactly this.
By leveraging intelligent workload routing & placement, capacity chasing, distributed inference, fine-tuning across
workspace slices, DPU-accelerated high-speed multi-cluster connectivity, global model and data pre-positioning
via integrated file systems, and hardware-enforced BF/DPU workspace isolation, the EGS platform effectively
unifies disparate GPU infrastructure into a cohesive AI Grid fabric.
For operators and enterprises, the strategic window to establish infrastructure leadership in distributed AI is open
now. EGS provides the validated workload orchestration layer to capture that opportunity — anchored by the
physical infrastructure and sovereign compute that only edge-native operators control.
Learn more: https://docs.avesha.io/documentation/enterprise-egs/1.17.0