AIBrix: An Open-Source, Large-Scale LLM Inference Infrastructure for System Research
ASPLOS 2025 Tutorials
AIBrix Team Introduction
Agenda
Introducing AIBrix: A Testbed for Large-Scale LLM Inference System Research
AIBrix Overview
“AIBrix is an open-source LEGO set for building enterprise-grade LLM infra without duct-taping GPUs to your server rack.”
- by Mitko Vasilev, CTO, Mitko X
Why LLM Inference Systems are Challenging for Systems Researchers
Am I making the right assumptions about my systems?
Am I solving “the right problem”?
How well does my solution work in a realistic LLM stack?
LLM Inference: A Layered View of Architectural Challenges & Solutions
[Layered architecture diagram: OpenAI-compatible API → API Gateway → Model Service (containers) → Distributed Cache Service (KV caches) → Models & Adapters]
Additional features (https://github.com/vllm-project/aibrix):
Challenge #1: Resource Efficiency
A single black-box metric is not enough: different metrics give different indications of resource demand.
Fast-evolving accelerators create the need for resource heterogeneity.
Challenge #2: Cache/Load -aware Routing
Complex cache-reuse patterns require a routing strategy that is both locality- and load-aware.
Challenge #3: KV Cache Offloading for Cross-Engine KV Reuse
Offloading the KV cache improves GPU compute efficiency.
Challenge #4: Multi-LoRA Management
[Diagram: multiple model:7b serving instances, each hosting LoRA adapters]
Introducing AIBrix: Scalable & Cost-efficient LLM Inference
[AIBrix architecture diagram: OpenAI-compatible API → API Gateway → Model Service (containers) → Distributed Cache Service (KV caches) → Models & Adapters]
AIBrix Extendable API: Enable Flexible LoRA Placement
[AIBrix architecture diagram (as above), highlighting the LoRA placement extension point]
def schedule_lora_model_adapter(context, model, instances):
Chooses the best instance from instances on which to schedule the LoRA model adapter.
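A minimal sketch of what a custom policy behind this hook could look like; the base_model, running_adapters, and free_gpu_memory_mb fields are hypothetical, not the actual AIBrix instance schema:

```python
def schedule_lora_model_adapter(context, model, instances):
    """Illustrative placement policy for a new LoRA adapter.

    Prefer instances that already serve the same base model, then the one
    with the fewest resident adapters, breaking ties by free GPU memory.
    All field names are hypothetical.
    """
    candidates = [i for i in instances if i.base_model == model.base_model]
    if not candidates:
        return None  # no compatible instance; the caller may scale up
    return min(
        candidates,
        key=lambda i: (len(i.running_adapters), -i.free_gpu_memory_mb),
    )
```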
AIBrix Extendable API: Enable Cache/Load-Aware Routing
[AIBrix architecture diagram (as above), highlighting the routing extension point]
def route_request(context, prompt, kv_info, instances, model):
Routes the request to one of the instances, given the KV-cache matching result (kv_info) and load information.
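A minimal sketch of a policy behind this hook, assuming kv_info maps instance IDs to the number of already-cached prompt tokens and each instance exposes a pending_requests counter (both hypothetical names): reward prefix reuse, penalize load.

```python
def route_request(context, prompt, kv_info, instances, model):
    """Illustrative cache- and load-aware routing policy."""
    prompt_len = max(len(prompt.split()), 1)  # rough token count

    def score(instance):
        matched = kv_info.get(instance.id, 0)   # cached prompt tokens there
        prefix_ratio = matched / prompt_len     # 0.0 .. 1.0
        load_penalty = instance.pending_requests / 100.0
        return prefix_ratio - load_penalty

    return max(instances, key=score)
```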
AIBrix Extendable API: Enable Autoscaling
[AIBrix architecture diagram (as above), highlighting the autoscaling extension point]
def compute_target_replicas(cur_instance_count, scaling_context):
Calculates the number of replicas needed based on current metrics and the provided scaling specification.
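A minimal sketch of a scaling rule this hook could host, assuming scaling_context exposes the current and target values of a chosen metric plus replica bounds (hypothetical field names); it follows the standard target-tracking formula also used by the Kubernetes HPA.

```python
import math

def compute_target_replicas(cur_instance_count, scaling_context):
    """Target tracking: desired = ceil(current_replicas * metric / target)."""
    metric = scaling_context.current_metric_value  # e.g., mean gpu_cache_usage_perc
    target = scaling_context.target_metric_value   # e.g., 50
    desired = math.ceil(cur_instance_count * metric / target)
    return min(max(desired, scaling_context.min_replicas),
               scaling_context.max_replicas)
```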
AIBrix Open-Source Status
Call for Collaboration
LLM-Tailored Autoscaling: Leveraging LLM-specific Metrics and Embracing Resource Heterogeneity
Presenter: Ning Wang
Autoscaling Introduction
https://blog.px.dev/autoscaling-custom-k8s-metric/
LLM Requests
Challenge 1: LLM Autoscaling Metrics
[Plots: latency increases while QPS stays unchanged, and latency increases while SM Active stays unchanged — neither QPS nor SM Active alone reflects LLM load]
Challenge 2: Heterogeneous GPU Resource Utilization
Griggs, Tyler, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. "Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity." arXiv preprint arXiv:2404.14527 (2024).
Challenge 3: Scaling Algorithms
Scaling signals: current usage vs. predicted usage vs. average usage
Optimizer-based Autoscaling
[Diagram: per-request GPU consumption profile (e.g., 2, 1, 2, 4, 1 units) packed by the optimizer onto GPU 1 and GPU 2]
Input-output Specific GPU Benchmarking
Each cell is benchmarked at arrival rates 1, 2, 4, ..., 64.

| Output \ Input tokens | 128 | 256 | 512 | ... |
| --- | --- | --- | --- | --- |
| 4 | TPUT: 1.08, 2.16, 4.31, ..., 48.20; E2E P99: 0.16, 0.20, ..., 2.06; TTFT P99: 0.06, 0.08, ..., 0.59; TPOT: ... | ... | ... | ... |
| 8 | ... | ... | ... | ... |
| 16 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
Input-output Specific GPU Profiling
Throughput by output tokens (rows) and input tokens (columns):

| Output \ Input tokens | 128 | 256 | 512 | ... |
| --- | --- | --- | --- | --- |
| 4 | 31.42 | 16.69 | 8.52 | ... |
| 8 | 16.62 | 16.38 | 8.42 | ... |
| 16 | 16.07 | 8.37 | 8.08 | ... |
| ... | ... | ... | ... | ... |
| Arrival rate | Throughput | Note |
| --- | --- | --- |
| 1 | 1.07 | Unsaturated GPU capacity |
| 2 | 2.14 | Unsaturated GPU capacity |
| 4 | 4.21 | Unsaturated GPU capacity |
| 8 | 8.08 | Unsaturated GPU capacity with no SLO deterioration over time (profile candidate) |
| 16 | 14.13 | Saturated GPU capacity with little possible SLO deterioration over time |
| 32 | 19.33 | Saturated GPU capacity with large SLO deterioration over time |
| 64 | 20.68 | Saturated GPU capacity with large SLO deterioration over time |
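A sketch of how such profiles can feed heterogeneous autoscaling, under the assumption that for each GPU type we record the highest arrival rate it sustains without SLO deterioration (the "profile candidate" row above) and an hourly price. The GPU names, rates, and costs below are illustrative, and a real optimizer would mix GPU types (e.g., via an ILP) rather than pick a single type greedily.

```python
import math

# Hypothetical per-GPU profile for one (input tokens, output tokens) bucket.
GPU_PROFILE = {
    "A10":  {"rate_per_gpu": 8.0,  "cost_per_hour": 1.0},
    "L40S": {"rate_per_gpu": 20.0, "cost_per_hour": 2.2},
}

def cheapest_single_type_fleet(request_rate):
    """Return {gpu_type: count} for the cheapest single-GPU-type fleet."""
    best = None
    for gpu, p in GPU_PROFILE.items():
        count = math.ceil(request_rate / p["rate_per_gpu"])
        cost = count * p["cost_per_hour"]
        if best is None or cost < best[2]:
            best = (gpu, count, cost)
    gpu, count, _ = best
    return {gpu: count}

# cheapest_single_type_fleet(30.0) -> {"A10": 4}  (4 x $1.0 < 2 x $2.2)
```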
Heterogeneous Autoscaling Overall Design
Autoscaling
Autoscaling Experiments
[Experiment plots: gpu_cache_usage_perc = 50 vs. gpu_cache_usage_perc = 70]
Practical Challenges in Optimizer-based Autoscaling
Evaluation Challenges
Herbst, Nikolas Roman, Samuel Kounev, and Ralf Reussner. "Elasticity in Cloud Computing: What It Is, and What It Is Not." In 10th International Conference on Autonomic Computing (ICAC 13), pp. 23-27. 2013.
Zhong, Yinmin, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. "DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving." In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193-210. 2024.
Reducing Inference Bottlenecks in Shared Prompt Environments with Prefix Caching
Presenter: Gangmuk Lim
Routing in LLM inference
*SGLang: Efficient Execution of Structured Language Model Programs (NeurIPS 2024)
*Locality-aware Fair Scheduling in LLM Serving
*Preble: Efficient Distributed Prompt Scheduling for LLM Serving
AIBrix can be easily extended!
For example, the Preble routing algorithm is already implemented in AIBrix.
Routing in LLM inference: KV cache
[Diagram: KV cache for the prompt "I like apple" — each token's query Q attends over the cached keys K and values V of "I", "like", and "apple"]
Example: Prefix-aware
[Diagram: Node 1 (load 50%) holds the KV cache for "I like apple"; Node 2 (load 30%) holds no matching cache. Incoming prompts "I like apple very much", "I like apple", and "I like orange" see prefix matches of 100%, 50%, and 0% depending on the prompt and the node]
Architecture of AIBrix Router
[Diagram: request → Envoy Gateway → AIBrix router (replicated) computes a target-pod-ip → Pod 1-4]
Easily extensible (prefix-aware routing, GPU-utilization routing, etc.): a new routing algorithm requires changing only the policy logic that computes the routing rule, nothing else.
Experiment Results
[Plots: # prefix tokens = 1000 vs. # prefix tokens = 8000]
Takeaway:
Simple prefix-aware routing is not sufficient; routing must jointly consider load and prefix to make the best decision.
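A toy illustration of this takeaway with made-up numbers: a purely prefix-greedy policy sends the request to the hot pod, while a score that also weighs load does not.

```python
# Hypothetical candidates: prefix-match ratio and current load in [0, 1].
pods = {
    "pod-1": {"prefix_match": 1.0, "load": 0.9},  # hot pod, full prefix hit
    "pod-2": {"prefix_match": 0.0, "load": 0.1},  # cold pod, no prefix hit
}

def prefix_only(p):
    return p["prefix_match"]

def prefix_and_load(p, alpha=0.3):
    # alpha trades cache reuse against queueing delay at the hot pod
    return alpha * p["prefix_match"] - (1 - alpha) * p["load"]

print(max(pods, key=lambda k: prefix_only(pods[k])))      # pod-1 (overloaded)
print(max(pods, key=lambda k: prefix_and_load(pods[k])))  # pod-2
```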
Prefix-aware technical details: different data structures
[Diagram: the prompt "I like apple very much" indexed two ways — a hash table over fixed-size KV blocks (block granularity, e.g., blocks "I like", "apple very", "much") versus a radix tree over tokens (token granularity, sharing the "I like" prefix with branches such as "apple" → "very much" / "pie" and "orange")]
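A minimal sketch of the block-granularity (hash table) variant, in the spirit of block-hash prefix caching: each fixed-size block is hashed together with its prefix, so two prompts share a block hash only if they share the entire prefix up to that block. Block size and helper names are illustrative.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

def block_hashes(token_ids):
    """Hash each full block chained with its prefix."""
    hashes, running = [], hashlib.sha256()
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        running.update(str(block).encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes

def matched_blocks(prompt_tokens, pod_block_index):
    """Count leading blocks of the prompt already cached on a pod.

    pod_block_index: set of block hashes the router believes are cached there.
    """
    count = 0
    for h in block_hashes(prompt_tokens):
        if h not in pod_block_index:
            break
        count += 1
    return count
```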
Technical challenge 1: Highly available router (scaling)
[Diagram: request → Envoy Gateway → multiple AIBrix router replicas → Pod 1-4; each router replica keeps its own copy of the prefix-tree state ("I like" → "apple" / "orange" / "very much" / "pie")]
Each gateway/router replica would hold a large amount of state:
| Avg hit ratio | Avg token length | Num prompts | State size |
| --- | --- | --- | --- |
| 0.2 | 100 | 20 | 61.71 KB |
| 0.2 | 1,000 | 50 | 1.52 MB |
| 0.2 | 10,000 | 100 | 30.51 MB |
| 0.2 | 10,000 | 1,000 | 305.09 MB |
| 0.2 | 10,000 | 10,000 | 2.98 GB |
Technical challenge 2: Synchronizing the KV cache state across different layers
[Diagram: request → Envoy Gateway → AIBrix router replicas → Pod 1-4, each with its own engine cache, plus a separate distributed KV cache tier running in its own pods]
Each engine instance (e.g., vLLM) maintains its own cache.
Technical challenge 2 (cont.): Synchronizing the KV cache state across different layers
[Diagram: same architecture as above — Envoy Gateway, AIBrix router replicas, engine caches per pod, and the distributed KV cache tier]
What about a large-scale cluster?
This is another scalability issue!
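One hypothetical way to attack this (an illustration, not AIBrix's actual mechanism): let every router replica publish block-hash-to-pod mappings into a shared store such as Redis with a TTL, so all replicas resolve prefixes against one source of truth instead of diverging local trees.

```python
import redis  # assumes a reachable Redis instance; illustrative only

r = redis.Redis(host="prefix-index", port=6379)

def record_blocks(pod_id, block_hashes, ttl_s=300):
    """Called by whichever router replica handled the request."""
    pipe = r.pipeline()
    for h in block_hashes:
        pipe.sadd(f"kvblock:{h}", pod_id)
        pipe.expire(f"kvblock:{h}", ttl_s)  # stale entries age out
    pipe.execute()

def pods_with_block(block_hash):
    """Any router replica can resolve a block hash to candidate pods."""
    return {m.decode() for m in r.smembers(f"kvblock:{block_hash}")}
```

The trade-off is an extra network hop per lookup and eventual consistency with the engines' real caches, which is exactly the synchronization problem this slide raises.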
Open Challenges
Heterogeneous GPUs
LoRA
Multi-tenancy (fairness)
SLO
KV Cache Offloading for Cross-Engine KV Reuse
Presenter: Haiyang Shi
KV Cache Recap
Adapted from https://huggingface.co/blog/kv-cache-quantization
*: when processing token[K], we only need the K'th row of Q.
**: when processing token[K], we require the full K & V tensors, but we can reuse the cached values, skipping the recomputation of K & V.
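A minimal NumPy sketch of that recap: at decode step K we compute only the new query row, append the new token's key/value to the cache, and attend over the full cached K and V instead of recomputing them (single head, no masking, for illustration).

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """x_new: (1, d_model); k_cache/v_cache: (t, d_head) from prior tokens."""
    q_new = x_new @ W_q                           # only the K'th row of Q
    k_cache = np.vstack([k_cache, x_new @ W_k])   # reuse cached K, append one row
    v_cache = np.vstack([v_cache, x_new @ W_v])   # reuse cached V, append one row
    scores = (q_new @ k_cache.T) / np.sqrt(k_cache.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache    # (1, d_head) attention output
```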
Key Challenges
Scenario 1 - Prefix Cache (MLSys'24 PromptCache)
Key Challenges
Scenario 2 - Request Migration (OSDI'24 ServerlessLLM)
Key Challenges
Lower-End GPU Instance
Key Challenges
High-End GPU Instance
KV Cache Architectures
From Single Node GPU Cache => Distributed and Disaggregated KV Cache
AIBrix KV Cache Offloading Framework
Pluggable Eviction Policies
Pluggable Marshallers
Pluggable Connectors
RFC: https://github.com/vllm-project/vllm/issues/14724
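A hypothetical sketch of what "pluggable" could mean here (illustrative interface names, not the framework's real classes): eviction policies and storage connectors sit behind small interfaces, so adding a new backend only means implementing get/put.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict

class EvictionPolicy(ABC):
    @abstractmethod
    def touch(self, key): ...          # record an access
    @abstractmethod
    def evict(self) -> str: ...        # pick a victim key

class LRUPolicy(EvictionPolicy):
    def __init__(self):
        self._order = OrderedDict()
    def touch(self, key):
        self._order.pop(key, None)
        self._order[key] = True        # move to most-recently-used end
    def evict(self):
        key, _ = self._order.popitem(last=False)
        return key

class Connector(ABC):
    """Backend that stores serialized KV blocks (local SSD, remote cache, ...)."""
    @abstractmethod
    def put(self, key: str, blob: bytes): ...
    @abstractmethod
    def get(self, key: str): ...       # returns bytes or None
```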
Multi-LoRA Management in Production Environment
Presenter: Jiaxin Shan
Dense LoRA: Powering Efficient Finetuning at Scale
W_updated = W + ΔW
W_updated = W + A·B
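A two-line NumPy check of the update rule above, with the usual LoRA shapes where the low-rank factors A (d x r) and B (r x k) are far smaller than W (rank r much smaller than d and k); the dimensions are illustrative.

```python
import numpy as np

d, k, r = 4096, 4096, 16          # hidden dims and LoRA rank (illustrative)
W = np.random.randn(d, k)
A = np.random.randn(d, r)
B = np.random.randn(r, k)
W_updated = W + A @ B             # W_updated = W + delta_W, delta_W = A·B
# The adapter stores only A and B: d*r + r*k parameters vs. d*k for full delta_W.
```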
Volume
Density
Traffic skew across models: some are critical production online models, while others are experimental with near-zero usage.
High Density Deployment Challenges at Scale
[Diagram: many model:7b instances, each packing multiple LoRA adapters]
Challenges:
1. LoRA is a container-internal manifest; it breaks the Kubernetes Pod design and makes service discovery even harder.
2. Request routing and LoRA scheduling become challenging.
3. LoRA adapters compete for resources with tenants, so contention problems exist.
Case Study: Text2SQL
Text2TLS - a SQL-like language with specific syntax
AIBrix is deployed on VKE with 12 NVIDIA A10 GPUs. This setup substantially boosts our infrastructure's processing capacity, enabling support for more AI-driven code services, such as DBW testing and TLS production tasks through Bytebrain.
Text2SQL - Traditional SQL
50% GPU cost reduction!
LoRA Integration with vLLM
[RFC]: Enhancing LoRA Management for Production Environments in vLLM
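As a usage illustration, assuming a vLLM server launched with LoRA support and runtime adapter updates enabled (so the dynamic load/unload adapter endpoints are available; verify the exact flags and paths against the vLLM docs), an adapter can be registered at runtime and then selected per request by model name. The server address and adapter path below are hypothetical.

```python
import requests

VLLM = "http://localhost:8000"  # hypothetical vLLM server address

# Register a LoRA adapter at runtime (requires runtime LoRA updating enabled).
requests.post(f"{VLLM}/v1/load_lora_adapter", json={
    "lora_name": "text2sql-v1",                    # hypothetical adapter name
    "lora_path": "/models/adapters/text2sql-v1",   # hypothetical local path
})

# Requests then pick the adapter via the OpenAI-compatible "model" field.
resp = requests.post(f"{VLLM}/v1/completions", json={
    "model": "text2sql-v1",
    "prompt": "List all users created last week",
    "max_tokens": 64,
})
print(resp.json())
```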
LoRA-at-scale efficiency
AIBrix reduces resource cost by 4.7x when the application's workload is sparse, and still maintains 1.5x resource savings when the load is high.
New Challenge: When should we use merged weights, and when unmerged weights?
Related work: OSDI'24 dLoRA
High-density LoRA deployment with the right batch size provides cost reductions ranging from 8% to 2.1x.
New Challenge: How do we prioritize latency-sensitive requests and address resource contention?
Open Research Challenges in LLM Inference Systems
Presenter: Jiaxin Shan
Open Research Challenges
1. High-density/serverless serving for LLM foundation models, or adaptive GPU multiplexing (ServerlessLLM, Prism)
2. Colocation of online and batch requests in LLM serving systems (BatchLLM, ConServe)
3. Heterogeneous placement and routing; compute resource standardization (Mélange)
4. Unified LLM routing that jointly considers fairness, heterogeneity, LoRA, prefix cache, SLO, load, etc. (VTC, D2LPM, Preble)
5. Application-aware optimization (Autellix)
6. Large-scale simulation, output latency prediction, and roofline model analysis (Vidur, LLMViewer)
Thank You
Prefix-aware technical details
AIBrix implementation architecture
Key Innovations: Enhancing Performance and Reducing Costs
Introduction: The Need for Scalable LLM Infrastructure
Conclusion: Summary & Future Work