1 of 61

AIBrix: An Open-Source, Large-Scale LLM Inference Infrastructure for System Research

ASPLOS 2025 Tutorials

2 of 61

AIBrix Team Introduction

3 of 61

Agenda

  • Session 1 (2:00PM - 3:30PM)
    • Introducing AIBrix: A Testbed for Large-Scale LLM Inference System Research
    • LLM-Tailored Autoscaling: Leveraging LLM-specific Metrics and Embracing Resource Heterogeneity
    • Reducing Inference Bottlenecks in Shared Prompt Environments with Prefix Caching
    • KV Cache Offloading for Cross-Engine KV Reuse
    • Multi-LoRA Management in Production Environment
    • Open Research Challenges in LLM Inference Systems

  • Session 2 (4:00PM - 5:30PM)
    • Hands-on AIBrix Feature Demo in AWS Studio Workshop

4 of 61

Introducing AIBrix: A Testbed for Large-Scale LLM Inference System Research

5 of 61

AIBrix Overview

  • Why launch AIBrix (motivation)?
    • Deploying LLMs at scale is not easy for everyone.
  • What is AIBrix?
    • A cloud-native solution optimized for deploying, managing, and scaling LLM inference, co-designed with inference engines.
    • It is developed by ByteDance and open-sourced as a vLLM project.
  • Why open-source AIBrix?
    • Open collaboration for an open and accessible AI infrastructure.
  • Why adopt AIBrix?
    • Simplicity, usability, scalability, and performance.

“AIBrix is an open-source LEGO set for building enterprise-grade LLM infra without duct-taping GPUs to your server rack.”

- by Mitko Vasilev, CTO, Mitko X

6 of 61

Why LLM Inference Systems are Challenging for Systems Researchers

Am I making the right assumptions about my systems?

Am I solving “the right problem”?

How well does my solution work in a realistic LLM stack?

7 of 61

LLM Inference: A Layered View of Architectural Challenges & Solutions

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

  • Resource/KV Cache/Heterogeneity-aware Routing
  • Fairness and Load Control
  • Dynamic Batching
  • High-Density Adapter Management
  • Model/Adapter Provisioning
  • Unified Request API
  • Autoscaling/Multi-Node Orchestration
  • Cache-awareness
  • Distributed Storage Design

Additional features (https://github.com/vllm-project/aibrix):

  • Unified AI Runtime with GPU Streaming Loader
  • Hybrid Orchestration
  • Accelerator Diagnostic and Failure Mockup Tools
  • Benchmark tools

8 of 61

Challenge #1: Resource Efficiency

A single black-box metric is insufficient: different metrics produce different indications of resource demand.

Fast-evolving accelerators create the need for resource heterogeneity.

9 of 61

Challenge #2: Cache/Load -aware Routing

Complex cache reuse patterns require a routing strategy that is both locality-aware and load-aware.

10 of 61

Challenge #3: KV Cache Cross-Engine KV Reuse

Offloading KV cache improves GPU compute efficiency.

11 of 61

Challenge #4: Multi-LoRA Management

High-density multi-LoRA deployment packs many adapters onto shared base-model instances, with heavily skewed traffic across adapters.

[Figure: multiple LoRA adapters packed onto shared model:7b base instances]

12 of 61

Introducing AIBrix: Scalable & Cost-efficient LLM Inference

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

13 of 61

AIBrix Extendable API: Enable Flexible LoRA Placement

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

def schedule_lora_model_adapter(context, model, instances[])

Chooses the best instance from instances[] on which to schedule the LoRA model adapter.
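A minimal sketch of what an implementation behind this hook could look like; the Instance fields and the least-loaded heuristic are illustrative assumptions, not AIBrix's actual data model:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Instance:
        name: str
        registered_adapters: int   # adapters already attached to this instance (assumed field)
        max_adapters: int          # high-density limit per instance (assumed field)
        pending_requests: int      # simple load proxy (assumed field)

    def schedule_lora_model_adapter(context: dict, model: str,
                                    instances: List[Instance]) -> Optional[Instance]:
        """Pick an instance that can still host another adapter and is least loaded.

        Illustrative placement heuristic only; real policies may also consider
        base-model compatibility, interference, and adapter affinity.
        """
        candidates = [i for i in instances if i.registered_adapters < i.max_adapters]
        if not candidates:
            return None  # caller may need to scale out a new instance
        # Prefer instances with fewer adapters, then fewer in-flight requests.
        return min(candidates, key=lambda i: (i.registered_adapters, i.pending_requests))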

14 of 61

AIBrix Extendable API: Enable Cache/Load-Aware Routing

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

def route_request(context, prompt, kv_info, instances[], model)

Routes the request to one of the instances (instances[]) given the KV cache matching result (kv_info) and load information.
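For intuition, a hedged sketch of a cache- and load-aware policy behind this hook; the scoring weights and the kv_info/instance fields are assumptions for illustration:

    def route_request(context, prompt, kv_info, instances, model,
                      prefix_weight=1.0, load_weight=0.5):
        """Score each instance by prefix-cache hit ratio minus a load penalty.

        kv_info is assumed to map instance name -> fraction of the prompt's
        KV blocks already cached there; instance.load is assumed to be a
        normalized [0, 1] utilization estimate. Both are illustrative.
        """
        def score(instance):
            hit_ratio = kv_info.get(instance.name, 0.0)
            return prefix_weight * hit_ratio - load_weight * instance.load

        return max(instances, key=score)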

15 of 61

AIBrix Extendable API: Enable Autoscaling

OpenAI Compatible API

API Gateway

Model Service

Container

Container

Distributed Cache Service

KV Cache

KV Cache

Models

Adapters

def compute_target_replicas(cur_instance_count, scaling_context)

Calculates the number of replicas needed based on current metrics and the provided scaling specifications

  • scaling_context is an interface that provides access to scaling parameters such as target_value, fluctuation_tolerance, min_max_scale_rate, etc.
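A simplified sketch of the proportional-scaling math such a hook could implement; the fields assumed on scaling_context (current_value, min_replicas, max_replicas) follow the spirit of the bullet above but are not the exact API:

    import math

    def compute_target_replicas(cur_instance_count: int, scaling_context) -> int:
        """Proportional scaling on an LLM-specific metric, with a tolerance band.

        Assumes scaling_context exposes: current_value (observed metric per replica),
        target_value, fluctuation_tolerance, min_replicas, max_replicas.
        """
        ratio = scaling_context.current_value / scaling_context.target_value
        # Inside the tolerance band: hold steady to avoid scaling oscillation.
        if abs(ratio - 1.0) <= scaling_context.fluctuation_tolerance:
            return cur_instance_count
        desired = math.ceil(cur_instance_count * ratio)
        return max(scaling_context.min_replicas,
                   min(scaling_context.max_replicas, desired))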

16 of 61

AIBrix Open-Source Status

17 of 61

Call for Collaboration

18 of 61

LLM-Tailored Autoscaling: Leveraging LLM-specific Metrics and Embracing Resource Heterogeneity

Presenter: Ning Wang

19 of 61

Autoscaling Introduction

  • Heavy workload: too many requests lead to service unavailability.

  • With Autoscaling: platform dynamically allocates resources based on request demands

[Figure: autoscaling pods serving LLM requests (source: https://blog.px.dev/autoscaling-custom-k8s-metric/)]

20 of 61

Challenge 1: LLM Autoscaling Metrics

  • Queries Per Second (QPS) is not a good pod autoscaling indicator in LLM inference.
    • QPS metrics fall short in LLM scenarios due to variations in input & output sizes.
  • Existing Metrics:
    • Limitations of GPU metrics exposed by NVIDIA Data Center GPU Manager (DCGM).
    • Non-linearity of metrics and scaling implications
  • AIBrix supports LLM inference-specific metrics.

[Figure: latency increases while QPS and SM Active remain unchanged]

21 of 61

Challenge 2: Heterogeneous GPU Resource Utilization

  • Key observation: cheap GPUs handle small requests well and high-end GPUs handle large requests well, so mixing them yields the most cost-efficient inference.

Griggs, Tyler, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. "Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity." arXiv preprint arXiv:2404.14527 (2024).

22 of 61

Challenge 3: Scaling Algorithms

  • Reactive pod autoscaling approaches
    • Metric-based Pod Autoscaling
    • Optimizer-based Pod Autoscaling
  • Proactive pod autoscaling approaches
    • Prediction-based Pod Autoscaling

[Figure: current, predicted, and average usage over time]

23 of 61

Optimizer-based Autoscaling

  • Key Concepts:
    • GPU Capacity: the capacity of a GPU type G under an SLO is the maximum throughput it can achieve while meeting the SLO, denoted MaxTput(G, SLO).
    • Request Load: a workload with request rate r consumes r / MaxTput(G, SLO) of that GPU's capacity.
  • Resource Optimization Problem Formulation:
    • Pack all requests onto the different GPU types at minimal cost, subject to the GPU capacity constraints (see the sketch after the figure below).

[Figure: bin-packing per-pattern request loads (from the GPU consumption profile) onto GPU 1 and GPU 2]
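As referenced above, one way to write down this packing problem is as a small integer linear program. The sketch below uses the PuLP solver with made-up GPU types, costs, throughput profiles, and loads purely for illustration; it is not AIBrix's actual optimizer:

    # pip install pulp   (illustrative ILP; all numbers are made up)
    from pulp import LpProblem, LpMinimize, LpVariable, LpInteger, lpSum, value

    # Profiled MaxTput(G, SLO) per GPU type and request pattern, plus per-GPU cost.
    max_tput = {("A10", "short"): 8.0, ("A10", "long"): 1.5,
                ("L20", "short"): 20.0, ("L20", "long"): 6.0}
    cost = {"A10": 1.0, "L20": 2.8}
    load = {"short": 12.0, "long": 4.0}          # observed request rate per pattern

    gpus, patterns = list(cost), list(load)
    prob = LpProblem("heterogeneous_gpu_packing", LpMinimize)
    n = {g: LpVariable(f"n_{g}", lowBound=0, cat=LpInteger) for g in gpus}   # GPU counts
    x = {(g, p): LpVariable(f"x_{g}_{p}", lowBound=0)                        # routed request rate
         for g in gpus for p in patterns}

    prob += lpSum(cost[g] * n[g] for g in gpus)                              # minimize total GPU cost
    for p in patterns:                                                       # every request must be served
        prob += lpSum(x[(g, p)] for g in gpus) == load[p]
    for g in gpus:                                                           # consumed capacity fits in n[g] GPUs
        prob += lpSum((1.0 / max_tput[(g, p)]) * x[(g, p)] for p in patterns) <= n[g]

    prob.solve()
    print({g: int(value(n[g])) for g in gpus})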

24 of 61

Input-output Specific GPU Benchmarking

  • Offline Benchmarking for each input-output pattern

For each (input tokens, output tokens) bucket, sweep the arrival rate and record throughput and latency percentiles:

Output tokens \ Input tokens    128                                    256    512    ...
4                               Arrival rate: 1, 2, 4, ..., 64
                                TPUT: 1.08, 2.16, 4.31, ..., 48.20
                                E2E P99: 0.16, 0.20, ..., 2.06
                                TTFT P99: 0.06, 0.08, ..., 0.59
                                TPOT: ...
8                               ...                                    ...    ...    ...
16                              ...                                    ...    ...    ...
...                             ...                                    ...    ...    ...

25 of 61

Input-output Specific GPU Profiling

  • Offline profiling for each input-output pattern
    • Finding the maximum throughput that the GPU can achieve within the SLO (a selection sketch follows the tables below).

  • GPU Profile: the GPU's processing capability at different input-output token lengths

Throughputs by input-output token length:

Output tokens \ Input tokens    128      256      512      ...
4                               31.42    16.69    8.52     ...
8                               16.62    16.38    8.42     ...
16                              16.07    8.37     8.08     ...
...                             ...      ...      ...      ...

Arrival Rate    Throughput    Note
1               1.07          Unsaturated GPU capacity
2               2.14          Unsaturated GPU capacity
4               4.21          Unsaturated GPU capacity
8               8.08          Unsaturated GPU capacity with no SLO deterioration over time (profile candidate)
16              14.13         Saturated GPU capacity with little possible SLO deterioration over time
32              19.33         Saturated GPU capacity with large SLO deterioration over time
64              20.68         Saturated GPU capacity with large SLO deterioration over time
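A hedged sketch of how a profile entry (MaxTput(G, SLO)) might be selected from such a sweep; the sample dictionary format and the slo_stable flag are assumptions for illustration:

    def select_profile_throughput(samples, slo_target, slo_metric="e2e_p99"):
        """Pick MaxTput(G, SLO) for one input-output bucket.

        samples: list of dicts from the offline sweep, e.g.
          {"arrival_rate": 8, "throughput": 8.08, "e2e_p99": 0.9, "slo_stable": True}
        Returns the highest throughput whose latency stays within the SLO and
        does not deteriorate over time (the 'profile candidate' in the table above).
        """
        feasible = [s for s in samples
                    if s[slo_metric] <= slo_target and s.get("slo_stable", False)]
        if not feasible:
            return 0.0  # this bucket cannot be served within SLO on this GPU type
        return max(s["throughput"] for s in feasible)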

26 of 61

Heterogeneous Autoscaling Overall Design

  • Online Pod Autoscaling

27 of 61


Autoscaling Experiments

  • GPU-optimizer-based pod autoscaler can reduce costs while maintaining SLO (e.g., latency).
  • GPU-optimizer-based pod autoscaler leverages GPU profiling data to bypass the need for threshold tuning.

[Figure: experiment results for gpu_cache_usage_perc thresholds of 50 and 70]

28 of 61

Practical Challenges in Optimizer-based Autoscaling

  • GPU Availability Constraint
    • The optimizer returns 4 A10 GPUs, but only 3 A10 GPUs are available.
  • Profile-aware Request Routing
    • The optimizer's min-cost packing is not achieved by random routing.
  • Model-level Pod Autoscaling Coordination
    • Enforce a rolling-update policy if scaling in and scaling out happen at the same time. For example, if the optimizer decides to increase L20 and decrease A10, scale up L20 first and delay scaling down A10.
  • Integer Linear Programming Stability
    • Avoid large GPU combination changes (e.g., switching from 4×L20 to 5×A10).

29 of 61

Evaluation Challenges

Herbst, Nikolas Roman, Samuel Kounev, and Ralf Reussner. "Elasticity in cloud computing: What it is, and what it is not." In 10th international conference on autonomic computing (ICAC 13), pp. 23-27. 2013.

Zhong, Yinmin, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. "DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving." In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193-210. 2024.

30 of 61

Reducing Inference Bottlenecks in Shared Prompt Environments with Prefix Caching

Presenter: Gangmuk Lim

31 of 61

Routing in LLM inference

  • Traditional load balancing does not work effectively for LLM inference applications.
  • Prefix-aware routing vs. load-aware routing
    • Previous works: SGLang (RadixAttention), Preble (prefix + load aware routing), DLPM (prefix + load + fairness aware routing)

AIBrix can be easily extended!

For example, the Preble routing algorithm is already implemented in AIBrix.

32 of 61

Routing in LLM inference: KV cache

[Figure: KV cache for the prompt "I like apple": per-token Q, K, and V; the K and V of earlier tokens ("I", "like") are cached and reused when processing "apple"]

33 of 61

Example: Prefix-aware

[Figure: prefix-aware routing example. Node 1 (load 50%) holds the KV cache for "I like apple."; Node 2 (load 30%) does not. Incoming prompts "I like apple", "I like apple very much", and "I like orange" match the cached prefix to different degrees (100%, 50%, and 0%).]

34 of 61

Architecture of AIBrix Router

[Figure: request flow. A request enters the Envoy Gateway; the AIBrix router computes the routing rule and sets target-pod-ip; the request is forwarded to one of Pod 1-4.]

Easily extensible (prefix-aware routing, GPU-utilization routing, etc.): a new routing algorithm requires changing nothing but the policy-logic code.

35 of 61

Experiment Results

  • 8 × NVIDIA L20 GPUs
  • DeepSeek 7B LLM Chat
  • Workload: prefix-sharing workload, RPS 10

[Figure: results with 1000 and 8000 shared prefix tokens]

Takeaway:

Simple prefix-aware routing is not sufficient; routing should carefully consider both load and prefix to make the best decision.

36 of 61

Prefix-aware technical details: different data structure

[Figure: two prefix-index data structures for the prompt "I like apple very much". A radix tree indexes at token granularity ("I like" branching to "apple very much", "apple pie", "orange"); a hash table indexes at KV-block granularity, mapping fixed-size blocks ("I like", "apple very", "much") to KV blocks.]
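To make the block-granularity hash approach concrete, here is a minimal sketch; the block size, chained-hash scheme, and pod_blocks structure are illustrative assumptions, not AIBrix's actual implementation:

    import hashlib
    from typing import Dict, List, Set

    BLOCK_SIZE = 16  # tokens per KV block (illustrative)

    def block_hashes(tokens: List[int]) -> List[str]:
        """Chain-hash fixed-size token blocks so a block's hash encodes its full prefix."""
        hashes, prev = [], ""
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            block = tokens[i:i + BLOCK_SIZE]
            prev = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
            hashes.append(prev)
        return hashes

    def prefix_match(tokens: List[int], pod_blocks: Dict[str, Set[str]]) -> Dict[str, int]:
        """Return, per pod, how many leading blocks of the prompt are already cached there."""
        matched = {}
        for pod, cached in pod_blocks.items():
            count = 0
            for h in block_hashes(tokens):
                if h not in cached:
                    break
                count += 1
            matched[pod] = count
        return matched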

37 of 61

Technical challenge 1: Highly available router (scaling)

[Figure: multiple AIBrix router replicas behind the Envoy Gateway route requests to Pod 1-4; each replica holds a full copy of the prefix index ("I like", "apple", "orange", "very much", "pie")]

Each gateway (router replica) would hold a large amount of state:

Avg hit ratio    Avg token length    Num prompts    State size
0.2              100                 20             61.71 KB
0.2              1,000               50             1.52 MB
0.2              10,000              100            30.51 MB
0.2              10,000              1,000          305.09 MB
0.2              10,000              10,000         2.98 GB

38 of 61

Technical challenge 2:

Synchronizing the kv cache state in different layers

[Figure: Envoy Gateway and AIBrix router replicas in front of Pods 1-4; each pod's engine keeps its own cache, alongside a distributed KV cache tier with its own pods]

Each engine instance maintains its own cache. (e.g., vLLM)

39 of 61

Technical challenge 2:

Synchronizing the kv cache state in different layers

[Figure: the same topology at larger scale: Envoy Gateway, multiple AIBrix router replicas, Pods 1-4 with per-engine caches, and the distributed KV cache]

What about a large-scale cluster?

This is another scalability issue:

  • it requires constant data transmission between all engines and all routers (all-to-all)

40 of 61

Open Challenges

  • Routing becomes a more complex decision, considering:
    • Heterogeneous GPUs
    • LoRA
    • Multi-tenancy (fairness)
    • SLO

41 of 61

KV Cache Offloading for Cross-Engine KV Reuse

Presenter: Haiyang Shi

42 of 61

KV Cache Recap

Adapted from https://huggingface.co/blog/kv-cache-quantization

*: when processing token[K], we only need the K'th row of Q

**: when processing token[K], we require the full K & V tensors, but we can reuse the cached values (this lets us skip recomputing K & V for earlier tokens)
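As a refresher on the mechanics above, a minimal single-head NumPy sketch of decode-time attention with a KV cache (toy dimensions and random projections; illustration only):

    import numpy as np

    d = 64                                  # head dimension (toy value)
    Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
    k_cache, v_cache = [], []               # grows by one row per generated token

    def decode_step(x_new):
        """Attend the newest token against all cached K/V without recomputing them."""
        q = x_new @ Wq                      # only the newest token's query row is needed (*)
        k_cache.append(x_new @ Wk)          # append this token's K/V once ...
        v_cache.append(x_new @ Wv)          # ... and reuse the cached values on later steps (**)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = (K @ q) / np.sqrt(d)       # attention over the full cached context
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                  # context vector for the new token

    for _ in range(5):                      # generate a few toy steps
        out = decode_step(np.random.randn(d))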

43 of 61

Key Challenges

Scenario 1: Prefix Cache (MLSys'24 PromptCache)

  1. KV Cache Reuse vs. Limited GPU Memory Capacity
  • Input sequences have reusable portions, e.g., repeated system prompts, multi-turn conversations, template-driven content generation, code autocompletion, ...

  • KV caches consume a lot of GPU memory, scaling with model size and context length
    • For the LLaMA 3.1 70B model, the KV cache for 128K tokens consumes ~40GB of GPU memory (see the calculation below)

  • Limited GPU memory -> KV cache eviction -> recomputation, wasting GPU cycles and time
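The ~40GB figure can be sanity-checked with a quick back-of-the-envelope calculation, assuming an FP16 KV cache and LLaMA 3.1 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128):

    layers, kv_heads, head_dim = 80, 8, 128     # LLaMA 3.1 70B (grouped-query attention)
    bytes_per_elem = 2                          # FP16
    tokens = 128 * 1024

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
    total_gib = per_token * tokens / 2**30
    print(per_token / 1024, "KiB/token;", round(total_gib, 1), "GiB for 128K tokens")
    # -> 320.0 KiB/token; 40.0 GiB for 128K tokens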

44 of 61

Key Challenges

Scenario 2: Request Migration (OSDI'24 ServerlessLLM)

  • KV Cache Migration
  • Lack of fault tolerance -> complete loss of in-progress requests -> resuming failed requests is time-consuming

  • Dynamic workloads + slow instance bootstrapping -> scale up/down -> service degradation or downtime

45 of 61

Key Challenges

  • Hardware Constraints
  • Lower-end GPU instances w/o high-speed interconnects like RDMA

  • Many GPUs share a single VPC NIC, leading to significant network bandwidth contention

  • Selective KV cache offloading

Lower-End GPU Instance

46 of 61

Key Challenges

  • Hardware Constraints
  • High-end GPU instances w/ RDMA

  • Remote KV cache access and local KV cache access (on host DRAM) compete for PCIe bandwidth

  • Selective KV cache offloading preserves PCIe bandwidth so that local KV cache accesses stay low-latency

High-End GPU Instance

47 of 61

KV Cache Architectures

From Single Node GPU Cache => Distributed and Disaggregated KV Cache

48 of 61

AIBrix KV Cache Offloading Framework

Pluggable Eviction Policies

  • Selective KV cache offloading
  • Flexible to design new policies
  • Can be used to study characteristics of different eviction policies w/ LLM workloads

Pluggable Marshallers

  • KV-aware compression
  • Quantization

Pluggable Connectors

  • SSD backends
  • Networked backends
  • Managed KV cache services

RFC: https://github.com/vllm-project/vllm/issues/14724
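A hedged sketch of what these plugin seams could look like as interfaces; the class and method names are illustrative assumptions, not the framework's actual API (see the RFC above for the real design):

    from abc import ABC, abstractmethod
    from typing import Optional

    class EvictionPolicy(ABC):
        """Decides which KV blocks to drop or offload when the cache fills up."""
        @abstractmethod
        def on_access(self, block_id: str) -> None: ...
        @abstractmethod
        def select_victim(self) -> Optional[str]: ...

    class Marshaller(ABC):
        """Transforms KV tensors on the way out/in (e.g., compression, quantization)."""
        @abstractmethod
        def marshal(self, kv_bytes: bytes) -> bytes: ...
        @abstractmethod
        def unmarshal(self, payload: bytes) -> bytes: ...

    class Connector(ABC):
        """Moves marshalled KV data to/from a backend (SSD, networked store, managed service)."""
        @abstractmethod
        def put(self, key: str, payload: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> Optional[bytes]: ...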

49 of 61

Multi-LoRA Management in Production Environment

Presenter: Jiaxin Shan

50 of 61

Dense LoRA: Powering Efficient Finetuning at Scale

W_updated = W + ΔW = W + A·B, where A and B are low-rank adapter matrices (see the example below)

[Figure: adapter volume vs. deployment density]

Traffic is skewed between different models: some are critical production online models, while others are experimental with near-zero usage.
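To see why LoRA is cheap enough to deploy densely, compare parameter counts for a single projection; the dimension and rank below are illustrative (d = 4096 is typical for 7B-class models):

    d, r = 4096, 16                      # weight dimension and LoRA rank (illustrative)
    full_delta = d * d                   # updating W directly: ~16.8M parameters
    lora_delta = d * r + r * d           # A (d x r) and B (r x d): ~131K parameters
    print(full_delta, lora_delta, round(full_delta / lora_delta))   # ~128x fewer parameters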

51 of 61

High Density Deployment Challenges at Scale

[Figure: many LoRA adapters packed onto shared model:7b base instances]

Challenges:

1. LoRA is a container-internal manifest; it breaks the Kubernetes pod design and makes service discovery even harder

2. Request routing and LoRA scheduling become challenging

3. Adapters compete for resources with tenants, so contention problems exist

52 of 61

Case Study: Text2SQL

Text2TLS - SQL Like Language with specific syntax

AIBrix is deployed on VKE, utilizing 12 NVIDIA A10 cards. This setup substantially boosts our infrastructure's processing power, enabling support for more AI-driven code services, such as DBW testing and TLS production tasks through Bytebrain.

Text2SQL - Traditional SQL

50% GPU cost reduction!

53 of 61

LoRA Integration with vLLM

  1. Dynamic Model Adapter Management: Enables dynamic loading and unloading of LoRA model adapters, improving flexibility and resource efficiency by allowing models to be adjusted on demand.

  2. Advanced Scheduling Algorithm: Utilizes an intelligent scheduling algorithm that optimally places LoRA models on the right pods, minimizing interference and enhancing overall inference performance.

  3. LoRA-Specific Routing with EndpointSlice: Leverages Kubernetes EndpointSlice for precise LoRA-specific routing, ensuring efficient traffic distribution and optimized resource utilization.

[RFC]: Enhancing LoRA Management for Production Environments in vLLM
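For reference, dynamic adapter loading can be exercised against a vLLM server started with runtime LoRA updating enabled (VLLM_ALLOW_RUNTIME_LORA_UPDATING=True); the server address, adapter name, and path below are placeholders:

    import requests

    VLLM = "http://localhost:8000"   # placeholder server address

    # Register a LoRA adapter at runtime (vLLM dynamic LoRA endpoint).
    requests.post(f"{VLLM}/v1/load_lora_adapter",
                  json={"lora_name": "sql-adapter",           # placeholder adapter name
                        "lora_path": "/models/loras/sql"})     # placeholder adapter path

    # Requests can now target the adapter as if it were a model.
    resp = requests.post(f"{VLLM}/v1/completions",
                         json={"model": "sql-adapter",
                               "prompt": "SELECT", "max_tokens": 16})

    # Unload when traffic drops to reclaim GPU memory.
    requests.post(f"{VLLM}/v1/unload_lora_adapter",
                  json={"lora_name": "sql-adapter"})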

54 of 61

LoRA at Scale: Efficiency

AIBrix reduces resource cost by 4.7X when the application has a sparse workload, and still maintains 1.5X resource savings when the load is high.

New challenge: when should we use merged vs. unmerged weights?

Related work: OSDI'24 dLoRA

Using high-density LoRA deployment with the correct batch size provides cost reductions ranging from 8% to 2.1X.

New challenge: how do we prioritize latency-sensitive requests and address resource contention?

55 of 61

Open Research Challenges in LLM Inference Systems

Presenter: Jiaxin Shan

56 of 61

Open Research Challenges

1. High-density / serverless serving for LLM foundation models, or adaptive GPU multiplexing (ServerlessLLM, Prism)

2. Online & batch request colocation in LLM serving systems (BatchLLM, ConServe)

3. Heterogeneous placement and routing; compute resource standardization issues (Mélange)

4. Unified LLM routing challenges considering fairness, heterogeneity, LoRA, prefix cache, SLO, load, etc. (VTC, D2LPM, Preble)

5. Application-aware optimization (Autellix)

6. Large-scale simulation, output latency prediction, and roofline model analysis (Vidur, LLMViewer)

57 of 61

Thank You

58 of 61

Prefix-aware technical details

AIBrix implementation architecture


consistent view, dynamo

high availability (multi-replica)

59 of 61

Key Innovations Enhancing Performance and Reducing Costs

  • High-density LoRA management for efficient adapter scheduling
  • LLM-specific autoscalers for real-time dynamic scaling
  • Prefix-aware, load-aware request routing
  • Distributed KV cache improving token reuse
  • Unified AI runtime for seamless model management

60 of 61

Introduction: The Need for Scalable LLM Infrastructure

  • LLMs drive AI applications but require efficient deployment
  • Research opportunities:
    • Resource Efficiency
      • Efficient autoscaling
    • Resource/Performance Isolation
      • Supporting inference requests with different SLAs and resource requirements
      • Mixing online/offline inference
      • Performance isolation/fairness guarantees for multi-LoRA deployment
    • Resource Heterogeneity
      • Support for multi-level KV cache
      • Dynamic provisioning for heterogeneous accelerators
  • AIBrix provides a co-designed solution optimizing every layer

61 of 61

Conclusion: Summary & Future Work

  • AIBrix optimizes LLM inference with system-level orchestration
  • Innovations in autoscaling, routing, and GPU efficiency
  • Future improvements: expanded workload profiling, dynamic autoscaling refinements