1 of 61

AIBrix: An Open-Source, Large-Scale LLM Inference Infrastructure for System Research

ASPLOS 2025 Tutorials

2 of 61

AIBrix Team Introduction

3 of 61

Agenda

  • Session 1 (2:00PM - 3:30PM)
    • Introducing AIBrix: A Testbed for Large-Scale LLM Inference System Research
    • LLM-Tailored Autoscaling: Leveraging LLM-specific Metrics and Embracing Resource Heterogeneity
    • Reducing Inference Bottlenecks in Shared Prompt Environments with Prefix Caching
    • KV Cache Offloading for Cross-Engine KV Reuse
    • Multi-LoRA Management in Production Environment
    • Open Research Challenges in LLM Inference Systems

  • Session 2 (4:00PM - 5:30PM)
    • Hands-on AIBrix Feature Demo in AWS Studio Workshop

4 of 61

Introducing AIBrix: A Testbed for Large-Scale LLM Inference System Research

5 of 61

AIBrix Overview

  • Why launch AIBrix (motivation)?
    • Deploying LLMs at scale is not easy for everyone.
  • What is AIBrix?
    • A cloud-native solution optimized for deploying, managing, and scaling LLM inference, co-designed with inference engines.
    • It is developed by ByteDance and open-sourced as a vLLM project.
  • Why open-source AIBrix?
    • Open collaboration for an open and accessible AI infrastructure.
  • Why adopt AIBrix?
    • Simplicity, usability, scalability, and performance.

“AIBrix is an open-source LEGO set for building enterprise-grade LLM infra without duct-taping GPUs to your server rack.”

- by Mitko Vasilev, CTO, Mitko X

6 of 61

Why LLM Inference Systems are Challenging for Systems Researchers

Am I making the right assumptions about my systems?

Am I solving “the right problem”?

How well does my solution work in a realistic LLM stack?

7 of 61

LLM Inference: A Layered View of Architectural Challenges & Solutions

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

  • Resource/KV Cache/Heterogeneity-aware Routing
  • Fairness and Load Control
  • Dynamic Batching
  • High-Density Adapter Management
  • Model/Adapter Provisioning
  • Unified Request API
  • Autoscaling/Multi-Node Orchestration
  • Cache-awareness
  • Distributed Storage Design

Additional features (https://github.com/vllm-project/aibrix):

  • Unified AI Runtime with GPU Streaming Loader
  • Hybrid Orchestration
  • Accelerator Diagnostic and Failure Mockup Tools
  • Benchmark tools

8 of 61

Challenge #1: Resource Efficiency

A single black-box metric is insufficient: different metrics produce different indications of resource demand.

Fast-evolving accelerators create the need for resource heterogeneity.

9 of 61

Challenge #2: Cache/Load -aware Routing

Complex cache reuse patterns require a routing strategy that is both locality-aware and load-aware.

10 of 61

Challenge #3: KV Cache Cross-Engine KV Reuse

Offloading KV cache improves GPU compute efficiency.

11 of 61

Challenge #4: Multi-LoRA Management

High-density multi-LoRA deployment packs many adapters onto shared base-model instances, with heavily skewed traffic across adapters.

[Figure: multiple LoRA adapters packed onto shared model:7b base instances]

12 of 61

Introducing AIBrix: Scalable & Cost-efficient LLM Inference

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

13 of 61

AIBrix Extendable API: Enable Flexible LoRA Placement

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

def schedule_lora_model_adapter(context, model, instances[])

Chooses the best instance from instances[] on which to schedule the LoRA model adapter.
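A minimal sketch of what an implementation behind this hook could look like; the Instance fields and the least-loaded heuristic are illustrative assumptions, not AIBrix's actual data model:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Instance:
        name: str
        registered_adapters: int   # adapters already attached to this instance (assumed field)
        max_adapters: int          # high-density limit per instance (assumed field)
        pending_requests: int      # simple load proxy (assumed field)

    def schedule_lora_model_adapter(context: dict, model: str,
                                    instances: List[Instance]) -> Optional[Instance]:
        """Pick an instance that can still host another adapter and is least loaded.

        Illustrative placement heuristic only; real policies may also consider
        base-model compatibility, interference, and adapter affinity.
        """
        candidates = [i for i in instances if i.registered_adapters < i.max_adapters]
        if not candidates:
            return None  # caller may need to scale out a new instance
        # Prefer instances with fewer adapters, then fewer in-flight requests.
        return min(candidates, key=lambda i: (i.registered_adapters, i.pending_requests))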

14 of 61

AIBrix Extendable API: Enable Cache/Load-Aware Routing

[Figure: AIBrix layered architecture: OpenAI-compatible API, API Gateway, Model Service (containers), Distributed Cache Service (KV caches), Models, and Adapters]

def route_request(context, prompt, kv_info, instances[], model)

Routes the request to one of the instances (instances[]) given the KV cache matching result (kv_info) and load information.
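For intuition, a hedged sketch of a cache- and load-aware policy behind this hook; the scoring weights and the kv_info/instance fields are assumptions for illustration:

    def route_request(context, prompt, kv_info, instances, model,
                      prefix_weight=1.0, load_weight=0.5):
        """Score each instance by prefix-cache hit ratio minus a load penalty.

        kv_info is assumed to map instance name -> fraction of the prompt's
        KV blocks already cached there; instance.load is assumed to be a
        normalized [0, 1] utilization estimate. Both are illustrative.
        """
        def score(instance):
            hit_ratio = kv_info.get(instance.name, 0.0)
            return prefix_weight * hit_ratio - load_weight * instance.load

        return max(instances, key=score)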

15 of 61

AIBrix Extendable API: Enable Autoscaling

OpenAI Compatible API

API Gateway

Model Service

Container

Container

Distributed Cache Service

KV Cache

KV Cache

Models

Adapters

def compute_target_replicas(cur_instance_count, scaling_context)

Calculates the number of replicas needed based on current metrics and the provided scaling specifications

  • scaling_context is an interface that provides access to scaling parameters such as target_value, fluctuation_tolerance, min_max_scale_rate, etc.
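A simplified sketch of the proportional-scaling math such a hook could implement; the fields assumed on scaling_context (current_value, min_replicas, max_replicas) follow the spirit of the bullet above but are not the exact API:

    import math

    def compute_target_replicas(cur_instance_count: int, scaling_context) -> int:
        """Proportional scaling on an LLM-specific metric, with a tolerance band.

        Assumes scaling_context exposes: current_value (observed metric per replica),
        target_value, fluctuation_tolerance, min_replicas, max_replicas.
        """
        ratio = scaling_context.current_value / scaling_context.target_value
        # Inside the tolerance band: hold steady to avoid scaling oscillation.
        if abs(ratio - 1.0) <= scaling_context.fluctuation_tolerance:
            return cur_instance_count
        desired = math.ceil(cur_instance_count * ratio)
        return max(scaling_context.min_replicas,
                   min(scaling_context.max_replicas, desired))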

16 of 61

AIBrix Open-Source Status

17 of 61

Call for Collaboration

18 of 61

LLM-Tailored Autoscaling: Leveraging LLM-specific Metrics and Embracing Resource Heterogeneity

Presenter: Ning Wang

19 of 61

Autoscaling Introduction

  • Heavy workload: too many requests lead to service unavailability.

  • With Autoscaling: platform dynamically allocates resources based on request demands

[Figure: autoscaling pods serving LLM requests (source: https://blog.px.dev/autoscaling-custom-k8s-metric/)]

20 of 61

Challenge 1: LLM Autoscaling Metrics

  • Queries Per Second (QPS) is not a good pod autoscaling indicator in LLM inference.
    • QPS metrics fall short in LLM scenarios due to variations in input & output sizes.
  • Existing Metrics:
    • Limitations of GPU metrics exposed by NVIDIA Data Center GPU Manager (DCGM).
    • Non-linearity of metrics and scaling implications
  • AIBrix supports LLM inference-specific metrics.

[Figure: latency increases while QPS and SM Active remain unchanged]

21 of 61

Challenge 2: Heterogeneous GPU Resource Utilization

  • Key observation: cheap GPUs handle small requests well and high-end GPUs handle large requests well, so mixing them yields the most cost-efficient inference.

Griggs, Tyler, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. "Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity." arXiv preprint arXiv:2404.14527 (2024).

22 of 61

Challenge 3: Scaling Algorithms

  • Reactive pod autoscaling approaches
    • Metric-based Pod Autoscaling
    • Optimizer-based Pod Autoscaling
  • Proactive pod autoscaling approaches
    • Prediction-based Pod Autoscaling

[Figure: current, predicted, and average usage over time]

23 of 61

Optimizer-based Autoscaling

  • Key Concepts:
    • GPU Capacity: the capacity of a GPU type G under an SLO is the maximum throughput it can achieve while meeting the SLO, denoted MaxTput(G, SLO).
    • Request Load: a workload with request rate r consumes r / MaxTput(G, SLO) of that GPU's capacity.
  • Resource Optimization Problem Formulation:
    • Pack all requests onto the different GPU types at minimal cost, subject to the GPU capacity constraints (see the sketch after the figure below).

[Figure: bin-packing per-pattern request loads (from the GPU consumption profile) onto GPU 1 and GPU 2]
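As referenced above, one way to write down this packing problem is as a small integer linear program. The sketch below uses the PuLP solver with made-up GPU types, costs, throughput profiles, and loads purely for illustration; it is not AIBrix's actual optimizer:

    # pip install pulp   (illustrative ILP; all numbers are made up)
    from pulp import LpProblem, LpMinimize, LpVariable, LpInteger, lpSum, value

    # Profiled MaxTput(G, SLO) per GPU type and request pattern, plus per-GPU cost.
    max_tput = {("A10", "short"): 8.0, ("A10", "long"): 1.5,
                ("L20", "short"): 20.0, ("L20", "long"): 6.0}
    cost = {"A10": 1.0, "L20": 2.8}
    load = {"short": 12.0, "long": 4.0}          # observed request rate per pattern

    gpus, patterns = list(cost), list(load)
    prob = LpProblem("heterogeneous_gpu_packing", LpMinimize)
    n = {g: LpVariable(f"n_{g}", lowBound=0, cat=LpInteger) for g in gpus}   # GPU counts
    x = {(g, p): LpVariable(f"x_{g}_{p}", lowBound=0)                        # routed request rate
         for g in gpus for p in patterns}

    prob += lpSum(cost[g] * n[g] for g in gpus)                              # minimize total GPU cost
    for p in patterns:                                                       # every request must be served
        prob += lpSum(x[(g, p)] for g in gpus) == load[p]
    for g in gpus:                                                           # consumed capacity fits in n[g] GPUs
        prob += lpSum((1.0 / max_tput[(g, p)]) * x[(g, p)] for p in patterns) <= n[g]

    prob.solve()
    print({g: int(value(n[g])) for g in gpus})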

24 of 61

Input-output Specific GPU Benchmarking

  • Offline Benchmarking for each input-output pattern

For each (input tokens, output tokens) bucket, sweep the arrival rate and record throughput and latency percentiles:

Output tokens \ Input tokens    128                                    256    512    ...
4                               Arrival rate: 1, 2, 4, ..., 64
                                TPUT: 1.08, 2.16, 4.31, ..., 48.20
                                E2E P99: 0.16, 0.20, ..., 2.06
                                TTFT P99: 0.06, 0.08, ..., 0.59
                                TPOT: ...
8                               ...                                    ...    ...    ...
16                              ...                                    ...    ...    ...
...                             ...                                    ...    ...    ...

25 of 61

Input-output Specific GPU Profiling

  • Offline profiling for each input-output pattern
    • Finding the maximum throughput that the GPU can achieve within the SLO (a selection sketch follows the tables below).

  • GPU Profile: the GPU's processing capability at different input-output token lengths

Throughputs by input-output token length:

Output tokens \ Input tokens    128      256      512      ...
4                               31.42    16.69    8.52     ...
8                               16.62    16.38    8.42     ...
16                              16.07    8.37     8.08     ...
...                             ...      ...      ...      ...

Arrival Rate    Throughput    Note
1               1.07          Unsaturated GPU capacity
2               2.14          Unsaturated GPU capacity
4               4.21          Unsaturated GPU capacity
8               8.08          Unsaturated GPU capacity with no SLO deterioration over time (profile candidate)
16              14.13         Saturated GPU capacity with little possible SLO deterioration over time
32              19.33         Saturated GPU capacity with large SLO deterioration over time
64              20.68         Saturated GPU capacity with large SLO deterioration over time
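A hedged sketch of how a profile entry (MaxTput(G, SLO)) might be selected from such a sweep; the sample dictionary format and the slo_stable flag are assumptions for illustration:

    def select_profile_throughput(samples, slo_target, slo_metric="e2e_p99"):
        """Pick MaxTput(G, SLO) for one input-output bucket.

        samples: list of dicts from the offline sweep, e.g.
          {"arrival_rate": 8, "throughput": 8.08, "e2e_p99": 0.9, "slo_stable": True}
        Returns the highest throughput whose latency stays within the SLO and
        does not deteriorate over time (the 'profile candidate' in the table above).
        """
        feasible = [s for s in samples
                    if s[slo_metric] <= slo_target and s.get("slo_stable", False)]
        if not feasible:
            return 0.0  # this bucket cannot be served within SLO on this GPU type
        return max(s["throughput"] for s in feasible)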

26 of 61

Heterogeneous Autoscaling Overall Design

  • Online Pod Autoscaling

27 of 61


Autoscaling Experiments

  • GPU-optimizer-based pod autoscaler can reduce costs while maintaining SLO (e.g., latency).
  • GPU-optimizer-based pod autoscaler leverages GPU profiling data to bypass the need for threshold tuning.

[Figure: experiment results for gpu_cache_usage_perc thresholds of 50 and 70]

28 of 61

Practical Challenges in Optimizer-based Autoscaling

  • GPU Availability Constraint
    • The optimizer returns 4 A10 GPUs, but only 3 A10 GPUs are available.
  • Profile-aware Request Routing
    • The optimizer's min-cost packing is not achieved by random routing.
  • Model-level Pod Autoscaling Coordination
    • Enforce a rolling-update policy if scaling in and scaling out happen at the same time. For example, if the optimizer decides to increase L20 and decrease A10, scale up L20 first and delay scaling down A10.
  • Integer Linear Programming Stability
    • Avoid large GPU combination changes (e.g., switching from 4×L20 to 5×A10).

29 of 61

Evaluation Challenges

Herbst, Nikolas Roman, Samuel Kounev, and Ralf Reussner. "Elasticity in cloud computing: What it is, and what it is not." In 10th international conference on autonomic computing (ICAC 13), pp. 23-27. 2013.

Zhong, Yinmin, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. "DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving." In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193-210. 2024.

30 of 61

Reducing Inference Bottlenecks in Shared Prompt Environments with Prefix Caching

Presenter: Gangmuk Lim

31 of 61

Routing in LLM inference

  • Traditional load balancing does not work effectively for LLM inference applications.
  • Prefix-aware routing vs. load-aware routing
    • Previous works: SGLang (RadixAttention), Preble (prefix + load aware routing), DLPM (prefix + load + fairness aware routing)

AIBrix can be easily extended!

For example, the Preble routing algorithm is already implemented in AIBrix.

32 of 61

Routing in LLM inference: KV cache

[Figure: KV cache for the prompt "I like apple": per-token Q, K, and V; the K and V of earlier tokens ("I", "like") are cached and reused when processing "apple"]

33 of 61

Example: Prefix-aware

[Figure: prefix-aware routing example. Node 1 (load 50%) holds the KV cache for "I like apple."; Node 2 (load 30%) does not. Incoming prompts "I like apple", "I like apple very much", and "I like orange" match the cached prefix to different degrees (100%, 50%, and 0%).]

34 of 61

Architecture of AIBrix Router

[Figure: request flow. A request enters the Envoy Gateway; the AIBrix router computes the routing rule and sets target-pod-ip; the request is forwarded to one of Pod 1-4.]

Easily extensible (prefix-aware routing, GPU-utilization routing, etc.): a new routing algorithm requires changing nothing but the policy-logic code.

35 of 61

Experiment Results

  • 8 × NVIDIA L20 GPUs
  • DeepSeek 7B LLM Chat
  • Workload: prefix-sharing workload, RPS 10

[Figure: results with 1000 and 8000 shared prefix tokens]

Takeaway:

Simple prefix-aware routing is not sufficient; routing should carefully consider both load and prefix to make the best decision.

36 of 61

Prefix-aware technical details: different data structure

[Figure: two prefix-index data structures for the prompt "I like apple very much". A radix tree indexes at token granularity ("I like" branching to "apple very much", "apple pie", "orange"); a hash table indexes at KV-block granularity, mapping fixed-size blocks ("I like", "apple very", "much") to KV blocks.]
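To make the block-granularity hash approach concrete, here is a minimal sketch; the block size, chained-hash scheme, and pod_blocks structure are illustrative assumptions, not AIBrix's actual implementation:

    import hashlib
    from typing import Dict, List, Set

    BLOCK_SIZE = 16  # tokens per KV block (illustrative)

    def block_hashes(tokens: List[int]) -> List[str]:
        """Chain-hash fixed-size token blocks so a block's hash encodes its full prefix."""
        hashes, prev = [], ""
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            block = tokens[i:i + BLOCK_SIZE]
            prev = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
            hashes.append(prev)
        return hashes

    def prefix_match(tokens: List[int], pod_blocks: Dict[str, Set[str]]) -> Dict[str, int]:
        """Return, per pod, how many leading blocks of the prompt are already cached there."""
        matched = {}
        for pod, cached in pod_blocks.items():
            count = 0
            for h in block_hashes(tokens):
                if h not in cached:
                    break
                count += 1
            matched[pod] = count
        return matched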

37 of 61

Technical challenge 1: Highly available router (scaling)

[Figure: multiple AIBrix router replicas behind the Envoy Gateway route requests to Pod 1-4; each replica holds a full copy of the prefix index ("I like", "apple", "orange", "very much", "pie")]

Each gateway (router replica) would hold a large amount of state:

Avg hit ratio    Avg token length    Num prompts    State size
0.2              100                 20             61.71 KB
0.2              1,000               50             1.52 MB
0.2              10,000              100            30.51 MB
0.2              10,000              1,000          305.09 MB
0.2              10,000              10,000         2.98 GB

38 of 61

Technical challenge 2:

Synchronizing the kv cache state in different layers

[Figure: Envoy Gateway and AIBrix router replicas in front of Pods 1-4; each pod's engine keeps its own cache, alongside a distributed KV cache tier with its own pods]

Each engine instance maintains its own cache. (e.g., vLLM)

39 of 61

Technical challenge 2:

Synchronizing the kv cache state in different layers

[Figure: the same topology at larger scale: Envoy Gateway, multiple AIBrix router replicas, Pods 1-4 with per-engine caches, and the distributed KV cache]

What about a large-scale cluster?

This is another scalability issue:

  • it requires constant data transmission between all engines and all routers (all-to-all)

40 of 61

Open Challenges

  • Routing becomes a more complex decision, considering:
    • Heterogeneous GPUs
    • LoRA
    • Multi-tenancy (fairness)
    • SLO

41 of 61

KV Cache Offloading for Cross-Engine KV Reuse

Presenter: Haiyang Shi

42 of 61

KV Cache Recap

Adapted from https://huggingface.co/blog/kv-cache-quantization

*: when processing token[K], we only need the K'th row of Q

**: when processing token[K], we require the full K & V tensors, but we can reuse the cached values (this lets us skip recomputing K & V for earlier tokens)
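As a refresher on the mechanics above, a minimal single-head NumPy sketch of decode-time attention with a KV cache (toy dimensions and random projections; illustration only):

    import numpy as np

    d = 64                                  # head dimension (toy value)
    Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
    k_cache, v_cache = [], []               # grows by one row per generated token

    def decode_step(x_new):
        """Attend the newest token against all cached K/V without recomputing them."""
        q = x_new @ Wq                      # only the newest token's query row is needed (*)
        k_cache.append(x_new @ Wk)          # append this token's K/V once ...
        v_cache.append(x_new @ Wv)          # ... and reuse the cached values on later steps (**)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = (K @ q) / np.sqrt(d)       # attention over the full cached context
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                  # context vector for the new token

    for _ in range(5):                      # generate a few toy steps
        out = decode_step(np.random.randn(d))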

43 of 61

Key Challenges

Scenario 1: Prefix Cache (MLSys'24 PromptCache)

  1. KV Cache Reuse vs. Limited GPU Memory Capacity
  • Input sequences have reusable portions, e.g., repeated system prompts, multi-turn conversations, template-driven content generation, code autocompletion, ...

  • KV caches consume a lot of GPU memory, scaling with model size and context length
    • For the LLaMA 3.1 70B model, the KV cache for 128K tokens consumes ~40GB of GPU memory (see the calculation below)

  • Limited GPU memory -> KV cache eviction -> recomputation, wasting GPU cycles and time
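The ~40GB figure can be sanity-checked with a quick back-of-the-envelope calculation, assuming an FP16 KV cache and LLaMA 3.1 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128):

    layers, kv_heads, head_dim = 80, 8, 128     # LLaMA 3.1 70B (grouped-query attention)
    bytes_per_elem = 2                          # FP16
    tokens = 128 * 1024

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
    total_gib = per_token * tokens / 2**30
    print(per_token / 1024, "KiB/token;", round(total_gib, 1), "GiB for 128K tokens")
    # -> 320.0 KiB/token; 40.0 GiB for 128K tokens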

44 of 61

Key Challenges

Scenario 2: Request Migration (OSDI'24 ServerlessLLM)

  • KV Cache Migration
  • Lack of fault tolerance -> complete loss of in-progress requests -> resuming failed requests is time-consuming

  • Dynamic workloads + slow instance bootstrapping -> scale up/down -> service degradation or downtime

45 of 61

Key Challenges

  • Hardware Constraints
  • Lower-end GPU instances w/o high-speed interconnects like RDMA

  • Many GPUs share a single VPC NIC, leading to significant network bandwidth contention

  • Selective KV cache offloading

Lower-End GPU Instance

46 of 61

Key Challenges

  • Hardware Constraints
  • High-end GPU instances w/ RDMA

  • Remote KV cache access and local KV cache access (on host DRAM) compete for PCIe bandwidth

  • Selective KV cache offloading preserves PCIe bandwidth so that local KV cache accesses stay low-latency

High-End GPU Instance

47 of 61

KV Cache Architectures

From Single Node GPU Cache => Distributed and Disaggregated KV Cache

48 of 61

AIBrix KV Cache Offloading Framework

Pluggable Eviction Policies

  • Selective KV cache offloading
  • Flexible to design new policies
  • Can be used to study characteristics of different eviction policies w/ LLM workloads

Pluggable Marshallers

  • KV-aware compression
  • Quantization

Pluggable Connectors

  • SSD backends
  • Networked backends
  • Managed KV cache services

RFC: https://github.com/vllm-project/vllm/issues/14724
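A hedged sketch of what these plugin seams could look like as interfaces; the class and method names are illustrative assumptions, not the framework's actual API (see the RFC above for the real design):

    from abc import ABC, abstractmethod
    from typing import Optional

    class EvictionPolicy(ABC):
        """Decides which KV blocks to drop or offload when the cache fills up."""
        @abstractmethod
        def on_access(self, block_id: str) -> None: ...
        @abstractmethod
        def select_victim(self) -> Optional[str]: ...

    class Marshaller(ABC):
        """Transforms KV tensors on the way out/in (e.g., compression, quantization)."""
        @abstractmethod
        def marshal(self, kv_bytes: bytes) -> bytes: ...
        @abstractmethod
        def unmarshal(self, payload: bytes) -> bytes: ...

    class Connector(ABC):
        """Moves marshalled KV data to/from a backend (SSD, networked store, managed service)."""
        @abstractmethod
        def put(self, key: str, payload: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> Optional[bytes]: ...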

49 of 61

Multi-LoRA Management in Production Environment

Presenter: Jiaxin Shan

50 of 61

Dense LoRA: Powering Efficient Finetuning at Scale

W_updated = W + ΔW = W + A·B, where A and B are low-rank adapter matrices (see the example below)

[Figure: adapter volume vs. deployment density]

Traffic is skewed between different models: some are critical production online models, while others are experimental with near-zero usage.
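To see why LoRA is cheap enough to deploy densely, compare parameter counts for a single projection; the dimension and rank below are illustrative (d = 4096 is typical for 7B-class models):

    d, r = 4096, 16                      # weight dimension and LoRA rank (illustrative)
    full_delta = d * d                   # updating W directly: ~16.8M parameters
    lora_delta = d * r + r * d           # A (d x r) and B (r x d): ~131K parameters
    print(full_delta, lora_delta, round(full_delta / lora_delta))   # ~128x fewer parameters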

51 of 61

High Density Deployment Challenges at Scale

[Figure: many LoRA adapters packed onto shared model:7b base instances]

Challenges:

1. LoRA is a container-internal manifest; it breaks the Kubernetes pod design and makes service discovery even harder

2. Request routing and LoRA scheduling become challenging

3. Adapters compete for resources with tenants, so contention problems exist

52 of 61

Case Study: Text2SQL

Text2TLS - SQL Like Language with specific syntax

AIBrix is deployed on VKE, utilizing 12 NVIDIA A10 cards. This setup substantially boosts our infrastructure's processing power, enabling support for more AI-driven code services, such as DBW testing and TLS production tasks through Bytebrain.

Text2SQL - Traditional SQL

50% GPU cost reduction!

53 of 61

LoRA Integration with vLLM

  1. Dynamic Model Adapter Management: Enables dynamic loading and unloading of LoRA model adapters, improving flexibility and resource efficiency by allowing models to be adjusted on demand.

  2. Advanced Scheduling Algorithm: Utilizes an intelligent scheduling algorithm that optimally places LoRA models on the right pods, minimizing interference and enhancing overall inference performance.

  3. LoRA-Specific Routing with EndpointSlice: Leverages Kubernetes EndpointSlice for precise LoRA-specific routing, ensuring efficient traffic distribution and optimized resource utilization.

[RFC]: Enhancing LoRA Management for Production Environments in vLLM
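For reference, dynamic adapter loading can be exercised against a vLLM server started with runtime LoRA updating enabled (VLLM_ALLOW_RUNTIME_LORA_UPDATING=True); the server address, adapter name, and path below are placeholders:

    import requests

    VLLM = "http://localhost:8000"   # placeholder server address

    # Register a LoRA adapter at runtime (vLLM dynamic LoRA endpoint).
    requests.post(f"{VLLM}/v1/load_lora_adapter",
                  json={"lora_name": "sql-adapter",           # placeholder adapter name
                        "lora_path": "/models/loras/sql"})     # placeholder adapter path

    # Requests can now target the adapter as if it were a model.
    resp = requests.post(f"{VLLM}/v1/completions",
                         json={"model": "sql-adapter",
                               "prompt": "SELECT", "max_tokens": 16})

    # Unload when traffic drops to reclaim GPU memory.
    requests.post(f"{VLLM}/v1/unload_lora_adapter",
                  json={"lora_name": "sql-adapter"})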

54 of 61

LoRA at Scale: Efficiency

AIBrix reduces resource cost by 4.7X when the application has a sparse workload, and still maintains 1.5X resource savings when the load is high.

New challenge: when should we use merged vs. unmerged weights?

Related work: OSDI'24 dLoRA

Using high-density LoRA deployment with the correct batch size provides cost reductions ranging from 8% to 2.1X.

New challenge: how do we prioritize latency-sensitive requests and address resource contention?

55 of 61

Open Research Challenges in LLM Inference Systems

Presenter: Jiaxin Shan

56 of 61

Open Research Challenges

1. High-density / serverless serving for LLM foundation models, or adaptive GPU multiplexing (ServerlessLLM, Prism)

2. Online & batch request colocation in LLM serving systems (BatchLLM, ConServe)

3. Heterogeneous placement and routing; compute resource standardization issues (Mélange)

4. Unified LLM routing challenges considering fairness, heterogeneity, LoRA, prefix cache, SLO, load, etc. (VTC, D2LPM, Preble)

5. Application-aware optimization (Autellix)

6. Large-scale simulation, output latency prediction, and roofline model analysis (Vidur, LLMViewer)

57 of 61

Thank You

58 of 61

Prefix-aware technical details

AIBrix implementation architecture


consistent view, dynamo

high availability (multi-replica)

59 of 61

Key Innovations Enhancing Performance and Reducing Costs

  • High-density LoRA management for efficient adapter scheduling
  • LLM-specific autoscalers for real-time dynamic scaling
  • Prefix-aware, load-aware request routing
  • Distributed KV cache improving token reuse
  • Unified AI runtime for seamless model management

60 of 61

Introduction: The Need for Scalable LLM Infrastructure

  • LLMs drive AI applications but require efficient deployment
  • Research opportunities:
    • Resource Efficiency
      • Efficient autoscaling
    • Resource/Performance Isolation
      • Supporting inference requests with different SLAs and resource requirements
      • Mixing online/offline inference
      • Performance isolation/fairness guarantees for multi-LoRA deployment
    • Resource Heterogeneity
      • Support for multi-level KV cache
      • Dynamic provisioning for heterogeneous accelerators
  • AIBrix provides a co-designed solution optimizing every layer

61 of 61

Conclusion: Summary & Future Work

  • AIBrix optimizes LLM inference with system-level orchestration
  • Innovations in autoscaling, routing, and GPU efficiency
  • Future improvements: expanded workload profiling, dynamic autoscaling refinements