1 of 28

Locality-Preserving Hyperconverged KV-Cache Offloading for Cost-Efficient LLM Inference

KamaTech Summer Project, September 2025

Chevi Koren, Sarah Swiatycki, Sara Turuver, Nechama Krashinski, Devora Greiniman

2 of 28

Datacenter Scale LLM Inference Framework


[Diagram: a KV-cache-aware router in front of 12 GPU servers, each with a local SSD; the router maintains a full KV-cache mapping]

The router monitors and maps every creation and eviction event of KV-cache entries.

  • Storage capacity coupled with compute (local SSD)
  • Hinders scalability (centralized management)
  • Memory footprint (global KV-cache map)
  • Complex routing policies

Worker (TRT-LLM, vLLM, SGLang)


3 of 28

Hyperconverged KV-Cache Offloading


[Left diagram: the KV-cache-aware design above – a router with a full KV-cache mapping over 12 GPU servers, each with a local SSD]

The router monitors and maps every creation and eviction event of KV-cache entries.

  • Capacity coupled with compute (local SSD)
  • Hinders scalability (centralized management)
  • Memory footprint (global KV-cache map)
  • Complex routing policies

Worker (TRT-LLM, vLLM, SGLang)

[Right diagram: a resource-aware router over the same 12 GPU servers; their SSDs are pooled into a shared hyperconverged KV-store]

  • Unlimited capacity
  • Offloads KV-pool index management at fine granularity
  • Supports elasticity, crash recovery, and PD (prefill/decode) reconfigurations
  • Lighter, resource-focused routing policy

KV-cache-oblivious routing over a hyperconverged KV-store
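A toy Python sketch of the routing difference; the class and function names are ours, not from any of the frameworks shown here. A KV-cache-aware router must consult a global prefix-to-server map, while a resource-aware router only needs per-server load:

from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    load: float                                   # e.g., fraction of HBM KV-cache in use
    cached_prefixes: set[str] = field(default_factory=set)

def route_kv_cache_aware(servers: list[Server], prefix_hash: str) -> Server:
    # Needs the full KV-cache mapping: prefer a server that already holds the prefix.
    hits = [s for s in servers if prefix_hash in s.cached_prefixes]
    return min(hits or servers, key=lambda s: s.load)

def route_resource_aware(servers: list[Server]) -> Server:
    # KV-cache oblivious: the shared KV-store serves every server, so just pick the least loaded one.
    return min(servers, key=lambda s: s.load)

servers = [Server("gpu-0", 0.7, {"abc"}), Server("gpu-1", 0.2)]
print(route_kv_cache_aware(servers, "abc").name)  # gpu-0 (prefix hit wins)
print(route_resource_aware(servers).name)         # gpu-1 (least loaded)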


4 of 28

Hyperconverged KV-Cache Offloading


[Diagram: the hyperconverged KV-store layer over 12 GPU servers (each with a local SSD), together with the components evaluated: Mooncake, KVRocks, llm-d, LMCache connector, Pliops connector, Nvidia Dynamo]


5 of 28


6 of 28

Dynamo Efficiency vs User Experience (benchmarks.xlsx)

[Benchmark chart; '×2' and 'Better' annotations]


7 of 28

Dynamo – KV-Cache Sizing & Clients Calculation

Dynamo: 4.37 GiB ≈ 4.7 GB of HBM KV-cache → 4.7 GB / 300 MB per request ≈ 15 clients

Dynamo + LMCache: 3.37 GiB ≈ 3.6 GB of HBM KV-cache → 3.6 GB / 300 MB per request ≈ 12 clients
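A minimal sketch of this sizing arithmetic, using the 300 MB-per-request figure from the benchmark; the helper name is ours:

def max_clients(kv_cache_gib: float, mb_per_request: float = 300.0) -> int:
    """Convert an HBM KV-cache budget (GiB) into a rough concurrent-client count."""
    kv_cache_mb = kv_cache_gib * 1024**3 / 1e6  # GiB -> decimal MB, as on the slide
    return int(kv_cache_mb // mb_per_request)

print(max_clients(4.37))  # Dynamo           -> 15 clients
print(max_clients(3.37))  # Dynamo + LMCache -> 12 clients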


8 of 28

llm-d Efficiency vs User Experience (benchmarks.xlsx)

[Benchmark chart; '×2' and 'Better' annotations; hit-rates: 80.00%, 79.81%, 78.50%]


9 of 28

llm-d GPU


10 of 28

llm-d GPU KV-Cache Sizing & Clients Calculation

Maximum clients calculation: 22,864 tokens / ~2,500 tokens per client → ~8 clients

Key configuration parameters:

  • Max model length: --max-model-len 6000
  • Local CPU cache: LMCACHE_MAX_LOCAL_CPU_SIZE = 256
  • Runtime environment: Kubernetes (k3s) deployments

“The overall design architecture is similar to Dynamo, but since llm-d runs on Kubernetes, deployment proved more complex, with additional overhead from the containerized environment. As a result, fewer concurrent clients could be supported, since performance degraded noticeably when scaling beyond the calculated threshold.”


11 of 28

Dynamo vs llm-d: Efficiency and User Experience Comparison

KV-cache size in HBM:

  • Dynamo: 35,776 tokens = 4.37 GiB
  • llm-d: 27,632 tokens
  • Dynamo + LMCache: 27,584 tokens = 3.37 GiB
  • llm-d + LMCache: 22,848 tokens

llm-d + LMCache hit rate: 80.00%


12 of 28

Mooncake: Architecture, Challenges & Solutions

A technical deep-dive into implementing a high-performance distributed caching system for Large Language Model inference optimization.


13 of 28

DRAM Usage Investigation – Mooncake Runtime

Problem description:

DRAM usage grew faster than expected: inserting 2 keys consumed ~1 GB of DRAM.

Retry: re-ran with a different, no-split script (vs. Michael’s split run).

Memory per token = 512 * 2 (K and V) * 2 bytes (bf16) * 28 layers = 56 KB

Memory per token = 1024 * 2 (K and V) * 2 bytes (bf16) * 40 layers = 160 KB
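A quick sketch of this per-token arithmetic; the helper name is ours, and the per-layer sizes are the ones from the slide:

def kv_bytes_per_token(per_layer_size: int, layers: int, bytes_per_elem: int = 2) -> int:
    # K and V tensors per layer, bf16 = 2 bytes per element
    return per_layer_size * 2 * bytes_per_elem * layers

print(kv_bytes_per_token(512, 28) // 1024)   # 56 KB per token
print(kv_bytes_per_token(1024, 40) // 1024)  # 160 KB per token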


14 of 28

Assumptions:

  • The cache log is printed only every X ops
  • More keys may be stored than the log shows

Investigation steps:

  • Attempted to access the code to add logs identifying the token count and examining stored memory; however, the vLLM v0 code wasn’t included in the install, making inspection difficult.
  • Verified: the log is written every 10 s.
  • Opened an issue to inspect the stored content (“The Issue” link).

Current status:

The investigation was interrupted midway; no conclusive results have been reached yet.


15 of 28

What is KVRocks?

  • Open-source key-value database
  • Fully compatible with the Redis protocol
  • Based on RocksDB; works on disk (SSD) instead of RAM
  • Can store terabytes at low cost while maintaining high performance
  • Supports cluster mode for massive scaling
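Because KVRocks speaks the Redis protocol, any standard Redis client can talk to it; a minimal sketch using redis-py (host and port below are placeholders for an actual deployment):

import redis

# KVRocks exposes the Redis protocol, so a plain Redis client works against it.
client = redis.Redis(host="localhost", port=6666)

client.set("kv-cache:prefix-hash-0001", b"\x00" * 1024)  # store a serialized KV block
block = client.get("kv-cache:prefix-hash-0001")          # read it back
print(len(block))  # 1024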


16 of 28

KVRocks Configuration

dir RAID0 NVMe

workers 32 

rocksdb.max_background_jobs 32

rocksdb.block_cache_size 419430 

rocksdb.write_buffer_size 512

compression lz4


17 of 28

Benchmark Timeout Handling

ISSUE: the server didn’t respond in time, so the benchmark client crashed with asyncio.exceptions.TimeoutError.

[INFO] - Client 0 is done (num_successes=0, num_failures=1)
[INFO] - 1 out of 12 clients finished
[INFO] - Sending termination signal to all clients

FIX: return a failure instead of crashing, so the system stays stable (see the sketch below).
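A minimal sketch of the fix, assuming an asyncio-based benchmark client; the coroutine names are illustrative, not the actual benchmark code:

import asyncio
import random

async def do_inference_request(client_id: int) -> None:
    # Stand-in for the real request coroutine used by the benchmark client.
    await asyncio.sleep(random.uniform(0.1, 2.0))

async def send_request(client_id: int, timeout_s: float = 1.0) -> bool:
    """Issue one request; count a timeout as a failure instead of crashing."""
    try:
        await asyncio.wait_for(do_inference_request(client_id), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        print(f"[INFO] - Client {client_id} timed out (num_failures += 1)")
        return False

async def main() -> None:
    results = await asyncio.gather(*(send_request(i) for i in range(12)))
    print(f"[INFO] - {sum(results)} out of {len(results)} clients succeeded")

asyncio.run(main())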


18 of 28

LMCache + KVRocks vs LMCache + DRAM

Scenario            | Chunk size | Max num batched tokens | Max model len | GPU memory utilization | KV cache in HBM (tokens)
DRAM, mml 6000      | 256        | 2048                   | 6000          | 0.95                   | 25,264
KVRocks, mml 6000   | 256        | 2048                   | 6000          | 0.95                   | 38,336
DRAM, chunk 1024    | 1024       | default                | 29,296        | 0.95                   | 38,336
KVRocks, chunk 1024 | 1024       | default                | 29,296        | 0.95                   | 34,272


19 of 28

Pliops Connector Integration with KVRocks

Code modifications:

Added:

  • kvrocks_backend.cpp
  • kvrocks_backend.hpp

Changed:

  • CMakeLists.txt (added a KVRocks support option)


20 of 28

Pliops Connector Integration with KVRocks

Using Multi-op Operation with KVRocks

Inefficiency: interacting with the shared memory takes a long time when such a large amount of data is handled in a single operation.


21 of 28

Pliops Connector Integration with KVRocks

Using Multi-op Operation with KVRocks

Solution: read and write the data from shared memory in parallel, using smaller blocks (sketched below).
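A rough Python illustration of the idea (the actual connector is C++): copy a large buffer through shared memory in parallel, one small block per task. The block size, worker count, and buffer contents here are arbitrary:

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import shared_memory

BLOCK = 1 << 20  # copy in 1 MiB blocks instead of one huge transfer

def write_chunked(shm: shared_memory.SharedMemory, payload: bytes, workers: int = 8) -> None:
    """Write payload into shared memory in parallel, one small block per task."""
    def copy(offset: int) -> None:
        shm.buf[offset:offset + BLOCK] = payload[offset:offset + BLOCK]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(copy, range(0, len(payload), BLOCK))

# Usage sketch
data = b"\xab" * (64 << 20)  # 64 MiB standing in for serialized KV-cache blocks
shm = shared_memory.SharedMemory(create=True, size=len(data))
write_chunked(shm, data)
assert bytes(shm.buf[:16]) == data[:16]
shm.close()
shm.unlink()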


22 of 28

LMCache vs Pliops Gateway: LMCache More Efficient – Benchmark Results


23 of 28

Dynamo 2 Nodes Results

[Benchmark chart; 'Better' direction marked]


24 of 28

KVRocks vs DRAM vs No KV-Cache Offloading

[Benchmark charts; 'Better' direction marked]


25 of 28

llm-d + KVRocks Integration Attempt (via LMCache)

NotImplementedError: memoryview: unsupported format <B
  File ".../lmcache/.../connector/redis_connector.py", line 91, in get
    view[:metadata.length] = kv_bytes

Old release: the old llm-d image ships with incompatible versions of Python, vLLM, and LMCache. → This breaks the LMCache remote backend (KVRocks), even though local LMCache works.

Latest release: the LMCache library packaged inside the latest llm-d image was compiled against a different PyTorch version than the one in the vLLM image. → Python fails to load the LMCache shared object during startup.

ImportError: /opt/vllm/lib64/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so:

undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs



26 of 28


27 of 28

Overall Summary

Dynamo vs llm-d

  • Dynamo takes up less HBM space than llm-d.
  • Dynamo's benchmark results were better than llm-d's.

Pliops Connector vs LMCache

  • The Pliops connector occupies more HBM space than LMCache.
  • LMCache's benchmark results were better than the Pliops connector's.

Larger chunks reduce the hit rate (256 → 77.7%, 1024 → 69.6%) but improve overall performance.

We were not able to show that the hyperconverged KV-store outperforms local DRAM.


28 of 28

Thank You!

Confidential