1 of 28

Locality-Preserving Hyperconverged KV-Cache Offloading for Cost-Efficient LLM Inference

KamaTech summer project Sep 2025

Chevi Koren ● Sarah Swiatycki ● Sara Turuver ● Nechama Krashinski ● Devora Greiniman

2 of 28

Datacenter Scale LLM Inference Framework

KV-cache aware routing

[Diagram: a router holding a full KV-cache mapping over a fleet of GPU servers, each with a local SSD; the router monitors and maps every creation and eviction event for KV-cache entries. Workers run TRT-LLM, vLLM, or SGLang.]

  • Storage capacity coupled with compute (local SSD)
  • Hinders scalability (centralized management)
  • Memory footprint (global KV-cache map)
  • Complex routing policies

LightningAI

3 of 28

Hyperconverged KV-Cache Offloading

KV-cache aware routing

[Diagram: a router holding a full KV-cache mapping over GPU servers with local SSDs, monitoring and mapping every creation and eviction event for KV-cache entries. Workers run TRT-LLM, vLLM, or SGLang.]

  • Capacity coupled with compute (local SSD)
  • Hinders scalability (centralized management)
  • Memory footprint (global KV-cache map)
  • Complex routing policies

Resource aware routing

[Diagram: KV-cache oblivious routing over the same GPU servers, backed by a shared hyperconverged KV-store instead of per-server state.]

  • Unlimited capacity
  • Offloads KV pool index management, fine granularity
  • Supports elasticity, crash recovery, PD reconfigurations
  • Lighter resource-focused routing policy

4 of 28

Hyperconverged KV-Cache Offloading

[Diagram: the GPU server fleet connected to a hyperconverged KV-store built on Mooncake and KVRocks, integrated via the LMCache connector and the Pliops connector with llm-d and NVIDIA Dynamo.]

5 of 28


6 of 28

Dynamo Efficiency vs User Experience (benchmarks.xlsx)

[Chart: Dynamo efficiency vs user experience; annotations: "x2", "Better".]

7 of 28

Dynamo - KV-Cache Sizing & Clients Calculation

Dynamo: 4.37 GiB × 1024³ ≈ 4.69 GB; 4.69 GB / 300 MB per request ≈ 15 clients

Dynamo + LMCache: 3.37 GiB × 1024³ ≈ 3.62 GB; 3.62 GB / 300 MB per request ≈ 12 clients
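The client-count arithmetic on this slide can be checked in a few lines. The GiB figures and the 300 MB per-request footprint come from the slide; decimal megabytes are an assumption, chosen because they reproduce the slide's results, and the helper name is illustrative:

```python
def max_clients(hbm_kv_gib: float, mb_per_request: float = 300.0) -> int:
    # KV-cache capacity in HBM (GiB) converted to bytes, divided by the
    # per-request KV footprint (assumed decimal MB), rounded down.
    return int(hbm_kv_gib * 1024**3 / (mb_per_request * 10**6))

print(max_clients(4.37))  # Dynamo: 15 clients
print(max_clients(3.37))  # Dynamo + LMCache: 12 clients
```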

8 of 28

llm-d Efficiency vs User Experience (benchmarks.xlsx)

[Chart: llm-d efficiency vs user experience; annotations: "x2", "Better"; per-series hit rates: *80.00, *79.81, *78.50 (*Hit-Rate).]

9 of 28

llm-d GPU

[Chart.]

10 of 28

llm-d GPU KV-Cache Sizing & Clients Calculation

Maximum Clients Calculation: 22,864 tokens / 2,500 tokens ~ 8 clients

Key Configuration Parameters:

Model length: --max-model-len 6000

Local CPU cache: LMCACHE_MAX_LOCAL_CPU_SIZE=256

Runtime environment: Kubernetes deployments (k3s)

"The overall design architecture is similar to Dynamo, but since llm-d runs on Kubernetes, the deployment proved more complex, with additional overhead from the containerized environment. As a result, fewer concurrent clients could be served, as performance degraded noticeably when scaling beyond the calculated threshold."

11 of 28

Dynamo vs llm-d: Efficiency and User Experience Comparison

KV-cache size in HBM:

  • Dynamo: 35,776 tokens = 4.37 GB
  • llm-d: 27,632 tokens
  • Dynamo + LMCache: 27,584 tokens = 3.37 GB
  • llm-d + LMCache: 22,848 tokens
  • llm-d + LMCache hit rate: 80.00%

12 of 28

MOONCAKE: Architecture, Challenges & Solutions

A technical deep-dive into implementing a high-performance distributed caching system for Large Language Model inference optimization.

13 of 28

DRAM Usage Investigation – MOONCAKE Runtime

Problem Description:

  • DRAM usage grew faster than expected
  • Inserting 2 keys consumed ~1 GB of DRAM

Retry: reran with a different, no-split script (vs. Michael's split run)

Expected memory per token:

memory_per_token = 512 × 2 (K/V) × 2 (bf16 bytes) × 28 (layers) = 56 KB

memory_per_token = 1024 × 2 (K/V) × 2 (bf16 bytes) × 40 (layers) = 160 KB
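The two per-token estimates above can be reproduced directly. The hidden sizes (512 / 1024), layer counts (28 / 40), and the 2-byte bf16 element all come from the slide; the helper name is illustrative:

```python
def kv_bytes_per_token(hidden: int, layers: int, bytes_per_elem: int = 2) -> int:
    # hidden values per token, stored twice (K and V), at bf16 width
    # (2 bytes), once per transformer layer
    return hidden * 2 * bytes_per_elem * layers

print(kv_bytes_per_token(512, 28) // 1024)   # 56 (KB)
print(kv_bytes_per_token(1024, 40) // 1024)  # 160 (KB)
```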

14 of 28

Assumption:

  • Cache log printed every X ops
  • More keys may be stored than shown

Investigation Steps:

We attempted to access the code to add logs for identifying the token count and examining the stored memory. However, the vLLM v0 code wasn't included in the install, making inspection difficult.

Verified: the log is written every 10 s.

An issue was opened to inspect the stored content (The Issue).

Current Status:

The investigation was interrupted midway; no conclusive results have been reached yet.

15 of 28

What is KVRocks?

  • Open-source key-value database
  • Fully compatible with the Redis protocol
  • Based on RocksDB; works on disk (SSD) instead of RAM
  • Can store terabytes at low cost while maintaining high performance
  • Supports Cluster Mode for massive scaling

16 of 28

KVRocks Configuration

dir (RAID0 NVMe array)
workers 32
rocksdb.max_background_jobs 32
rocksdb.block_cache_size 419430
rocksdb.write_buffer_size 512
compression lz4

17 of 28

Benchmark Timeout Handling

ISSUE: the server didn't respond in time → the client crashed

asyncio.exceptions.TimeoutError
[INFO] - Client 0 is done (num_successes=0, num_failures=1)
[INFO] - 1 out of 12 clients finished
[INFO] - Sending termination signal to all clients

FIX: added a failure return instead of crashing → the system stays stable.
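The fix can be sketched with a minimal asyncio client. `send_request`, the delay, and the timeout are hypothetical stand-ins for the benchmark's request path, not the actual benchmark code:

```python
import asyncio

async def send_request(delay: float) -> str:
    # Hypothetical stand-in for one benchmark request to the server.
    await asyncio.sleep(delay)
    return "ok"

async def run_client(delay: float, timeout: float) -> dict:
    # Instead of letting TimeoutError propagate (crashing the client),
    # record a failure and let the benchmark keep running.
    try:
        await asyncio.wait_for(send_request(delay), timeout=timeout)
        return {"num_successes": 1, "num_failures": 0}
    except asyncio.TimeoutError:
        return {"num_successes": 0, "num_failures": 1}

result = asyncio.run(run_client(delay=0.2, timeout=0.05))
print(result)  # server too slow -> counted as a failure, no crash
```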

18 of 28

LMCache + KVRocks vs LMCache + DRAM

| Scenario | Chunk size | max num batched tokens | max model len | gpu memory utilization | KV Cache in HBM |
|---|---|---|---|---|---|
| DRAM mml 6000 | 256 | 2048 | 6000 | 0.95 | 25264 |
| KVRocks mml 6000 | 256 | 2048 | 6000 | 0.95 | 38336 |
| DRAM chunk 1024 | 1024 | default | 29296 | 0.95 | 38336 |
| KVRocks chunk 1024 | 1024 | default | 29296 | 0.95 | 34272 |

19 of 28

Pliops Connector Integration with KVRocks

Code modifications:

Added:
  kvrocks_backend.cpp
  kvrocks_backend.hpp

Changed:
  CMakeLists.txt (added a KVRocks support option)

20 of 28

Pliops Connector Integration with KVRocks

Using Multi-op Operation with KVRocks

Inefficiency: interacting with the shared memory takes a long time when handling such a large amount of data at once.

21 of 28

Pliops Connector Integration with KVRocks

Using Multi-op Operation with KVRocks

Solution: reading and writing the data from shared memory in parallel, using smaller blocks.
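The idea of splitting one large shared-memory transfer into smaller parallel block copies can be sketched as follows. The block size, worker count, and function names are illustrative assumptions, not the connector's actual C++ implementation:

```python
# Sketch: write a large payload into shared memory as parallel 1 MiB blocks
# instead of one monolithic copy.
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import shared_memory

BLOCK = 1 << 20  # 1 MiB per block (illustrative choice)

def parallel_write(shm: shared_memory.SharedMemory, payload: bytes,
                   workers: int = 4) -> None:
    def copy_block(off: int) -> None:
        end = min(off + BLOCK, len(payload))
        shm.buf[off:end] = payload[off:end]   # each worker copies one block
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_block, range(0, len(payload), BLOCK)))

data = bytes(range(256)) * (16 * 1024)  # ~4 MiB test payload
shm = shared_memory.SharedMemory(create=True, size=len(data))
parallel_write(shm, data)
assert bytes(shm.buf[:len(data)]) == data
shm.close(); shm.unlink()
```

In CPython the per-block copies release the GIL only during the raw memmove, so the win here is mainly pipelining; the C++ connector gets true parallelism.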

22 of 28

LMCache vs Pliops Gateway: LMCache More Efficient (Benchmark Results)

23 of 28

Dynamo 2 Nodes Results

[Chart; "Better" annotation.]

24 of 28

KVRocks vs DRAM vs no KV-Cache Offloading

[Charts; "Better" annotations.]

25 of 28

llm-d + KVRocks Integration Attempt (via LMCache)

Old image: the old llm-d image ships with incompatible versions of Python, vLLM, and LMCache → this breaks the LMCache remote backend (KVRocks), even though local LMCache works:

  File ".../lmcache/.../connector/redis_connector.py", line 91, in get
    view[:metadata.length] = kv_bytes
  NotImplementedError: memoryview: unsupported format <B
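The `<B` failure can be reproduced outside llm-d: CPython refuses element copies into a memoryview whose format string carries a byte-order prefix, which is how ctypes-backed buffers (e.g. multiprocessing shared arrays) expose themselves. This is a generic illustration of that error class, assuming a ctypes buffer; it is not the actual LMCache code path, and a raw byte-level copy side-steps the format check:

```python
import ctypes

raw = (ctypes.c_uint8 * 4)()
view = memoryview(raw)   # ctypes buffers report format '<B', not plain 'B'
try:
    view[:4] = b"abcd"   # same pattern as the redis_connector.py assignment
    raised = False
except (NotImplementedError, ValueError):
    raised = True        # e.g. "memoryview: unsupported format <B"

# A byte-level copy avoids the memoryview format check entirely:
ctypes.memmove(raw, b"abcd", 4)
print(raised, bytes(raw))
```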

Latest release: the LMCache library packaged inside the latest llm-d image was compiled against a different PyTorch version than the one included in the vLLM image → Python fails to load the LMCache shared object during startup:

  ImportError: /opt/vllm/lib64/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so:
  undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs

25

26 of 28


27 of 28

Summary

Dynamo vs llm-d:

  • Dynamo takes up less space in HBM than llm-d
  • Dynamo's results were better than llm-d's

Pliops Connector vs LMCache:

  • The Pliops connector occupies more space in HBM than LMCache
  • LMCache's results were better than the Pliops connector's

Chunk size:

  • Larger chunks reduce hit rate (256 → 77.7%, 1024 → 69.6%) but improve overall performance

Conclusion:

  • We were not able to show superiority of the hyperconverged KV-store over local DRAM

28 of 28

Thank You!

Confidential