1 of 28

Locality-Preserving Hyperconverged KV-Cache Offloading for Cost-Efficient LLM Inference

KamaTech Summer Project, September 2025

Chevi Koren, Sarah Swiatycki, Sara Turuver, Nechama Krashinski, Devora Greiniman

2 of 28

Datacenter Scale LLM Inference Framework


[Diagram: a KV-cache-aware router in front of 12 GPU servers, each with a local SSD; the router maintains a full KV-cache mapping]

The router monitors and maps every creation and eviction event of KV-cache entries.

  • Storage capacity coupled with compute (local SSD)
  • Hinders scalability (centralized management)
  • Memory footprint (global KV-cache map)
  • Complex routing policies

Worker (TRT-LLM, vLLM, SGLang)


3 of 28

Hyperconverged KV-Cache Offloading


[Left diagram: the KV-cache-aware design above – a router with a full KV-cache mapping over 12 GPU servers, each with a local SSD]

The router monitors and maps every creation and eviction event of KV-cache entries.

  • Capacity coupled with compute (local SSD)
  • Hinders scalability (centralized management)
  • Memory footprint (global KV-cache map)
  • Complex routing policies

Worker (TRT-LLM, vLLM, SGLang)

[Right diagram: a resource-aware router over the same 12 GPU servers; their SSDs are pooled into a shared hyperconverged KV-store]

  • Unlimited capacity
  • Offloads KV-pool index management at fine granularity
  • Supports elasticity, crash recovery, and PD (prefill/decode) reconfigurations
  • Lighter, resource-focused routing policy

KV-cache-oblivious routing over a hyperconverged KV-store
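A toy Python sketch of the routing difference; the class and function names are ours, not from any of the frameworks shown here. A KV-cache-aware router must consult a global prefix-to-server map, while a resource-aware router only needs per-server load:

from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    load: float                                   # e.g., fraction of HBM KV-cache in use
    cached_prefixes: set[str] = field(default_factory=set)

def route_kv_cache_aware(servers: list[Server], prefix_hash: str) -> Server:
    # Needs the full KV-cache mapping: prefer a server that already holds the prefix.
    hits = [s for s in servers if prefix_hash in s.cached_prefixes]
    return min(hits or servers, key=lambda s: s.load)

def route_resource_aware(servers: list[Server]) -> Server:
    # KV-cache oblivious: the shared KV-store serves every server, so just pick the least loaded one.
    return min(servers, key=lambda s: s.load)

servers = [Server("gpu-0", 0.7, {"abc"}), Server("gpu-1", 0.2)]
print(route_kv_cache_aware(servers, "abc").name)  # gpu-0 (prefix hit wins)
print(route_resource_aware(servers).name)         # gpu-1 (least loaded)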


4 of 28

Hyperconverged KV-Cache Offloading


[Diagram: the hyperconverged KV-store layer over 12 GPU servers (each with a local SSD), together with the components evaluated: Mooncake, KVRocks, llm-d, LMCache connector, Pliops connector, Nvidia Dynamo]


5 of 28


6 of 28

Dynamo Efficiency vs User Experience (benchmarks.xlsx)

[Benchmark chart; '×2' and 'Better' annotations]


7 of 28

Dynamo – KV-Cache Sizing & Clients Calculation

Dynamo: 4.37 GiB ≈ 4.7 GB of HBM KV-cache → 4.7 GB / 300 MB per request ≈ 15 clients

Dynamo + LMCache: 3.37 GiB ≈ 3.6 GB of HBM KV-cache → 3.6 GB / 300 MB per request ≈ 12 clients
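A minimal sketch of this sizing arithmetic, using the 300 MB-per-request figure from the benchmark; the helper name is ours:

def max_clients(kv_cache_gib: float, mb_per_request: float = 300.0) -> int:
    """Convert an HBM KV-cache budget (GiB) into a rough concurrent-client count."""
    kv_cache_mb = kv_cache_gib * 1024**3 / 1e6  # GiB -> decimal MB, as on the slide
    return int(kv_cache_mb // mb_per_request)

print(max_clients(4.37))  # Dynamo           -> 15 clients
print(max_clients(3.37))  # Dynamo + LMCache -> 12 clients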


8 of 28

llm-d Efficiency vs User Experience (benchmarks.xlsx)

[Benchmark chart; '×2' and 'Better' annotations; hit-rates: 80.00%, 79.81%, 78.50%]


9 of 28

llm-d GPU


10 of 28

llm-d GPU KV-Cache Sizing & Clients Calculation

Maximum clients calculation: 22,864 tokens / ~2,500 tokens per client → ~8 clients

Key configuration parameters:

  • Max model length: --max-model-len 6000
  • Local CPU cache: LMCACHE_MAX_LOCAL_CPU_SIZE = 256
  • Runtime environment: Kubernetes (k3s) deployments

“The overall design architecture is similar to Dynamo, but since llm-d runs on Kubernetes, deployment proved more complex, with additional overhead from the containerized environment. As a result, fewer concurrent clients could be supported, since performance degraded noticeably when scaling beyond the calculated threshold.”


11 of 28

Dynamo vs llm-d: Efficiency and User Experience Comparison

KV-cache size in HBM:

  • Dynamo: 35,776 tokens = 4.37 GiB
  • llm-d: 27,632 tokens
  • Dynamo + LMCache: 27,584 tokens = 3.37 GiB
  • llm-d + LMCache: 22,848 tokens

llm-d + LMCache hit rate: 80.00%


12 of 28

Mooncake: Architecture, Challenges & Solutions

A technical deep-dive into implementing a high-performance distributed caching system for Large Language Model inference optimization.


13 of 28

DRAM Usage Investigation – Mooncake Runtime

Problem description:

DRAM usage grew faster than expected: inserting 2 keys consumed ~1 GB of DRAM.

Retry: re-ran with a different, no-split script (vs. Michael’s split run).

Memory per token = 512 * 2 (K and V) * 2 bytes (bf16) * 28 layers = 56 KB

Memory per token = 1024 * 2 (K and V) * 2 bytes (bf16) * 40 layers = 160 KB
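A quick sketch of this per-token arithmetic; the helper name is ours, and the per-layer sizes are the ones from the slide:

def kv_bytes_per_token(per_layer_size: int, layers: int, bytes_per_elem: int = 2) -> int:
    # K and V tensors per layer, bf16 = 2 bytes per element
    return per_layer_size * 2 * bytes_per_elem * layers

print(kv_bytes_per_token(512, 28) // 1024)   # 56 KB per token
print(kv_bytes_per_token(1024, 40) // 1024)  # 160 KB per token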


14 of 28

Assumptions:

  • The cache log is printed only every X ops
  • More keys may be stored than the log shows

Investigation steps:

  • Attempted to access the code to add logs identifying the token count and examining stored memory; however, the vLLM v0 code wasn’t included in the install, making inspection difficult.
  • Verified: the log is written every 10 s.
  • Opened an issue to inspect the stored content (“The Issue” link).

Current status:

The investigation was interrupted midway; no conclusive results have been reached yet.


15 of 28

What is KVRocks?

  • Open-source key-value database
  • Fully compatible with the Redis protocol
  • Based on RocksDB; works on disk (SSD) instead of RAM
  • Can store terabytes at low cost while maintaining high performance
  • Supports cluster mode for massive scaling
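Because KVRocks speaks the Redis protocol, any standard Redis client can talk to it; a minimal sketch using redis-py (host and port below are placeholders for an actual deployment):

import redis

# KVRocks exposes the Redis protocol, so a plain Redis client works against it.
client = redis.Redis(host="localhost", port=6666)

client.set("kv-cache:prefix-hash-0001", b"\x00" * 1024)  # store a serialized KV block
block = client.get("kv-cache:prefix-hash-0001")          # read it back
print(len(block))  # 1024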


16 of 28

KVRocks Configuration

dir RAID0 NVMe

workers 32 

rocksdb.max_background_jobs 32

rocksdb.block_cache_size 419430 

rocksdb.write_buffer_size 512

compression lz4


17 of 28

Benchmark Timeout Handling

ISSUE: the server didn’t respond in time, so the benchmark client crashed with asyncio.exceptions.TimeoutError.

[INFO] - Client 0 is done (num_successes=0, num_failures=1)
[INFO] - 1 out of 12 clients finished
[INFO] - Sending termination signal to all clients

FIX: return a failure instead of crashing, so the system stays stable (see the sketch below).
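A minimal sketch of the fix, assuming an asyncio-based benchmark client; the coroutine names are illustrative, not the actual benchmark code:

import asyncio
import random

async def do_inference_request(client_id: int) -> None:
    # Stand-in for the real request coroutine used by the benchmark client.
    await asyncio.sleep(random.uniform(0.1, 2.0))

async def send_request(client_id: int, timeout_s: float = 1.0) -> bool:
    """Issue one request; count a timeout as a failure instead of crashing."""
    try:
        await asyncio.wait_for(do_inference_request(client_id), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        print(f"[INFO] - Client {client_id} timed out (num_failures += 1)")
        return False

async def main() -> None:
    results = await asyncio.gather(*(send_request(i) for i in range(12)))
    print(f"[INFO] - {sum(results)} out of {len(results)} clients succeeded")

asyncio.run(main())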


18 of 28

LMCache + KVRocks vs LMCache + DRAM

Scenario            | Chunk size | Max num batched tokens | Max model len | GPU memory utilization | KV cache in HBM (tokens)
DRAM, mml 6000      | 256        | 2048                   | 6000          | 0.95                   | 25,264
KVRocks, mml 6000   | 256        | 2048                   | 6000          | 0.95                   | 38,336
DRAM, chunk 1024    | 1024       | default                | 29,296        | 0.95                   | 38,336
KVRocks, chunk 1024 | 1024       | default                | 29,296        | 0.95                   | 34,272


19 of 28

Pliops Connector Integration with KVRocks

Code modifications:

Added:

  • kvrocks_backend.cpp
  • kvrocks_backend.hpp

Changed:

  • CMakeLists.txt (added a KVRocks support option)


20 of 28

Pliops Connector Integration with KVRocks

Using Multi-op Operation with KVRocks

Inefficiency: interacting with the shared memory takes a long time when such a large amount of data is handled in a single operation.


21 of 28

Pliops Connector Integration with KVRocks

Using Multi-op Operation with KVRocks

Solution: read and write the data from shared memory in parallel, using smaller blocks (sketched below).
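A rough Python illustration of the idea (the actual connector is C++): copy a large buffer through shared memory in parallel, one small block per task. The block size, worker count, and buffer contents here are arbitrary:

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import shared_memory

BLOCK = 1 << 20  # copy in 1 MiB blocks instead of one huge transfer

def write_chunked(shm: shared_memory.SharedMemory, payload: bytes, workers: int = 8) -> None:
    """Write payload into shared memory in parallel, one small block per task."""
    def copy(offset: int) -> None:
        shm.buf[offset:offset + BLOCK] = payload[offset:offset + BLOCK]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(copy, range(0, len(payload), BLOCK))

# Usage sketch
data = b"\xab" * (64 << 20)  # 64 MiB standing in for serialized KV-cache blocks
shm = shared_memory.SharedMemory(create=True, size=len(data))
write_chunked(shm, data)
assert bytes(shm.buf[:16]) == data[:16]
shm.close()
shm.unlink()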


22 of 28

LMCache vs Pliops Gateway: LMCache More Efficient – Benchmark Results


23 of 28

Dynamo 2 Nodes Results

[Benchmark chart; 'Better' direction marked]


24 of 28

KVRocks vs DRAM vs No KV-Cache Offloading

[Benchmark charts; 'Better' direction marked]


25 of 28

llm-d + KVRocks Integration Attempt (via LMCache)

NotImplementedError: memoryview: unsupported format <B
  File ".../lmcache/.../connector/redis_connector.py", line 91, in get
    view[:metadata.length] = kv_bytes

Old release: the old llm-d image ships with incompatible versions of Python, vLLM, and LMCache. → This breaks the LMCache remote backend (KVRocks), even though local LMCache works.

Latest release: the LMCache library packaged inside the latest llm-d image was compiled against a different PyTorch version than the one in the vLLM image. → Python fails to load the LMCache shared object during startup.

ImportError: /opt/vllm/lib64/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so:

undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs



26 of 28


27 of 28

Overall Summary

Dynamo vs llm-d

  • Dynamo takes up less HBM space than llm-d.
  • Dynamo's benchmark results were better than llm-d's.

Pliops Connector vs LMCache

  • The Pliops connector occupies more HBM space than LMCache.
  • LMCache's benchmark results were better than the Pliops connector's.

Larger chunks reduce the hit rate (256 → 77.7%, 1024 → 69.6%) but improve overall performance.

We were not able to show that the hyperconverged KV-store outperforms local DRAM.


28 of 28

Thank You!

Confidential