Locality-Preserving Hyperconverged KV-Cache Offloading for Cost-Efficient LLM Inference
KamaTech Summer Project, Sep 2025
Chevi Koren • Sarah Swiatycki • Sara Turuver • Nechama Krashinski • Devora Greiniman
Datacenter-Scale LLM Inference Framework
[Diagram: KV-cache-aware routing. A router holds a full KV-cache mapping across a fleet of GPU servers, each with a local SSD, and monitors and maps all creation and eviction events of KV-cache entries. Each worker runs TRT-LLM, vLLM, or SGLang.]
Hyperconverged KV-Cache Offloading
[Diagram, left: KV-cache-aware routing, as above, with a full KV-cache mapping over GPU servers and local SSDs, monitoring all creation and eviction events of KV-cache entries; workers run TRT-LLM, vLLM, or SGLang. Right: resource-aware but KV-cache-oblivious routing over the same GPU servers, whose SSDs form a hyperconverged KV-store.]
Hyperconverged KV-Cache Offloading
[Diagram: the hyperconverged KV-store built from the GPU servers' local SSDs, and the components integrated around it: Mooncake, KVRocks, llm-d, LMCache connector, Pliops connector, NVIDIA Dynamo.]
Dynamo Efficiency vs. User Experience (benchmarks.xlsx)
[Chart: benchmark results; "x2" annotation; arrow marks the "better" direction.]
Dynamo - KV-Cache Sizing & Clients Calculation
Dynamo: 4.37 × 1024³ bytes ≈ 4.6 GB; 4.6 GB ÷ 300 MB (per request) ≈ 15 clients
Dynamo + LMCache: 3.37 × 1024³ bytes ≈ 3.6 GB; 3.6 GB ÷ 300 MB (per request) ≈ 12 clients
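The same estimate can be scripted; a minimal sketch, assuming (as above) that each in-flight request pins roughly 300 MB of KV-cache:

```python
# Sketch: estimate the maximum concurrent clients from free KV-cache capacity in HBM.
# Assumption (from the slide): each request pins ~300 MB of KV-cache.
GIB = 1024 ** 3              # cache sizes above are given as x * 1024^3 bytes
PER_REQUEST = 300 * 10**6    # ~300 MB held per request

def max_clients(kv_cache_gib: float) -> int:
    """Floor of available KV-cache bytes over bytes held per request."""
    return int(kv_cache_gib * GIB // PER_REQUEST)

print(max_clients(4.37))  # Dynamo           -> 15
print(max_clients(3.37))  # Dynamo + LMCache -> 12
```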
llm-d Efficiency vs. User Experience (benchmarks.xlsx)
[Chart: benchmark results; "x2" annotation; arrow marks the "better" direction; hit rates annotated: 80.00%, 79.81%, 78.50%.]
llm-d GPU
llm-d GPU KV-Cache Sizing & Clients Calculation
Maximum clients calculation: 22,864 tokens / 2,500 tokens (per client) ≈ 8 clients
Key Configuration Parameters:
Model length: --max-model-len 6000
Local CPU cache: LMCACHE_MAX_LOCAL_CPU_SIZE = 256
Runtime environment: Kubernetes deployment (k3s)
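A minimal sketch of how these two knobs could be wired into a vLLM-based llm-d worker; the model name is a placeholder, and everything except the two values above is illustrative:

```python
# Sketch: launch a vLLM worker with the slide's settings.
# MODEL_NAME is a placeholder; only --max-model-len and
# LMCACHE_MAX_LOCAL_CPU_SIZE come from the slide.
import os
import subprocess

env = dict(os.environ)
env["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "256"  # LMCache local CPU cache size

subprocess.run(
    ["vllm", "serve", "MODEL_NAME", "--max-model-len", "6000"],
    env=env,
    check=True,
)
```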
"The overall design architecture is similar to Dynamo's, but since llm-d runs on Kubernetes, the deployment proved more complex, with additional overhead from the containerized environment. As a result, fewer concurrent clients could be sustained: performance degraded noticeably when scaling beyond the calculated threshold."
Dynamo vs. llm-d: Efficiency and User Experience Comparison
KV-cache size in HBM:
Configuration | Tokens in HBM | Size
Dynamo | 35,776 | 4.37 GB
llm-d | 27,632 |
Dynamo + LMCache | 27,584 | 3.37 GB
llm-d + LMCache | 22,848 |
llm-d + LMCache hit rate: 80.00%
MOONCAKE: Architecture, Challenges & Solutions
A technical deep-dive into implementing a high-performance distributed caching system for Large Language Model inference optimization.
DRAM Usage Investigation - MOONCAKE Runtime
Problem Description:
DRAM usage grew faster than expected.
Inserting 2 keys consumed ~1 GB of DRAM.
Retry: re-ran with a different, no-split script (vs. Michael's split run).
memory_per_token = 512 × 2 (K/V) × 2 (bf16 bytes) × 28 (layers) = 56 KB
memory_per_token = 1024 × 2 (K/V) × 2 (bf16 bytes) × 40 (layers) = 160 KB
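A sketch of the same per-token arithmetic, useful for cross-checking the observed DRAM growth (at 160 KB per token, ~1 GB across 2 keys would correspond to roughly 3,300 tokens per key):

```python
# Sketch: per-token KV-cache footprint, reproducing the two figures above.
def kv_bytes_per_token(dim: int, layers: int, dtype_bytes: int = 2) -> int:
    """dim * 2 (K and V) * dtype_bytes (bf16 = 2) * layers."""
    return dim * 2 * dtype_bytes * layers

print(kv_bytes_per_token(512, 28) // 1024)    # -> 56 (KB)
print(kv_bytes_per_token(1024, 40) // 1024)   # -> 160 (KB)

# Cross-check against the observation: ~1 GB for 2 inserted keys.
tokens_total = (1 << 30) // kv_bytes_per_token(1024, 40)
print(tokens_total // 2)                      # -> ~3276 tokens per key
```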
Assumption:
We attempted to access the code to add logging that would identify the token count and examine the stored memory.
However, the vLLM v0 code wasn't included in the install, making inspection difficult.
Verified: the log is written every 10 s.
Investigation Steps:
An issue was opened to inspect the stored content ("The Issue").
Current Status:
The investigation was interrupted midway.
No conclusive results have been reached yet.
What is KVRocks?
Open-source key-value database
Fully compatible with the Redis protocol (see the sketch after this list)
Based on RocksDB; stores data on disk (SSD) instead of in RAM
Can store terabytes at low cost while maintaining high performance
Supports Cluster Mode for massive scaling
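Because KVRocks speaks the Redis protocol, any Redis client can talk to it; a minimal sketch, assuming a KVRocks server on its default port 6666 and the redis-py package:

```python
# Sketch: use a standard Redis client against KVRocks.
# Assumptions: KVRocks listening on localhost:6666 (its default), redis-py installed.
import redis

kv = redis.Redis(host="localhost", port=6666)
kv.set("kvcache:chunk:0", b"\x00" * 1024)   # persisted to SSD via RocksDB
print(kv.get("kvcache:chunk:0")[:8])        # read back like any Redis value
```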
KVRocks Configuration
dir                           <RAID0 NVMe array>
workers                       32
rocksdb.max_background_jobs   32
rocksdb.block_cache_size      419430
rocksdb.write_buffer_size     512
compression                   lz4
Benchmark Timeout Handling
ISSUE: the server didn't respond in time, and the client crashed with:
asyncio.exceptions.TimeoutError
[INFO] - Client 0 is done (num_successes=0, num_failures=1)
[INFO] - 1 out of 12 clients finished
[INFO] - Sending termination signal to all clients
FIX: return a failure result instead of crashing, so the system stays stable (see the sketch below).
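A minimal sketch of the fix, assuming an asyncio-based benchmark client; run_request and the result dict are illustrative, not the benchmark's actual API:

```python
# Sketch: record a timeout as a failure instead of letting it crash the run.
import asyncio

async def run_request(client_id: int) -> None:
    """Stand-in for one benchmark request (illustrative)."""
    await asyncio.sleep(120)  # simulate a server that never answers in time

async def run_client(client_id: int, timeout_s: float = 1.0) -> dict:
    try:
        await asyncio.wait_for(run_request(client_id), timeout=timeout_s)
        return {"client": client_id, "num_successes": 1, "num_failures": 0}
    except asyncio.TimeoutError:
        # Before the fix, this exception propagated and took the whole run down.
        return {"client": client_id, "num_successes": 0, "num_failures": 1}

print(asyncio.run(run_client(0)))  # -> num_successes=0, num_failures=1
```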
LMCache + KVRocks vs. LMCache + DRAM
Scenario | Chunk size | Max num batched tokens | Max model len | GPU memory utilization | KV cache in HBM (tokens)
DRAM, mml 6000 | 256 | 2048 | 6000 | 0.95 | 25,264
KVRocks, mml 6000 | 256 | 2048 | 6000 | 0.95 | 38,336
DRAM, chunk 1024 | 1024 | default | 29,296 | 0.95 | 38,336
KVRocks, chunk 1024 | 1024 | default | 29,296 | 0.95 | 34,272
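For reference, the chunk size in these scenarios is an LMCache setting; a minimal sketch, assuming LMCache reads LMCACHE_CHUNK_SIZE from the environment (its default chunk size is 256 tokens):

```python
# Sketch: express the two chunk-size scenarios as LMCache configuration.
# Assumption: LMCache picks up LMCACHE_CHUNK_SIZE from the environment.
import os

def configure_chunk_size(tokens: int) -> None:
    os.environ["LMCACHE_CHUNK_SIZE"] = str(tokens)

configure_chunk_size(256)   # the "mml 6000" rows above
configure_chunk_size(1024)  # the "chunk 1024" rows above
```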
Pliops Connector Integration with KVRocks
Pliops Connector Integration with KVRocks
Using Multi-op Operations with KVRocks
Inefficiency: interacting with the shared memory takes a long time when handling such a large amount of data at once.
Pliops Connector Integration with KVRocks
Using Multi-op Operations with KVRocks
Solution: read and write the data from shared memory in parallel, using smaller blocks (see the sketch below).
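A minimal sketch of the idea using Python's multiprocessing.shared_memory and a thread pool; the block size and all names are illustrative, not the connector's actual code:

```python
# Sketch: copy a large payload through shared memory in smaller parallel blocks,
# instead of one big transfer.
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import shared_memory

BLOCK = 4 * 1024 * 1024  # 4 MiB per block (illustrative)

def write_parallel(shm: shared_memory.SharedMemory, payload: bytes, workers: int = 8) -> None:
    def copy_block(off: int) -> None:
        block = payload[off:off + BLOCK]
        shm.buf[off:off + len(block)] = block
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_block, range(0, len(payload), BLOCK)))

data = b"\xab" * (64 * 1024 * 1024)  # 64 MiB test payload
shm = shared_memory.SharedMemory(create=True, size=len(data))
write_parallel(shm, data)
assert bytes(shm.buf[:16]) == data[:16]
shm.close()
shm.unlink()
```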
LMCache vs. Pliops Gateway: LMCache More Efficient (Benchmark Results)
Dynamo 2-Node Results
[Chart: benchmark results; arrow marks the "better" direction.]
KVRocks vs. DRAM vs. No KV-Cache Offloading
[Charts: benchmark results; arrows mark the "better" direction.]
llm-d + KVRocks Integration Attempt (via LMCache)
File ".../lmcache/.../connector/redis_connector.py", line 91, in get
    view[:metadata.length] = kv_bytes
NotImplementedError: memoryview: unsupported format <B
The old llm-d image ships with incompatible versions of Python, vLLM, and LMCache → this breaks the LMCache remote backend (KVRocks), even though local LMCache works.
Latest release: the LMCache library packaged inside the latest llm-d image was compiled against a different PyTorch version than the one included in the vLLM image → Python fails to load the LMCache shared object during startup.
ImportError: /opt/vllm/lib64/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs
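A quick sanity check for this kind of ABI mismatch, run inside the image; a minimal sketch:

```python
# Sketch: confirm whether LMCache's compiled extension loads against the
# PyTorch shipped in the image.
import importlib
import torch

print("PyTorch in the image:", torch.__version__)
try:
    importlib.import_module("lmcache.c_ops")  # the shared object that failed above
    print("lmcache.c_ops loaded OK")
except ImportError as exc:
    # An "undefined symbol" here means LMCache was built against a different PyTorch.
    print("ABI mismatch:", exc)
```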
Summary
Dynamo vs. llm-d
Pliops Connector vs. LMCache
Larger chunks reduce hit rate (256 → 77.7%, 1024 → 69.6%) but improve overall performance.
We were not able to show that the hyperconverged KV-store is superior to using local DRAM.
Thank You!