Locality-Preserving Hyperconverged KV-Cache Offloading for Cost-Efficient LLM Inference
KamaTech summer project Sep 2025
Chevi Koren ● Sarah Swiatycki ● Sara Turuver ● Nechama Krashinski ● Devora Greiniman
Datacenter Scale LLM Inference Framework
2
[Diagram: KV-cache aware routing across a fleet of GPU servers, each with a local SSD. A full KV-cache mapping monitors every creation and eviction event of kv-cache entries; each worker runs TRT-LLM, vLLM, or SGLang.]
LightningAI
Hyperconverged KV-Cache Offloading
3
[Diagram, left: KV-cache aware routing, where a full KV-cache mapping monitors every creation and eviction event of kv-cache entries across the GPU servers and their SSDs; workers run TRT-LLM, vLLM, or SGLang. Right: resource-aware, KV-cache oblivious routing in front of a hyperconverged KV-store spanning the GPU servers' SSDs.]
Hyperconverged KV-Cache Offloading
4
[Diagram: candidate building blocks mapped onto the GPU server/SSD fleet: Mooncake, KVRocks, llm-d, LMCache connector, Pliops connector, NVIDIA Dynamo]
5
Dynamo Efficiency vs User Experience (benchmarks.xlsx)
[Chart: x2 annotation; arrow marks the better direction]
6
Dynamo - KV-Cache Sizing & Clients Calculation
Dynamo: 4.37 GiB = 4.37 x 1024^3 bytes ≈ 4.6 GB; at 300 MB per request → 15 clients
Dynamo + LMCache: 3.37 GiB = 3.37 x 1024^3 bytes ≈ 3.6 GB; at 300 MB per request → 12 clients
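The two client counts above follow mechanically from the HBM KV-cache budget. A quick sketch of the arithmetic, assuming the deck's 300 MB-per-request figure (decimal MB) and a GiB-denominated cache size:

```python
def max_clients(kv_cache_gib: float, mb_per_request: int = 300) -> int:
    """Clients that fit in the HBM KV-cache budget at a fixed per-request cost."""
    cache_bytes = kv_cache_gib * 1024**3      # GiB -> bytes
    request_bytes = mb_per_request * 10**6    # decimal MB, as on the slide
    return int(cache_bytes // request_bytes)

print(max_clients(4.37))  # Dynamo: 15 clients
print(max_clients(3.37))  # Dynamo + LMCache: 12 clients
```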
7
llm-d Efficiency vs User Experience�benchmarks.xlsx
x2
Better
*80.00
*79.81
*78.50
*Hit-Rate
8
llm-d GPU [screenshot]
9
llm-d GPU KV-Cache Sizing & Clients Calculation
Maximum clients calculation: 22,864 tokens / 2,500 tokens ~ 8 clients
Key configuration parameters:
Max model length: --max-model-len 6000
Local CPU cache: LMCACHE_MAX_LOCAL_CPU_SIZE=256
Runtime environment: Kubernetes deployment (k3s)
The overall architecture is similar to Dynamo, but since llm-d runs on Kubernetes, deployment proved more complex, with additional overhead from the containerized environment. As a result, fewer concurrent clients could be served; performance degraded noticeably when scaling beyond the calculated threshold.
10
Dynamo vs llm-d: Efficiency and User Experience Comparison
KV cache size in HBM:
Dynamo: 35,776 tokens = 4.37 GiB
llm-d: 27,632 tokens
Dynamo + LMCache: 27,584 tokens = 3.37 GiB
llm-d + LMCache: 22,848 tokens (hit rate: 80.00%)
11
MOONCAKE: Architecture, Challenges & Solutions
A technical deep-dive into implementing a high-performance distributed caching system for Large Language Model inference optimization.
12
DRAM Usage Investigation – MOONCAKE Runtime
Problem description:
DRAM usage grew faster than expected
Inserting 2 keys consumed ~1 GB of DRAM
Retry: retried with a different, no-split script (vs. Michael's split run)
memory_per_token = 512 * 2 (K/V) * 2 (bf16) * 28 (layers) = 56 KB
memory_per_token = 1K * 2 (K/V) * 2 (bf16) * 40 (layers) = 160 KB
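The two memory-per-token lines above can be reproduced with a small helper, using the same terms as the slide: per-token dimension, a factor of 2 for K and V, 2 bytes for bf16, and the layer count:

```python
def kv_bytes_per_token(dim: int, n_layers: int, dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token: dim * 2 (K and V) * dtype size (bf16) * layers."""
    return dim * 2 * dtype_bytes * n_layers

print(kv_bytes_per_token(512, 28) // 1024)   # 56 KB
print(kv_bytes_per_token(1024, 40) // 1024)  # 160 KB
```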
13
Assumption:
An attempt was made to access the code and add logs identifying the token count and examining the stored memory.
However, vLLM v0 code wasn't included in the install, making inspection difficult.
Verified: the log is written every 10 s
Investigation steps:
Issue opened to inspect stored content (The Issue)
Current status:
The investigation was interrupted midway.
No conclusive results have been reached yet.
14
What is KVRocks?
Open-source key-value database
Fully compatible with Redis protocols
Built on RocksDB; stores data on disk (SSD) instead of RAM
Can store terabytes at low cost while maintaining high performance
Supports Cluster Mode for massive scaling
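Redis-protocol compatibility means a standard Redis client can talk to KVRocks unchanged. As a minimal sketch, this is the RESP wire format such a client emits for a command (the key and value here are illustrative, not from the project):

```python
def resp_encode(*parts) -> bytes:
    """Encode one Redis command in RESP format, e.g. SET/GET against KVRocks."""
    chunks = [b"*%d\r\n" % len(parts)]          # array header: number of parts
    for part in parts:
        raw = part if isinstance(part, bytes) else str(part).encode()
        chunks.append(b"$%d\r\n%s\r\n" % (len(raw), raw))  # bulk string
    return b"".join(chunks)

# A Redis client would send these bytes over TCP to the KVRocks port:
resp_encode("SET", "kv:block:42", "payload")
```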
15
KVRocks Configuration
dir RAID0 NVMe
workers 32
rocksdb.max_background_jobs 32
rocksdb.block_cache_size 419430
rocksdb.write_buffer_size 512
compression lz4
16
Benchmark Timeout Handling
Server didn't respond in time → client crashed with asyncio.exceptions.TimeoutError
[INFO] - Client 0 is done (num_successes=0, num_failures=1)
[INFO] - 1 out of 12 clients finished
[INFO] - Sending termination signal to all clients
Fix: return a failure record instead of crashing → the system stays stable.
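The fix can be sketched as follows (function and field names are hypothetical, not the actual benchmark code): wrap the request in a timeout and convert TimeoutError into a failure record instead of letting it propagate and kill the client.

```python
import asyncio

async def run_client(client_id: int, send_request, timeout_s: float = 30.0) -> dict:
    """Run one benchmark client; count a timeout as a failure rather than crashing."""
    try:
        await asyncio.wait_for(send_request(), timeout=timeout_s)
        return {"client": client_id, "num_successes": 1, "num_failures": 0}
    except asyncio.TimeoutError:
        # Previously this exception escaped and took the whole client down.
        return {"client": client_id, "num_successes": 0, "num_failures": 1}
```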
17
ISSUE / FIX: [screenshots]
LMCache + KVRocks vs LMCache + DRAM

| Scenario            | Chunk size | Max num batched tokens | Max model len | GPU memory utilization | KV cache in HBM (tokens) |
| DRAM, mml 6000      | 256        | 2048                   | 6000          | 0.95                   | 25,264                   |
| KVRocks, mml 6000   | 256        | 2048                   | 6000          | 0.95                   | 38,336                   |
| DRAM, chunk 1024    | 1024       | default                | 29,296        | 0.95                   | 38,336                   |
| KVRocks, chunk 1024 | 1024       | default                | 29,296        | 0.95                   | 34,272                   |
18
Pliops Connector Integration with KVRocks
19
Pliops Connector Integration with KVRocks
Using multi-op operations with KVRocks
Inefficiency: the interaction with the shared memory takes a long time when handling such a large amount of data at once.
20
Pliops Connector Integration with KVRocks
Using multi-op operations with KVRocks
Solution: read and write the data from shared memory in parallel, using smaller blocks.
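A sketch of that idea (a plain Python buffer stands in for the shared-memory region; block size and worker count are illustrative, not the connector's actual values): split the transfer into fixed-size blocks and copy them from several threads instead of one big write.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(dst: memoryview, src: bytes,
                  block_size: int = 1 << 20, workers: int = 8) -> None:
    """Copy src into dst block-by-block from a thread pool."""
    def copy_block(offset: int) -> None:
        end = min(offset + block_size, len(src))
        dst[offset:end] = src[offset:end]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block covers an independent range, so copies can run in parallel.
        list(pool.map(copy_block, range(0, len(src), block_size)))
```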
21
LMCache vs Pliops Gateway: LMCache More Efficient (Benchmark Results)
22
Dynamo 2 Nodes Results
[Chart: arrow marks the better direction]
23
KVRocks vs DRAM vs no KV-Cache Offloading
[Charts: arrows mark the better direction]
24
LLMD + KVRocks Integration Attempt (via LMCache)
NotImplementedError: memoryview: unsupported format <B
  File ".../lmcache/.../connector/redis_connector.py", line 91, in get
    view[:metadata.length] = kv_bytes
Old image: the old LLMD image ships with incompatible versions of Python, vLLM, and LMCache → this breaks the LMCache remote backend (KVRocks), even though local LMCache works.
Latest release: the LMCache library packaged inside the latest LLMD image was compiled against a different PyTorch version than the one included in the vLLM image → Python fails to load the LMCache shared object during startup.
ImportError: /opt/vllm/lib64/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so:
undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs
25
26
All - summary
Dynamo - llm-d
Pliops Connector - LMCache
Larger chunks reduce hit rate (256 → 77.7%, 1024 → 69.6%) but improve overall performance
We were not able to show the superiority of the hyperconverged KV-store over local DRAM
27
Thank You!
Confidential