SGLang HiCache extends the traditional RadixAttention with a three-tier hierarchical KV caching system that dramatically improves performance for long-context and multi-turn conversation scenarios. By intelligently managing KV caches across GPU memory, host memory, and external storage backends, HiCache addresses the fundamental capacity bottleneck that limits cache hit rates in conventional systems.
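To make the three-tier idea concrete, here is a toy sketch of a lookup that falls through GPU memory, host memory, and external storage in that order. It is only an illustration under simplifying assumptions: the `ThreeTierCache` class, its dictionaries, and the promote-on-hit behavior are hypothetical and do not reflect HiCache's actual data structures or eviction logic.

```python
from typing import Dict, Optional, Tuple


class ThreeTierCache:
    """Toy model: plain dicts stand in for GPU memory, host memory, and external storage."""

    def __init__(self) -> None:
        self.gpu: Dict[str, bytes] = {}
        self.host: Dict[str, bytes] = {}
        self.storage: Dict[str, bytes] = {}

    def lookup(self, key: str) -> Tuple[Optional[str], Optional[bytes]]:
        """Return (tier, value) for the fastest tier holding `key`.

        Hits in a slower tier are copied into the faster tiers (assumed behavior).
        """
        if key in self.gpu:
            return "gpu", self.gpu[key]
        if key in self.host:
            self.gpu[key] = self.host[key]
            return "host", self.gpu[key]
        if key in self.storage:
            value = self.storage[key]
            self.host[key] = value
            self.gpu[key] = value
            return "storage", value
        return None, None


cache = ThreeTierCache()
cache.storage["prefix-a"] = b"kv"
first = cache.lookup("prefix-a")   # served from external storage, then promoted
second = cache.lookup("prefix-a")  # now served from the GPU tier
```

The point of the sketch is that a prefix evicted from GPU memory can still produce a cache hit as long as a lower tier holds it, which is where the extra capacity pays off.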
## Configuration Guidelines
### Core HiCache Parameters
```bash
# Essential HiCache flags
--page-size 64 # Page size for cache management
--enable-hierarchical-cache # Enable HiCache
--hicache-ratio 2 # Host memory ratio (2x GPU memory)
--hicache-size 100 # Host memory size in GB; overrides --hicache-ratio when set
--hicache-io-backend kernel # The I/O backend of moving data between CPU and GPU
--hicache-write-policy write_through # Cache write policy from GPU to CPU
# Storage prefetch policy. timeout: balance between completion and best-effort
--hicache-storage-prefetch-policy timeout
```
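The interaction between `--hicache-ratio` and `--hicache-size` can be sketched as a small calculation. This is a minimal illustration, not SGLang's implementation: the function name is hypothetical, and it assumes that the ratio is applied to the size of the GPU KV cache pool and that a size of `0` means "not set".

```python
def host_pool_size_gb(
    gpu_kv_pool_gb: float,
    hicache_ratio: float = 2.0,
    hicache_size_gb: float = 0.0,
) -> float:
    """Return the host KV pool size in GB implied by the two flags.

    Assumption: an explicit size (> 0) takes precedence over the ratio.
    """
    if hicache_size_gb > 0:
        return float(hicache_size_gb)
    return float(hicache_ratio) * float(gpu_kv_pool_gb)


# With a 40 GB GPU KV pool and --hicache-ratio 2, the host pool is 80 GB.
ratio_only = host_pool_size_gb(40.0, hicache_ratio=2.0)

# Adding --hicache-size 100 overrides the ratio, giving 100 GB.
with_override = host_pool_size_gb(40.0, hicache_ratio=2.0, hicache_size_gb=100.0)
```

In practice this means you only need one of the two flags: use the ratio when you want the host pool to scale with the GPU pool, and the explicit size when the host has a fixed memory budget.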
### Integration with PD Disaggregation
HiCache works seamlessly with PD Disaggregation. You can choose between two configurations:
1. **Prefill-only HiCache**: Enable HiCache only on Prefill nodes, allowing KV cache sharing among Prefill instances.
2. **Full HiCache with async offloading**: Enable HiCache on Prefill nodes and async KV cache offloading on Decode nodes, allowing Prefill nodes to reuse KV caches from Decode nodes in multi-turn dialogue scenarios.
```bash
# Prefill node with HiCache enabled for cross-prefill sharing (ideal for shared system prompt scenarios)
```
Here is an example of deploying DeepSeek-R1 with HiCache-HF3FS. For more details, see the [HF3FS Documentation](../../python/sglang/srt/mem_cache/storage/hf3fs/docs/README.md).

Here is an example of deploying Qwen3-235B-A22B-Instruct-2507 with Mooncake. For more details, see the [Mooncake Documentation](../../python/sglang/srt/mem_cache/storage/mooncake_store/README.md).
2. **Register your backend:** Add your storage backend to the HiCache [BackendFactory](../../python/sglang/srt/mem_cache/storage/backend_factory.py#L188).
The HiCache controller handles all scheduling and synchronization automatically.
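The registration step follows a common name-to-factory registry pattern, which can be sketched as below. Everything here is hypothetical and self-contained: `InMemoryBackend`, `register_backend`, and `create_backend` are illustrative names, not the actual `BackendFactory` API, so consult the linked source file for the real interface.

```python
from typing import Callable, Dict, Optional


class InMemoryBackend:
    """Hypothetical stand-in for a custom storage backend."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> bool:
        self._store[key] = value
        return True

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)


# Maps a backend name to a zero-argument callable that builds the backend.
_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_backend(name: str, factory: Callable[[], object]) -> None:
    """Register a factory under `name`, rejecting duplicate names."""
    if name in _REGISTRY:
        raise ValueError(f"backend {name!r} is already registered")
    _REGISTRY[name] = factory


def create_backend(name: str) -> object:
    """Build a new backend instance by its registered name."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown backend {name!r}")
    return _REGISTRY[name]()


register_backend("in_memory", InMemoryBackend)
backend = create_backend("in_memory")
backend.set("page-0", b"kv-bytes")
```

Once a backend is registered under a name, the rest of the system can select it by that name alone, without importing the backend class directly.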
### Dynamic Backend Loading
Alternatively, you can use dynamic loading to avoid hard-coding your backend in the repository:
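The general mechanism behind this kind of dynamic loading can be sketched with Python's standard `importlib`. This is an illustration of the technique only, not SGLang's loader: the `load_backend_class` helper is hypothetical, and a standard-library class stands in for a real backend so the example runs on its own.

```python
import importlib


def load_backend_class(module_path: str, class_name: str) -> type:
    """Import `module_path` at runtime and return the class named `class_name`."""
    module = importlib.import_module(module_path)
    cls = getattr(module, class_name)
    if not isinstance(cls, type):
        raise TypeError(f"{module_path}.{class_name} is not a class")
    return cls


# A standard-library class stands in for a custom backend class here.
backend_cls = load_backend_class("collections", "OrderedDict")
instance = backend_cls()
```

Because the module path and class name are plain strings, they can come from configuration at launch time, which is what lets a backend live outside the repository.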