Commit Graph

2 Commits

Author SHA1 Message Date
Mengqing Cao
223cc34085 [KVCache] Refactor KVCache as page_size_bytes is ineffective (#3438)
### What this PR does / why we need it?
Refactor KVCache as page_size_bytes is ineffective.

1. Currently the `AttentionSpec` is patched, but the `page_size_bytes`
is still using that in vLLM in runtime, thus the patch is not working
actually. Thus this pr removes the patch on `AttentionSpec`, and will do
the final fix in vLLM.
2. Use `MLAAttentionSpec` instead of `FullAttentionSpec` to reduce
`page_size_bytes` of spec, so that num_blocks in spec could double

### How was this patch tested?
Test pass with Qwen3-Next and DeepSeek-V3.2-Exp

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-14 21:28:41 +08:00
lidenghui1110
0f3939e5a9 [Feature]cpu offload connector (#1659)
This PR implements cpu offload connector to enable NPU kv cache offload
to host DRAM.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
Co-authored-by: AlvisGong <gwly0401@163.com>
2025-09-23 14:25:05 +08:00