lty
33b8ca4e96
[Feature] KV pool supports sparse attention (#6339)
### What this PR does / why we need it?
This PR adapts the KV pooling feature to sparse attention so that it
supports models such as DeepSeek V3.2.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
--host $local_ip \
--port 8002 \
--served-model-name model \
--data-parallel-size 1 \
--tensor-parallel-size 8 \
--prefill-context-parallel-size 2 \
--decode-context-parallel-size 1 \
--cp-kv-cache-interleave-size 128 \
--block-size 128 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--max-num-seqs 4 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--enforce-eager \
--quantization ascend \
--additional_config '{"ascend_scheduler_config":{"enabled":false}}' \
--kv-transfer-config \
'{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "backend": "mooncake",
        "lookup_rpc_port": "0",
        "use_layerwise": false
    }
}'
```
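The inline JSON passed to `--kv-transfer-config` is easy to get wrong when quoting it by hand. As a minimal sketch, the same value can be generated from a Python dict and shell-quoted. The helper name `build_kv_transfer_config` is hypothetical and not part of vLLM or this PR; the keys and values are copied from the command above.

```python
import json
import shlex


def build_kv_transfer_config(backend="mooncake", lookup_rpc_port="0", use_layerwise=False):
    """Hypothetical helper: returns the connector settings used in the test command."""
    return {
        "kv_connector": "AscendStoreConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "backend": backend,
            # The test command passes the port as a string, so keep it a string here.
            "lookup_rpc_port": lookup_rpc_port,
            "use_layerwise": use_layerwise,
        },
    }


config = build_kv_transfer_config()
# json.dumps renders False as JSON `false`; shlex.quote makes the result safe to paste into a shell.
kv_transfer_arg = shlex.quote(json.dumps(config))
print(f"--kv-transfer-config {kv_transfer_arg}")
```

The printed argument can be appended to the `vllm serve` invocation in place of the hand-written JSON.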
- vLLM version: v0.14.1
- vLLM main: dc917cceb8
Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-05 10:36:52 +08:00