lty
33b8ca4e96
[Feature] KV pool supports sparse attention (#6339)
### What this PR does / why we need it?
The KV pooling feature is adapted to sparse attention to support models
such as DeepSeek V3.2.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
```
vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
--host $local_ip \
--port 8002 \
--served-model-name model \
--data-parallel-size 1 \
--tensor-parallel-size 8 \
--prefill-context-parallel-size 2 \
--decode-context-parallel-size 1 \
--cp-kv-cache-interleave-size 128 \
--block-size 128 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--max-num-seqs 4 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--enforce-eager \
--quantization ascend \
--additional-config '{"ascend_scheduler_config": {"enabled": false}}' \
--kv-transfer-config \
'{
  "kv_connector": "AscendStoreConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "backend": "mooncake",
    "lookup_rpc_port": "0",
    "use_layerwise": false
  }
}'
```
- vLLM version: v0.14.1
- vLLM main: dc917cceb8
Signed-off-by: lty <linhebiwen@gmail.com>