[Feat] ds3.2 support PCP (#6733)

### What this PR does / why we need it?
Adds prefill context parallel (PCP) support to the DeepSeek V3.2 (ds3.2) model adaptation.

The approach is as follows: when writing the KV cache, the KV is first all-gathered across the PCP ranks, and each rank then saves its own copy. When the attention or the indexer computes, it likewise all-gathers the KV cache before performing the computation (see the sketch below).
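
As a rough illustration only (not the PR's actual code), here is a minimal PyTorch sketch of that flow; every name in it (`gather_kv`, `save_kv_cache`, `pcp_group`, `k_local`, `v_local`, `kv_cache`) is hypothetical:

```python
# Minimal sketch of the KV all-gather flow described above. All names here
# (gather_kv, save_kv_cache, pcp_group, k_local, v_local) are hypothetical,
# not the identifiers used in this PR.
import torch
import torch.distributed as dist


def gather_kv(k_local: torch.Tensor, v_local: torch.Tensor, pcp_group):
    """All-gather per-rank KV shards and concatenate along the token dim."""
    world_size = dist.get_world_size(group=pcp_group)
    k_parts = [torch.empty_like(k_local) for _ in range(world_size)]
    v_parts = [torch.empty_like(v_local) for _ in range(world_size)]
    dist.all_gather(k_parts, k_local, group=pcp_group)
    dist.all_gather(v_parts, v_local, group=pcp_group)
    # NOTE: all_gather assumes equal shard sizes; real code must pad or use
    # a variable-length gather when ranks hold different token counts.
    return torch.cat(k_parts, dim=0), torch.cat(v_parts, dim=0)


def save_kv_cache(kv_cache: dict, k_local, v_local, pcp_group) -> None:
    # Save path: all-gather first, then each rank stores its own copy.
    kv_cache["k"], kv_cache["v"] = gather_kv(k_local, v_local, pcp_group)
```

The attention and the indexer would perform the same gather on the cached KV before computing, so every rank sees the full sequence.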

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
02/12 23:05:10 - AISBench - INFO - Running 1-th replica of evaluation
02/12 23:05:10 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]: {'accuracy': 96.35416666666667, 'type': 'GEN'}
02/12 23:05:10 - AISBench - INFO - time elapsed: 2.87s
02/12 23:05:12 - AISBench - INFO - Evaluation tasks completed.
02/12 23:05:12 - AISBench - INFO - Summarizing evaluation results...

dataset       version    metric    mode    vllm-api-general-chat
------------  ---------  --------  ------  ---------------------
gsm8kdataset  -          accuracy  gen                     96.35
```


- vLLM version: v0.15.0
- vLLM main: 9562912cea

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Author: weiguihua2
Date: 2026-02-25 09:46:57 +08:00
Committed by: GitHub
Parent: ee59429015
Commit: db51a1b9b6
4 changed files with 504 additions and 79 deletions


```diff
@@ -56,6 +56,20 @@ def test_models_pcp_dcp_basic():
         quantization="ascend",
     ) as runner:
         runner.model.generate(prompts, sampling_params)
+
+    model = "vllm-ascend/DeepSeek-V3.2-W8A8-Pruning"
+    with VllmRunner(
+        model,
+        enforce_eager=True,
+        max_model_len=1024,
+        tensor_parallel_size=2,
+        prefill_context_parallel_size=2,
+        decode_context_parallel_size=2,
+        enable_expert_parallel=True,
+        block_size=128,
+        quantization="ascend",
+    ) as runner:
+        runner.model.generate(prompts, sampling_params)
 
 
 def test_models_pcp_dcp_full_graph():
```
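
For reference, a sketch of running the same configuration outside the test harness; `VllmRunner` and its keyword arguments are taken from the diff above, while the import path, prompts, and sampling settings are assumptions:

```python
# Sketch reusing the configuration added in this diff. The import path for
# VllmRunner is an assumption; prompts and sampling_params are placeholders.
from vllm import SamplingParams

from tests.e2e.conftest import VllmRunner  # assumed location in vllm-ascend

prompts = ["The capital of France is"]
sampling_params = SamplingParams(max_tokens=32, temperature=0.0)

with VllmRunner(
    "vllm-ascend/DeepSeek-V3.2-W8A8-Pruning",
    enforce_eager=True,
    max_model_len=1024,
    tensor_parallel_size=2,
    prefill_context_parallel_size=2,  # split prefill tokens across 2 PCP ranks
    decode_context_parallel_size=2,   # split decode KV across 2 DCP ranks
    enable_expert_parallel=True,
    block_size=128,
    quantization="ascend",
) as runner:
    runner.model.generate(prompts, sampling_params)
```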