[Feat]ds3.2 support pcp (#6733)

### What this PR does / why we need it?
The ds3.2 model adaptation supports the PCP feature.

The solution is as follows: When saving the KV cache, first perform an
allgather operation on the KVs, and then each node saves its own copy.
When the attention or indexer performs calculations, they all gather the
KV cache and then perform the calculations.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
02/12 23:05:10 - AISBench - INFO - Running 1-th replica of evaluation
02/12 23:05:10 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]:
{'accuracy': 96.35416666666667, 'type': 'GEN'}
02/12 23:05:10 - AISBench - INFO - time elapsed: 2.87s
02/12 23:05:12 - AISBench - INFO - Evaluation tasks completed.
02/12 23:05:12 - AISBench - INFO - Summarizing evaluation results...
dataset       version    metric    mode      vllm-api-general-chat
gsm8kdataset  -          accuracy  gen                       96.35


- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
This commit is contained in:
weiguihua2
2026-02-25 09:46:57 +08:00
committed by GitHub
parent ee59429015
commit db51a1b9b6
4 changed files with 504 additions and 79 deletions

View File

@@ -26,6 +26,9 @@ class AscendPCPMetadata:
tail_attn_nomask_seqlens: torch.Tensor = None
q_full_idx: torch.Tensor = None
pcp_allgather_restore_idx: list[int] | None = None
block_table_cp: torch.Tensor = None
valid_block_ids: torch.Tensor = None
prefill_q_cum_seqlens: torch.Tensor = None
@dataclass