xc-llm-ascend

Files

zzhxxx 64d29875f9 [Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698 )

### What this PR does / why we need it?
Based on the Sharded-CP feature
PR:https://github.com/vllm-project/vllm-ascend/pull/4702;
RFC:https://github.com/vllm-project/vllm/issues/30055

This PR officially integrates Deepseek V3.2's DSA-CP support on the
basis of https://github.com/vllm-project/vllm-ascend/pull/4702,
improving inference efficiency and scalability under mixed
prefill-decode workloads. The main improvements include:
- Replace the implementations of o_proj, q_b_proj, and kv_b_proj with
custom_op for TP=1.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>

2026-01-09 15:58:40 +08:00

context_parallel

[feature]dcp&pcp support mlapo (#5672 )

2026-01-08 23:49:23 +08:00

__init__.py

[Core] Make V1 work and enable V1 engine test (#389 )

2025-03-28 19:34:23 +08:00

attention_mask.py

[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 )

2026-01-07 17:09:52 +08:00

attention_v1.py

[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 )