[Feat] shared expert dp for deepseek_mtp (#3811)

### What this PR does / why we need it?
Support shared expert DP for deepseek_mtp feature. 
`shared_expert_dp` requires `SP==True`, with corresponding parameter
restrictions.
Previously, due to the coupling between `shared_expert_dp` and torchair,
and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of
deepseek_mtp was temporarily removed.
Currently, by performing the `reduce_scatter` on the input of
deepssek_mtp in `mtp_proposer.py`, we ensure that it matches the
dimensions of `input_embedding`, and then perform the `all_gather` on
the output of mtp.

### How was this patch tested?
baseline:
<img width="1184" height="692" alt="image"
src="https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3"
/>

enable shared_expert_dp and multistream_overlap_shared_expert:
<img width="1167" height="687" alt="image"
src="https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc"
/>

TPOT: 48ms -> 45.4ms
Average TPS per rank: 117.6 -> 126.1


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
Signed-off-by: zengran <zengran2@huawei.com>
Co-authored-by: zengran <zengran2@huawei.com>
This commit is contained in:
MengLong Chen
2025-12-01 20:44:11 +08:00
committed by GitHub
parent 27b09ca9b9
commit 143e1f46d0
9 changed files with 185 additions and 17 deletions

View File

@@ -283,7 +283,7 @@ class NPUPlatform(Platform):
if parallel_config and parallel_config.worker_cls == "auto":
# TODO: this is a tricky way to disable `use_sequence_parallel_moe` in vllm.
parallel_config.all2all_backend = "flashinfer_all2allv"
if ascend_config.torchair_graph_config.enabled or ascend_config.enable_shared_expert_dp:
if ascend_config.torchair_graph_config.enabled:
parallel_config.worker_cls = "vllm_ascend.torchair.torchair_worker.NPUTorchairWorker"
else:
parallel_config.worker_cls = "vllm_ascend.worker.worker_v1.NPUWorker"
@@ -379,8 +379,6 @@ class NPUPlatform(Platform):
ascend_config = get_ascend_config()
if use_mla and ascend_config.enable_shared_expert_dp:
if use_mla and not use_sparse:
return "vllm_ascend.torchair.torchair_mla.AscendMLATorchairBackend"
if use_mla and use_sparse:
return "vllm_ascend.torchair.torchair_sfa.AscendSFATorchairBackend"