xc-llm-ascend

Files

linfeng-yuan d01fd1d1c3 [misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074 )

### What this PR does / why we need it?
This miscellaneous contains several small fixes:
1) fix initialization and forward bugs of DeepseekMTPLayer with
`shared_expert_dp` enabled.
2) fix a tensor shape mismatches after o_proj caused by a work-aroud
change in NPUModelRunner.
3) avoid unnecessary decline of kv_cache memory (default: 64MB) with
`use_cached_kv_cache_bytes` disabled.
4) fall back `fused_moe_state` from `MC2` to `All2All` since the padding
logic of `mc2_mask` is incompatible with input hidden_states when
`shared_expert_dp` enabled.

Once this PR is merged, users can launch disaggregated_prefill
deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as
`v0.9.1-dev` branch. The remaining problem of kv_cache tokens decline
compared to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
 
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?
E2E vllm serving about deepseek_mtp with torchair graph mode and
`enable_shared_expert_dp` with eager mode. Large ep deployments are also
tested with this PR.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>

2025-09-23 14:52:42 +08:00

__init__.py

[Refactor] Refactor Spec Decode (#2668 )

2025-09-04 11:34:47 +08:00

eagle_proposer.py

[Refactor] Adjustments to moe_comm_method selection process (#3001 )

2025-09-22 19:12:58 +08:00

interface.py

[Refactor] Refactor Spec Decode (#2668 )

2025-09-04 11:34:47 +08:00

mtp_proposer.py

[misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074 )

2025-09-23 14:52:42 +08:00

ngram_proposer.py

[Refactor] Refactor Spec Decode (#2668 )

2025-09-04 11:34:47 +08:00