[misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074)

### What this PR does / why we need it?

This miscellaneous PR contains several small fixes:

1) Fix initialization and forward bugs of DeepseekMTPLayer with `shared_expert_dp` enabled.
2) Fix a tensor shape mismatch after o_proj caused by a workaround change in NPUModelRunner.
3) Avoid an unnecessary reduction of kv_cache memory (default: 64MB) when `use_cached_kv_cache_bytes` is disabled.
4) Fall back `fused_moe_state` from `MC2` to `All2All`, since the padding logic of `mc2_mask` is incompatible with the input hidden_states when `shared_expert_dp` is enabled.

Once this PR is merged, users can launch disaggregated_prefill deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp`, as on the `v0.9.1-dev` branch. The remaining problem of the kv_cache token decline compared to `v0.9.1-dev` will be resolved by https://github.com/vllm-project/vllm-ascend/pull/3073.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

E2E vLLM serving of deepseek_mtp with torchair graph mode, and of `enable_shared_expert_dp` with eager mode. Large EP deployments were also tested with this PR.

- vLLM version: v0.10.2
- vLLM main: 5aeb925452

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-23 14:52:42 +08:00
parent 0f3939e5a9
commit d01fd1d1c3
7 changed files with 36 additions and 24 deletions
--- a/vllm_ascend/torchair/torchair_model_runner.py
+++ b/vllm_ascend/torchair/torchair_model_runner.py
@@ -87,8 +87,8 @@ class NPUTorchairModelRunner(NPUModelRunner):
    ) -> tuple[int, Optional[torch.Tensor], bool, bool]:
        """Override from NPUModelRunner to pad num_tokens"""
        if self.enable_shared_expert_dp:
-            return super()._sync_metadata_across_dp(num_tokens, with_prefill,
-                                                    enable_dbo)
+            # Padding is not required for shared_expert_dp cases in eager mode.
+            return num_tokens, None, with_prefill, enable_dbo
        if self.dp_size == 1:
            if not with_prefill:
                maybe_padded_num_tokens = self.select_torchair_padded_batch_size(