[Perf] Optimize fused_experts quantization code to save npu memory (#784)

### What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost every operator is assigned to a new variable name. To save NPU memory, we would have to manually `del` these variables to end their lifetimes, which clutters the code with `del` statements. Therefore, this PR names the output of most operators `hidden_states`, so that each reassignment ends the lifetime of the previous `hidden_states`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
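
The rebinding idea described above can be illustrated with a minimal sketch. It is not the actual `fused_experts` code: `FakeTensor` and `op` are hypothetical stand-ins for NPU tensors and operators, and the behavior shown relies on CPython reference counting. Whether the freed buffer is actually returned to the device depends on the framework's allocator, which is not modeled here.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for an NPU tensor (not a real vllm-ascend type)."""

    def __init__(self, name):
        self.name = name


def op(x, name):
    """Hypothetical operator: consumes x and returns a freshly allocated output."""
    return FakeTensor(name)


def distinct_names(x):
    """Old style: every operator output gets its own variable name."""
    a = op(x, "a")
    ref_a = weakref.ref(a)  # lets us observe whether 'a' is still alive
    b = op(a, "b")
    # 'a' is still bound to a local name, so it stays alive unless we `del a`.
    return b, ref_a() is not None


def rebinding(hidden_states):
    """New style: every operator output is bound to the same name."""
    hidden_states = op(hidden_states, "a")
    ref_a = weakref.ref(hidden_states)
    hidden_states = op(hidden_states, "b")
    # Rebinding dropped the last reference to the intermediate "a",
    # so CPython reclaimed it without an explicit `del`.
    return hidden_states, ref_a() is not None


_, a_alive_old = distinct_names(FakeTensor("x"))
_, a_alive_new = rebinding(FakeTensor("x"))
print(a_alive_old, a_alive_new)
```

Note that rebinding only releases an intermediate when no other reference to it exists; a tensor still held by the caller or by another variable stays alive either way.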
2025-05-09 15:09:37 +08:00
parent 2c685e3b61
commit 324f819b92
2 changed files with 50 additions and 41 deletions
--- a/vllm_ascend/models/deepseek_v2.py
+++ b/vllm_ascend/models/deepseek_v2.py
@@ -222,9 +222,6 @@ class CustomDeepseekV2MoE(nn.Module):
        num_tokens, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)

-        if self.n_shared_experts is not None:
-            shared_output = self.shared_experts(hidden_states)
-
        if (self.tp_size > 1 and self.enable_mc2 and not is_prefill):
            chunks = torch.chunk(hidden_states,
                                 get_tp_group().world_size,
@@ -248,8 +245,8 @@ class CustomDeepseekV2MoE(nn.Module):
            else:
                final_hidden_states = tensor_model_parallel_all_reduce(
                    final_hidden_states)
-
-        if shared_output is not None:
+        if self.n_shared_experts is not None:
+            shared_output = self.shared_experts(hidden_states)
            final_hidden_states = final_hidden_states + shared_output

        return final_hidden_states.view(num_tokens, hidden_dim)