[Perf] Optimize fused_experts quantization code to save npu memory (#784)

### What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost
every operator is assigned to a new variable. To save NPU memory, we
would have to manually `del` each of these variables to end its
lifetime, which clutters the code with `del` statements.
Therefore, this PR names the output of most operators `hidden_states`,
so that each assignment ends the lifetime of the previous
`hidden_states` tensor.
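
The difference between the two styles can be sketched in plain Python. This is a minimal illustration, not the actual `fused_experts` code: `Buffer` and `step` are hypothetical stand-ins for a tensor and an operator, and the behavior shown relies on CPython's reference counting, which frees an object as soon as its last reference disappears.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a tensor, used only to observe lifetime."""

    def __init__(self, name):
        self.name = name


def step(buf, name):
    """Hypothetical operator: returns a new Buffer derived from the input."""
    return Buffer(name)


def with_new_names(x):
    # Each output gets its own name, so the intermediate stays alive
    # until it is explicitly deleted or the function returns.
    a = step(x, "a")
    probe = weakref.ref(a)
    b = step(a, "b")
    alive_before_del = probe() is not None
    del a  # manual cleanup is needed to release the intermediate early
    alive_after_del = probe() is not None
    return b, alive_before_del, alive_after_del


def with_rebinding(x):
    # Reusing one name drops the last reference to the previous value
    # as soon as the assignment completes; no `del` is required.
    hidden_states = step(x, "a")
    probe = weakref.ref(hidden_states)
    hidden_states = step(hidden_states, "b")
    alive_after_rebind = probe() is not None
    return hidden_states, alive_after_rebind
```

Note that the old and new values still coexist while the operator itself runs; rebinding only releases the old one once the assignment finishes, and only if no other reference (for example, one held by the caller) keeps it alive.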

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
ApsarasX
2025-05-09 15:09:37 +08:00
committed by GitHub
parent 2c685e3b61
commit 324f819b92
2 changed files with 50 additions and 41 deletions

@@ -222,9 +222,6 @@ class CustomDeepseekV2MoE(nn.Module):
         num_tokens, hidden_dim = hidden_states.shape
         hidden_states = hidden_states.view(-1, hidden_dim)
-        if self.n_shared_experts is not None:
-            shared_output = self.shared_experts(hidden_states)
         if (self.tp_size > 1 and self.enable_mc2 and not is_prefill):
             chunks = torch.chunk(hidden_states,
                                  get_tp_group().world_size,
@@ -248,8 +245,8 @@ class CustomDeepseekV2MoE(nn.Module):
             else:
                 final_hidden_states = tensor_model_parallel_all_reduce(
                     final_hidden_states)
-        if shared_output is not None:
+        if self.n_shared_experts is not None:
+            shared_output = self.shared_experts(hidden_states)
             final_hidden_states = final_hidden_states + shared_output
         return final_hidden_states.view(num_tokens, hidden_dim)