[Perf] Optimize fused_experts quantization code to save npu memory (#784)
### What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost every operator is assigned to a new variable name. To save NPU memory, each of these variables has to be released manually with `del`, which fills the code with `del` statements and looks inelegant. Therefore, this PR names the output of most operators `hidden_states`, so that each assignment ends the lifecycle of the previous `hidden_states`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
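The renaming idea described in the commit message can be illustrated without an NPU. The sketch below is a minimal, hypothetical illustration: `FakeTensor`, `op`, and the intermediate names are stand-ins invented here and are not part of `fused_experts`. It relies on CPython's reference counting, where an object is freed as soon as its last reference goes away.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a device tensor; only its lifetime matters."""

    def __init__(self, name):
        self.name = name


def op(tensor, name):
    """Hypothetical operator: takes one tensor and returns a new one."""
    return FakeTensor(name)


def count_alive(refs):
    """Count weak references whose target has not been freed yet."""
    return sum(1 for ref in refs if ref() is not None)


def distinct_names(hidden_states):
    """Each output gets its own name, so every intermediate stays alive."""
    refs = []
    gate_up_out = op(hidden_states, "gate_up")
    refs.append(weakref.ref(gate_up_out))
    act_out = op(gate_up_out, "activation")
    refs.append(weakref.ref(act_out))
    down_out = op(act_out, "down")
    refs.append(weakref.ref(down_out))
    return count_alive(refs)


def reused_name(hidden_states):
    """Each output rebinds `hidden_states`, dropping the previous intermediate."""
    refs = []
    hidden_states = op(hidden_states, "gate_up")
    refs.append(weakref.ref(hidden_states))
    hidden_states = op(hidden_states, "activation")
    refs.append(weakref.ref(hidden_states))
    hidden_states = op(hidden_states, "down")
    refs.append(weakref.ref(hidden_states))
    return count_alive(refs)


print(distinct_names(FakeTensor("input")))  # 3 on CPython
print(reused_name(FakeTensor("input")))     # 1 on CPython
```

Rebinding only frees the previous object when nothing else still refers to it. The caller's input, for example, stays alive as long as the caller holds it.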
@@ -222,9 +222,6 @@ class CustomDeepseekV2MoE(nn.Module):
         num_tokens, hidden_dim = hidden_states.shape
         hidden_states = hidden_states.view(-1, hidden_dim)
 
-        if self.n_shared_experts is not None:
-            shared_output = self.shared_experts(hidden_states)
-
         if (self.tp_size > 1 and self.enable_mc2 and not is_prefill):
             chunks = torch.chunk(hidden_states,
                                  get_tp_group().world_size,
@@ -248,8 +245,8 @@ class CustomDeepseekV2MoE(nn.Module):
             else:
                 final_hidden_states = tensor_model_parallel_all_reduce(
                     final_hidden_states)
 
-        if shared_output is not None:
+        if self.n_shared_experts is not None:
+            shared_output = self.shared_experts(hidden_states)
             final_hidden_states = final_hidden_states + shared_output
 
         return final_hidden_states.view(num_tokens, hidden_dim)
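The commit message does not discuss this hunk. One plausible effect, stated here as an assumption rather than the author's documented intent, is that computing `shared_output` right before it is added keeps that tensor from staying alive while the routed experts run. The sketch below counts live objects at that point. `FakeTensor`, `make`, `routed_experts`, `forward_early`, and `forward_late` are hypothetical names, and the count relies on CPython's reference counting.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a device tensor; only its lifetime matters."""

    def __init__(self, name):
        self.name = name


# Holds weak references to every FakeTensor that is still alive.
live = weakref.WeakSet()


def make(name):
    """Create a tracked FakeTensor."""
    tensor = FakeTensor(name)
    live.add(tensor)
    return tensor


def routed_experts(hidden_states):
    """Hypothetical routed-expert step; reports how many tensors are alive."""
    out = make("routed")
    return out, len(live)


def forward_early(hidden_states):
    """Shared output computed up front, as before the change."""
    shared_output = make("shared")
    final, alive = routed_experts(hidden_states)
    final = make("sum")  # stands in for final + shared_output
    return alive


def forward_late(hidden_states):
    """Shared output computed just before it is added, as after the change."""
    final, alive = routed_experts(hidden_states)
    shared_output = make("shared")
    final = make("sum")  # stands in for final + shared_output
    return alive


print(forward_early(make("input")))  # 3 on CPython: input, shared, routed
print(forward_late(make("input")))   # 2 on CPython: input, routed
```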