Optimize the function of computing top-k and top-p in the sampler. (#970)
### What this PR does / why we need it?
Optimize the performance of the calculation logic in the sampler and deepseekv2.

### Does this PR introduce _any_ user-facing change?
Added the `VLLM_ENABLE_TOPK_OPTIMZE` config in the sampler.

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
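The top-k/top-p optimization the description refers to is the usual trick of filtering the logits once, in sorted order, rather than running separate full passes for each constraint. The sketch below is illustrative only and not the PR's actual implementation; `apply_top_k_top_p` is a hypothetical helper name, and NumPy stands in for the Torch/NPU kernels the real sampler uses.

```python
import numpy as np

def apply_top_k_top_p(logits: np.ndarray, k: int, p: float) -> np.ndarray:
    """Hypothetical sketch of combined top-k/top-p logit filtering.

    Sort once (descending), then keep at most k tokens and at most the
    smallest prefix whose cumulative probability mass reaches p; every
    other logit is masked to -inf so it can never be sampled.
    """
    order = np.argsort(logits)[::-1]                    # descending order
    sorted_logits = logits[order]
    probs = np.exp(sorted_logits - sorted_logits.max())  # stable softmax
    probs /= probs.sum()
    # smallest prefix reaching mass p, capped by k
    cutoff = min(k, int(np.searchsorted(np.cumsum(probs), p) + 1))
    masked = np.full_like(logits, -np.inf)
    masked[order[:cutoff]] = logits[order[:cutoff]]
    return masked
```

Doing the k- and p-filtering on one shared sort is what makes this cheaper than two independent passes over the vocabulary.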
```diff
@@ -238,8 +238,7 @@ class CustomDeepseekV2MoE(nn.Module):
        num_tokens, hidden_size = hidden_states.shape
        if self.n_shared_experts is not None:
            shared_output = self.shared_experts(hidden_states)
        old_hidden_states = hidden_states.clone()
        if self.tp_size > 1:
            if envs_ascend.VLLM_ENABLE_MC2 and not is_prefill:
@@ -288,6 +287,9 @@ class CustomDeepseekV2MoE(nn.Module):
        if num_padding_tokens > 0:
            hidden_states = hidden_states[:-num_padding_tokens]
        if self.n_shared_experts is not None:
            shared_output = self.shared_experts(old_hidden_states)
        if shared_output is not None:
            hidden_states = hidden_states + shared_output
```
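The hunks above rearrange when the shared experts run relative to the routed-expert path in `CustomDeepseekV2MoE`. The toy model below sketches the underlying pattern: if the shared-expert output is computed from the original activations before the routed path overwrites `hidden_states`, no defensive `.clone()` of the activations is needed. `TinyMoE` and its weights are illustrative stand-ins, not the real DeepSeek-V2 layer.

```python
import numpy as np

class TinyMoE:
    """Toy stand-in for a MoE block with shared + routed experts."""

    def __init__(self, hidden_size: int):
        rng = np.random.default_rng(0)
        # single matrices stand in for the shared and routed expert stacks
        self.w_shared = rng.standard_normal((hidden_size, hidden_size))
        self.w_routed = rng.standard_normal((hidden_size, hidden_size))

    def forward(self, hidden_states: np.ndarray) -> np.ndarray:
        # Compute the shared-expert output from the *original* activations
        # first; after this point hidden_states can be overwritten freely,
        # so no copy of the input tensor is ever made.
        shared_output = hidden_states @ self.w_shared
        hidden_states = hidden_states @ self.w_routed  # routed experts (sketch)
        return hidden_states + shared_output
```

Ordering the two reads this way trades a tensor copy (`num_tokens x hidden_size`) for nothing, which is where the memory/latency win comes from.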