Optimize the function for computing top-k and top-p in the sampler. (#970)

### What this PR does / why we need it?
Optimizes the performance of the top-k/top-p calculation logic in the sampler and of the MoE forward pass in deepseekv2.
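For context (this is not the PR's actual implementation), top-k/top-p sampling conceptually filters the token distribution as in the minimal pure-Python sketch below; the helper name `top_k_top_p_filter` is hypothetical:

```python
import math

def top_k_top_p_filter(logits, k, p):
    """Illustrative sketch: keep at most the k highest-probability tokens,
    stopping early once their cumulative softmax mass reaches p."""
    # Numerically stable softmax over the logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Token indices sorted by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order[:k]:      # top-k cut
        kept.append(i)
        mass += probs[i]
        if mass >= p:        # top-p (nucleus) cut
            break
    return kept
```

The real sampler operates on batched tensors on the NPU, where fusing or reordering these steps is what the optimization targets.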

### Does this PR introduce _any_ user-facing change?
Yes. Added the VLLM_ENABLE_TOPK_OPTIMZE config to the sampler.

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
Author: sherie
Date: 2025-06-05 16:42:18 +08:00 (committed by GitHub)
Parent: e1ab6d318e, commit: 908a851a77
9 changed files with 330 additions and 3 deletions


```diff
@@ -238,8 +238,7 @@ class CustomDeepseekV2MoE(nn.Module):
         num_tokens, hidden_size = hidden_states.shape
-        if self.n_shared_experts is not None:
-            shared_output = self.shared_experts(hidden_states)
+        old_hidden_states = hidden_states.clone()
         if self.tp_size > 1:
             if envs_ascend.VLLM_ENABLE_MC2 and not is_prefill:
@@ -288,6 +287,9 @@ class CustomDeepseekV2MoE(nn.Module):
         if num_padding_tokens > 0:
             hidden_states = hidden_states[:-num_padding_tokens]
+        if self.n_shared_experts is not None:
+            shared_output = self.shared_experts(old_hidden_states)
         if shared_output is not None:
             hidden_states = hidden_states + shared_output
```
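The deepseekv2 change moves the shared-expert computation after the routed experts, cloning the input first so the shared branch still sees the original activations. A minimal stand-in sketch of that control flow, using plain Python lists instead of tensors and hypothetical `routed_fn`/`shared_fn` callables:

```python
def moe_forward(hidden_states, routed_fn, shared_fn=None):
    """Illustrative control flow only: snapshot the input before the routed
    experts replace hidden_states, then run the shared experts on the
    snapshot and add the two outputs."""
    old_hidden_states = list(hidden_states)   # stand-in for tensor.clone()
    hidden_states = routed_fn(hidden_states)  # routed-expert output
    if shared_fn is not None:
        shared_output = shared_fn(old_hidden_states)
        hidden_states = [h + s for h, s in zip(hidden_states, shared_output)]
    return hidden_states
```

Deferring the shared-expert call lets it overlap less with the communication-heavy routed path while keeping the final sum unchanged.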