Optimize the function of computing top-k and top-p in the sampler. (#970)
### What this PR does / why we need it?
Optimize the performance of the calculation logic in the sampler and deepseekv2.

### Does this PR introduce _any_ user-facing change?
Added the `VLLM_ENABLE_TOPK_OPTIMZE` config in the sampler.

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
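The top-k/top-p optimization the description refers to is the usual trick of filtering the logits once, in sorted order, rather than running separate full passes for each constraint. The sketch below is illustrative only and not the PR's actual implementation; `apply_top_k_top_p` is a hypothetical helper name, and NumPy stands in for the Torch/NPU kernels the real sampler uses.

```python
import numpy as np

def apply_top_k_top_p(logits: np.ndarray, k: int, p: float) -> np.ndarray:
    """Hypothetical sketch of combined top-k/top-p logit filtering.

    Sort once (descending), then keep at most k tokens and at most the
    smallest prefix whose cumulative probability mass reaches p; every
    other logit is masked to -inf so it can never be sampled.
    """
    order = np.argsort(logits)[::-1]                    # descending order
    sorted_logits = logits[order]
    probs = np.exp(sorted_logits - sorted_logits.max())  # stable softmax
    probs /= probs.sum()
    # smallest prefix reaching mass p, capped by k
    cutoff = min(k, int(np.searchsorted(np.cumsum(probs), p) + 1))
    masked = np.full_like(logits, -np.inf)
    masked[order[:cutoff]] = logits[order[:cutoff]]
    return masked
```

Doing the k- and p-filtering on one shared sort is what makes this cheaper than two independent passes over the vocabulary.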
```diff
@@ -238,8 +238,7 @@ class CustomDeepseekV2MoE(nn.Module):
        num_tokens, hidden_size = hidden_states.shape
        if self.n_shared_experts is not None:
            shared_output = self.shared_experts(hidden_states)
        old_hidden_states = hidden_states.clone()
        if self.tp_size > 1:
            if envs_ascend.VLLM_ENABLE_MC2 and not is_prefill:
@@ -288,6 +287,9 @@ class CustomDeepseekV2MoE(nn.Module):
        if num_padding_tokens > 0:
            hidden_states = hidden_states[:-num_padding_tokens]
        if self.n_shared_experts is not None:
            shared_output = self.shared_experts(old_hidden_states)
        if shared_output is not None:
            hidden_states = hidden_states + shared_output
```
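The hunks above rearrange when the shared experts run relative to the routed-expert path in `CustomDeepseekV2MoE`. The toy model below sketches the underlying pattern: if the shared-expert output is computed from the original activations before the routed path overwrites `hidden_states`, no defensive `.clone()` of the activations is needed. `TinyMoE` and its weights are illustrative stand-ins, not the real DeepSeek-V2 layer.

```python
import numpy as np

class TinyMoE:
    """Toy stand-in for a MoE block with shared + routed experts."""

    def __init__(self, hidden_size: int):
        rng = np.random.default_rng(0)
        # single matrices stand in for the shared and routed expert stacks
        self.w_shared = rng.standard_normal((hidden_size, hidden_size))
        self.w_routed = rng.standard_normal((hidden_size, hidden_size))

    def forward(self, hidden_states: np.ndarray) -> np.ndarray:
        # Compute the shared-expert output from the *original* activations
        # first; after this point hidden_states can be overwritten freely,
        # so no copy of the input tensor is ever made.
        shared_output = hidden_states @ self.w_shared
        hidden_states = hidden_states @ self.w_routed  # routed experts (sketch)
        return hidden_states + shared_output
```

Ordering the two reads this way trades a tensor copy (`num_tokens x hidden_size`) for nothing, which is where the memory/latency win comes from.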