[Perf][MoE] Improve MoE multistream parallel performace. (#1891)
This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.
- vLLM version: v0.10.0
- vLLM main:
2cc571199b
Signed-off-by: whx-sjtu <2952154980@qq.com>
This commit is contained in:
@@ -393,7 +393,7 @@ class CustomDeepseekV2MoE(nn.Module):
|
||||
|
||||
# router_logits: (num_tokens, n_experts)
|
||||
router_logits = None
|
||||
if not self.rm_router_logits:
|
||||
if not self.rm_router_logits and not self.enable_multistream_moe:
|
||||
router_logits, _ = self.gate(hidden_states)
|
||||
|
||||
experts_hidden_states = self.experts(
|
||||
|
||||
Reference in New Issue
Block a user