[MoE][Multistream] Avoid performing communication in extra stream. (#3582)

This PR moves the communication operation of the shared experts out of the extra
stream, because I found that issuing it there can cause rtMemcpy-related errors
when running shared experts with multistream under aclgraph.
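For context, a minimal sketch of the resulting ordering, assuming torch_npu's
CUDA-like stream API (`torch.npu.Stream` / `torch.npu.stream`) and vLLM's
`tensor_model_parallel_all_reduce`; the function and parameter names are
illustrative, not the PR's actual code:

```python
import torch
import torch_npu  # noqa: F401  # registers the NPU backend on import

from vllm.distributed import tensor_model_parallel_all_reduce


def moe_forward(hidden_states, shared_experts, routed_experts, extra_stream):
    # Hypothetical forward; only the stream split is the point here.
    # Make the extra stream wait for work already queued on the default one.
    extra_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(extra_stream):
        # Compute-only shared-expert work stays in the extra stream.
        shared_out = shared_experts(hidden_states)
    routed_out = routed_experts(hidden_states)
    # Join the streams first, then issue the collective on the default
    # stream rather than inside the extra stream.
    torch.npu.current_stream().wait_stream(extra_stream)
    shared_out = tensor_model_parallel_all_reduce(shared_out)
    return shared_out + routed_out
```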

Furthermore, I use a module-level global variable as the extra stream object,
so that a separate stream is not allocated for every layer in full-graph mode
(see the sketch below).
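A sketch of that global-stream idea under the same torch_npu assumption;
`_EXTRA_STREAM` and `get_extra_stream` are hypothetical names, not the PR's
actual identifiers:

```python
import torch
import torch_npu  # noqa: F401  # registers the NPU backend on import

# One process-wide extra stream, created lazily, instead of one
# torch.npu.Stream() allocated per MoE layer in full-graph mode.
_EXTRA_STREAM = None


def get_extra_stream() -> torch.npu.Stream:
    global _EXTRA_STREAM
    if _EXTRA_STREAM is None:
        _EXTRA_STREAM = torch.npu.Stream()
    return _EXTRA_STREAM
```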

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
Author: whx
Date: 2025-10-24 10:44:38 +08:00
Committed by: GitHub
Parent: b54d44e664
Commit: 1b270a64bd
3 changed files with 25 additions and 13 deletions


@@ -28,7 +28,7 @@ from tests.e2e.conftest import VllmRunner
from tests.e2e.model_utils import check_outputs_equal
MODELS = [
    "Qwen/Qwen3-0.6B",
    "vllm-ascend/DeepSeek-V2-Lite-W8A8",
]