[Feature] Support moe multi-stream for aclgraph. (#2946)

This PR moves the computation of the shared experts onto a separate stream,
overlapping it with the routed experts.
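For context, this is the usual fork/join dual-stream pattern. Below is a minimal sketch of that pattern, not the actual vllm-ascend implementation; it assumes torch_npu's `torch.npu` stream API (which mirrors `torch.cuda`), and the names `router`, `routed_experts`, and `shared_expert` are illustrative placeholders.

```python
# Sketch only: shared-expert compute runs on a side stream while the
# routed experts run on the default stream, then the two are joined.
import torch
import torch_npu  # noqa: F401  (registers the torch.npu backend)

shared_expert_stream = torch.npu.Stream()

def moe_forward(hidden_states, router, routed_experts, shared_expert):
    # Fork: the side stream waits for everything produced so far on the
    # default stream, so it sees a valid hidden_states tensor.
    shared_expert_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(shared_expert_stream):
        shared_out = shared_expert(hidden_states)

    # Meanwhile, routing and routed-expert compute proceed on the
    # default stream, overlapping with the shared experts.
    topk_weights, topk_ids = router(hidden_states)
    routed_out = routed_experts(hidden_states, topk_weights, topk_ids)

    # Join: the default stream waits for the shared-expert result
    # before combining the two outputs.
    torch.npu.current_stream().wait_stream(shared_expert_stream)
    return routed_out + shared_out
```

The overlap pays off when the shared-expert GEMMs are small enough to hide behind the routed-expert dispatch and compute, which is the typical shape for DeepSeek-style MoE layers.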

- vLLM version: v0.10.2
- vLLM main: fbd6523ac0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Author: whx
Date: 2025-09-19 11:06:45 +08:00
Committed by: GitHub
Parent: 0c04bf1e36
Commit: 0a526768f5

14 changed files with 170 additions and 49 deletions


@@ -195,8 +195,8 @@ msgid ""
 msgstr "是否将MLA的向量操作放到另一个流中。此选项仅对使用MLA的模型例如DeepSeek有效。"
 #: ../../user_guide/configuration/additional_config.md
-msgid "`enable_multistream_moe`"
-msgstr "`enable_multistream_moe`"
+msgid "`multistream_overlap_shared_expert`"
+msgstr "`multistream_overlap_shared_expert`"
 #: ../../user_guide/configuration/additional_config.md
 msgid ""


@@ -35,6 +35,7 @@ The following table lists the additional configuration options available in vLLM
 | `enable_shared_expert_dp` | bool | `False` | When the shared experts run in DP, performance improves but more memory is consumed. Currently only DeepSeek series models are supported. |
 | `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
 | `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to overlap the shared experts with the routed experts in a separate stream. This option only takes effect on MoE models with shared experts. |

 The details of each config option are as follows:
@@ -45,7 +46,6 @@ The details of each config option are as follows:
 | `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE support torchair graph mode. |
 | `mode` | str | `None` | When using reduce-overhead mode for torchair, this needs to be set. |
 | `enable_multistream_mla` | bool | `False` | Whether to put the vector ops of MLA on another stream. This option only takes effect on models using MLA (e.g., DeepSeek). |
-| `enable_multistream_moe` | bool | `False` | Whether to enable the multistream shared expert. This option only takes effect on DeepSeek MoE models. |
 | `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization. |
 | `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input-address refresh time during graph execution. |
 | `use_cached_graph` | bool | `False` | Whether to use a cached graph. |
@@ -74,13 +74,13 @@ An example of additional configuration is as follows:
         "use_cached_graph": True,
         "graph_batch_sizes": [1, 2, 4, 8],
         "graph_batch_sizes_init": False,
-        "enable_multistream_moe": False,
         "enable_kv_nz": False
     },
     "ascend_scheduler_config": {
         "enabled": True,
         "enable_chunked_prefill": True,
     },
+    "multistream_overlap_shared_expert": True,
     "refresh": False,
 }
 ```
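For reference, a hedged usage sketch of enabling the renamed option from offline inference: it assumes vLLM's `additional_config` engine argument is plumbed through to vllm-ascend as the docs above describe, and the model name is illustrative only.

```python
# Sketch: pass the new top-level option via additional_config.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    additional_config={"multistream_overlap_shared_expert": True},
)
outputs = llm.generate("Hello, my name is")
```

Note that the option moved from `torchair_graph_config.enable_multistream_moe` to the top level, since it now also applies to aclgraph mode rather than torchair graph mode alone.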