[Feature] Use DispatchGmmCombineDecode operator to replace MC2 (Optional) (#5040)

### What this PR does / why we need it?

This PR adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding.

The operator implementation itself was added in a prior PR, [#4139](https://github.com/vllm-project/vllm-ascend/pull/4139). This change only adapts the model execution path to optionally use the fused operator.

When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) might be replaced by the single fused operator DispatchGmmCombineDecode.

By default, the existing multi-operator MC2 implementation is preserved.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
2025-12-21 15:23:59 +08:00
parent 67a0325cf2
commit 904c18f929
6 changed files with 51 additions and 9 deletions
--- a/vllm_ascend/envs.py
+++ b/vllm_ascend/envs.py
@@ -135,7 +135,13 @@ env_variables: Dict[str, Callable[[], Any]] = {
    # Whether to anbale dynamic EPLB
    "DYNAMIC_EPLB":
    lambda: os.getenv("DYNAMIC_EPLB", "false").lower(),
-    # Whether to anbale fused mc2(dispatch_gmm_combine_decode/dispatch_ffn_combine operator)
+    # Whether to enable fused mc2(`dispatch_gmm_combine_decode`/`dispatch_ffn_combine` operator)
+    # 0, or not set: default ALLTOALL and MC2 will be used.
+    # 1: ALLTOALL and MC2 might be replaced by `dispatch_ffn_combine` operator.
+    # `dispatch_ffn_combine` can be used only for moe layer with W8A8, EP<=16, non-mtp, non-dynamic-eplb.
+    # 2: MC2 might be replaced by `dispatch_gmm_combine_decode` operator.
+    # `dispatch_gmm_combine_decode` can be used only for **decode node** moe layer
+    # with W8A8, non-dynamic-eplb. And MTP layer must be W8A8.
    "VLLM_ASCEND_ENABLE_FUSED_MC2":
    lambda: int(os.getenv("VLLM_ASCEND_ENABLE_FUSED_MC2", '0')),
 }
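
The comments added in this hunk describe three values for `VLLM_ASCEND_ENABLE_FUSED_MC2` and the conditions under which value 2 may select `dispatch_gmm_combine_decode`. The following is a minimal sketch of that decision logic, not the code from this PR: the names `FusedMC2Mode`, `read_fused_mc2_mode`, `may_use_gmm_combine_decode`, and all of their parameters are hypothetical, and the real selection in vllm-ascend may check additional conditions.

```python
import os
from enum import IntEnum


class FusedMC2Mode(IntEnum):
    """Hypothetical names for the three values documented in envs.py."""
    DISABLED = 0                     # default ALLTOALL and MC2 paths
    DISPATCH_FFN_COMBINE = 1         # may replace ALLTOALL and MC2
    DISPATCH_GMM_COMBINE_DECODE = 2  # may replace MC2 on decode nodes


def read_fused_mc2_mode(environ=os.environ) -> FusedMC2Mode:
    # Same default as the lambda in envs.py: an unset variable means 0.
    # Unlike the plain int() in envs.py, this sketch rejects values
    # other than 0, 1, and 2 by raising ValueError.
    raw = environ.get("VLLM_ASCEND_ENABLE_FUSED_MC2", "0")
    return FusedMC2Mode(int(raw))


def may_use_gmm_combine_decode(
    mode: FusedMC2Mode,
    *,
    is_decode_node: bool,     # assumption: caller knows the node role
    moe_is_w8a8: bool,        # assumption: MoE layer quantization flag
    mtp_is_w8a8: bool,        # assumption: MTP layer quantization flag
    dynamic_eplb: bool,       # assumption: dynamic EPLB already parsed
) -> bool:
    # Encodes only the constraints stated in the envs.py comment:
    # decode node, W8A8 MoE layer, W8A8 MTP layer, no dynamic EPLB.
    return (
        mode == FusedMC2Mode.DISPATCH_GMM_COMBINE_DECODE
        and is_decode_node
        and moe_is_w8a8
        and mtp_is_w8a8
        and not dynamic_eplb
    )
```

Passing the environment as a mapping keeps the sketch testable without mutating `os.environ`. Even when the check returns `True`, the comment in `envs.py` only says MC2 "might be replaced", so a caller should treat the result as eligibility, not as a guarantee.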