[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (#5758)
### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.
As a stopgap, I previously implemented a check that unconditionally
disables operator `DispatchGmmCombineDecode` whenever the speculative
method is `EAGLE` or `EAGLE-3`. [PR:
5293](https://github.com/vllm-project/vllm-ascend/pull/5293)
However, that approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.
For more information about this operator, please refer to the RFC:
https://github.com/vllm-project/vllm-ascend/issues/5476
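The refined gating rule described above can be sketched as a small standalone function. This is a hedged sketch, not vLLM's actual code: the config objects here are plain stand-ins (`SimpleNamespace`), and only the attribute names (`moe_quantize`, `quantize`, `"w8a8_dynamic"`) follow this PR's patch.

```python
# Sketch of the refined gating logic: keep DispatchGmmCombineDecode enabled
# unless the draft model is MOE and not W8A8-dynamic. The configs below are
# stand-ins, not vLLM's real VllmConfig.
from types import SimpleNamespace


def drafter_allows_dispatch_gmm_combine_decode(draft_hf_text_config,
                                               drafter_is_moe: bool) -> bool:
    """Return True when DispatchGmmCombineDecode may stay enabled."""
    if not drafter_is_moe:
        # A dense (non-MOE) draft model never takes the MOE dispatch/combine
        # path, so the fused operator can stay on.
        return True
    # For an MOE drafter, keep the operator only under W8A8-dynamic
    # quantization; fall back from the MOE-specific field to the generic one.
    quant_type = getattr(draft_hf_text_config, "moe_quantize", None)
    if quant_type is None:
        quant_type = getattr(draft_hf_text_config, "quantize", None)
    return quant_type == "w8a8_dynamic"


# A W8A8-dynamic MOE drafter keeps the operator enabled:
w8a8_moe = SimpleNamespace(moe_quantize="w8a8_dynamic")
assert drafter_allows_dispatch_gmm_combine_decode(w8a8_moe, drafter_is_moe=True)

# An unquantized MOE drafter disables it:
fp_moe = SimpleNamespace()
assert not drafter_allows_dispatch_gmm_combine_decode(fp_moe, drafter_is_moe=True)
```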
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Accuracy test: qwen3-235b with EPLB on a single A3 node (EP16),
with `dispatch_gmm_combine_decode` enabled:
```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
--served-model-name "qwen" \
--host 0.0.0.0 \
--port 8004 \
--async-scheduling \
--tensor-parallel-size 4 \
--data-parallel-size 4 \
--max-num-seqs 64 \
--max-model-len 40960 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--quantization "ascend" \
--trust-remote-code \
--speculative_config \
'{
"method": "eagle3",
"model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
"num_speculative_tokens": 2
}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
2>&1 | tee qwen3_235b_eagle3.log
```
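Once the server from the command above is up, it can be smoke-checked with a plain OpenAI-compatible chat request. This is a hypothetical client sketch: it assumes the host/port (`127.0.0.1:8004`) and served model name (`qwen`) from the launch command above, and uses only the standard library.

```python
# Minimal smoke-test client for the vLLM OpenAI-compatible server launched
# above. Host, port, and model name mirror the serve command and are
# assumptions about the local setup.
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen") -> bytes:
    """Build a /v1/chat/completions request body as JSON bytes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.0,
    }
    return json.dumps(payload).encode("utf-8")


def send(prompt: str,
         url: str = "http://127.0.0.1:8004/v1/chat/completions") -> dict:
    """POST the prompt to the local server and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    print(send("What is 2 + 2?")["choices"][0]["message"]["content"])
```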
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
```diff
@@ -869,13 +869,22 @@ def is_drafter_moe_model(vllm_config: VllmConfig):
 def speculative_enable_dispatch_gmm_combine_decode(vllm_config: VllmConfig) -> bool:
     """When draft contains MOE Arch and non-w8a8, disable dispatch_gmm_combine_decode."""
     if vllm_config.speculative_config is None:
         return True
     speculative_method = getattr(vllm_config.speculative_config, "method", None)
     if speculative_method in [None, "ngram", "suffix"]:
         return True
     if speculative_method in ["eagle", "eagle3"]:
-        return False
+        if is_drafter_moe_model(vllm_config):
+            draft_model_config = vllm_config.speculative_config.draft_model_config
+            hf_text_config = draft_model_config.hf_text_config
+            quant_type = getattr(hf_text_config, "moe_quantize", None)
+            if quant_type is None:
+                quant_type = getattr(hf_text_config, "quantize", None)
+            return quant_type == "w8a8_dynamic"
+        else:
+            return True
     if speculative_method == "mtp":
         mtp_quant_type = getattr(vllm_config.model_config.hf_text_config, "mtp_quantize", None)
         return mtp_quant_type == "w8a8_dynamic"
```