[Feature] Enable DispatchGmmCombineDecode when the eagle draft model is W8A8 MoE or not MoE [RFC: issue 5476] (#5758)
### What this PR does / why we need it?
The `DispatchGmmCombineDecode` operator does not support non-W8A8 scenarios
and cannot share a communication domain with the `Dispatch`/`Combine`
operators.
> For instance, the draft model may use a non-W8A8 MoE architecture
while the main model employs a W8A8 MoE architecture.

In [PR 5293](https://github.com/vllm-project/vllm-ascend/pull/5293), I
therefore implemented an interception that unconditionally disables
`DispatchGmmCombineDecode` whenever the speculative method is `EAGLE` or
`EAGLE-3`. However, that approach was not precise enough.
This PR refines the logic by inspecting the draft model's configuration:
`DispatchGmmCombineDecode` is now disabled only when the draft model uses
an MoE architecture and is not W8A8. A sketch of the condition follows
below.
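
A minimal sketch of the intended gating condition. All names here are
hypothetical (`draft_model_config`, `is_moe`, `quant_type` are illustrative,
not the actual vllm-ascend identifiers):

```python
# Hypothetical sketch of the refined gating logic; the real check lives
# in vllm-ascend's speculative/quantization config handling.
def should_disable_dispatch_gmm_combine_decode(draft_model_config) -> bool:
    """Disable the fused op only for a non-W8A8 MoE draft model."""
    is_moe = getattr(draft_model_config, "is_moe", False)
    is_w8a8 = getattr(draft_model_config, "quant_type", None) == "w8a8"
    # Non-MoE drafts, and W8A8 MoE drafts, keep DispatchGmmCombineDecode.
    return is_moe and not is_w8a8
```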

For more information about this operator, please refer to the RFC:
https://github.com/vllm-project/vllm-ascend/issues/5476

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Accuracy test: Qwen3-235B with EPLB on a single A3 node (EP16),
with `dispatch_gmm_combine_decode` enabled:

```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

# Opt into the fused MC2 (DispatchGmmCombineDecode) path.
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
        --served-model-name "qwen" \
        --host 0.0.0.0 \
        --port 8004 \
        --async-scheduling \
        --tensor-parallel-size 4 \
        --data-parallel-size 4 \
        --max-num-seqs 64 \
        --max-model-len 40960 \
        --max-num-batched-tokens 16384 \
        --gpu-memory-utilization 0.9 \
        --enable-expert-parallel \
        --no-enable-prefix-caching \
        --quantization "ascend" \
        --trust-remote-code \
        --speculative_config \
        '{
            "method": "eagle3",
            "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
            "num_speculative_tokens": 2
        }' \
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
        2>&1 | tee qwen3_235b_eagle3.log
```
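
Once the server is up, a quick sanity check can be sent against the
OpenAI-compatible endpoint. The snippet below is illustrative (the prompt is
arbitrary; host, port, and served model name match the `vllm serve` flags
above):

```python
# Illustrative sanity-check request against the server launched above
# (port 8004, served model name "qwen").
import requests

resp = requests.post(
    "http://127.0.0.1:8004/v1/chat/completions",
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```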

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>