[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (#5758)
### What this PR does / why we need it?
Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios
and cannot share the same communication domain with Operator
`Dispatch`/`Combine`.
> for instance, when the draft model uses a non-W8A8 MOE architecture
while the main model employs a W8A8 MOE architecture.
As a stopgap, I previously implemented a check that unconditionally
disables operator `DispatchGmmCombineDecode` whenever the speculative
method is `EAGLE` or `EAGLE-3`. [PR:
5293](https://github.com/vllm-project/vllm-ascend/pull/5293)
However, that approach was not precise enough.
This PR further refines the logic by specifically identifying the draft
model's configuration: Operator `DispatchGmmCombineDecode` will now be
disabled only when the draft model uses an MOE architecture and is
non-W8A8.
For more information about this operator, please refer to the RFC:
https://github.com/vllm-project/vllm-ascend/issues/5476
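The refined gating rule described above can be sketched as a small standalone function. This is a hedged sketch, not vLLM's actual code: the config objects here are plain stand-ins (`SimpleNamespace`), and only the attribute names (`moe_quantize`, `quantize`, `"w8a8_dynamic"`) follow this PR's patch.

```python
# Sketch of the refined gating logic: keep DispatchGmmCombineDecode enabled
# unless the draft model is MOE and not W8A8-dynamic. The configs below are
# stand-ins, not vLLM's real VllmConfig.
from types import SimpleNamespace


def drafter_allows_dispatch_gmm_combine_decode(draft_hf_text_config,
                                               drafter_is_moe: bool) -> bool:
    """Return True when DispatchGmmCombineDecode may stay enabled."""
    if not drafter_is_moe:
        # A dense (non-MOE) draft model never takes the MOE dispatch/combine
        # path, so the fused operator can stay on.
        return True
    # For an MOE drafter, keep the operator only under W8A8-dynamic
    # quantization; fall back from the MOE-specific field to the generic one.
    quant_type = getattr(draft_hf_text_config, "moe_quantize", None)
    if quant_type is None:
        quant_type = getattr(draft_hf_text_config, "quantize", None)
    return quant_type == "w8a8_dynamic"


# A W8A8-dynamic MOE drafter keeps the operator enabled:
w8a8_moe = SimpleNamespace(moe_quantize="w8a8_dynamic")
assert drafter_allows_dispatch_gmm_combine_decode(w8a8_moe, drafter_is_moe=True)

# An unquantized MOE drafter disables it:
fp_moe = SimpleNamespace()
assert not drafter_allows_dispatch_gmm_combine_decode(fp_moe, drafter_is_moe=True)
```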
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Accuracy test: qwen3-235b with EPLB on a single A3 node (EP16),
with `dispatch_gmm_combine_decode` enabled:
```shell
nic_name="xxxx"
local_ip="xxx.xxx.xxx.xxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \
--served-model-name "qwen" \
--host 0.0.0.0 \
--port 8004 \
--async-scheduling \
--tensor-parallel-size 4 \
--data-parallel-size 4 \
--max-num-seqs 64 \
--max-model-len 40960 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--quantization "ascend" \
--trust-remote-code \
--speculative_config \
'{
"method": "eagle3",
"model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/",
"num_speculative_tokens": 2
}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
2>&1 | tee qwen3_235b_eagle3.log
```
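Once the server from the command above is up, it can be smoke-checked with a plain OpenAI-compatible chat request. This is a hypothetical client sketch: it assumes the host/port (`127.0.0.1:8004`) and served model name (`qwen`) from the launch command above, and uses only the standard library.

```python
# Minimal smoke-test client for the vLLM OpenAI-compatible server launched
# above. Host, port, and model name mirror the serve command and are
# assumptions about the local setup.
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen") -> bytes:
    """Build a /v1/chat/completions request body as JSON bytes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.0,
    }
    return json.dumps(payload).encode("utf-8")


def send(prompt: str,
         url: str = "http://127.0.0.1:8004/v1/chat/completions") -> dict:
    """POST the prompt to the local server and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    print(send("What is 2 + 2?")["choices"][0]["message"]["content"])
```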
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
```diff
@@ -869,13 +869,22 @@ def is_drafter_moe_model(vllm_config: VllmConfig):
 def speculative_enable_dispatch_gmm_combine_decode(vllm_config: VllmConfig) -> bool:
     """When draft contains MOE Arch and non-w8a8, disable dispatch_gmm_combine_decode."""
     if vllm_config.speculative_config is None:
         return True
     speculative_method = getattr(vllm_config.speculative_config, "method", None)
     if speculative_method in [None, "ngram", "suffix"]:
         return True
     if speculative_method in ["eagle", "eagle3"]:
-        return False
+        if is_drafter_moe_model(vllm_config):
+            draft_model_config = vllm_config.speculative_config.draft_model_config
+            hf_text_config = draft_model_config.hf_text_config
+            quant_type = getattr(hf_text_config, "moe_quantize", None)
+            if quant_type is None:
+                quant_type = getattr(hf_text_config, "quantize", None)
+            return quant_type == "w8a8_dynamic"
+        else:
+            return True
     if speculative_method == "mtp":
         mtp_quant_type = getattr(vllm_config.model_config.hf_text_config, "mtp_quantize", None)
         return mtp_quant_type == "w8a8_dynamic"
```