[Feat] Xlite Qwen3 MoE Support Data Parallel (#6715)

### What this PR does / why we need it?
This patch adds data-parallel support for Qwen3-MoE in Xlite. For more
details about Xlite, please refer to
[https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md).

Online server config:
```shell
port=$1
log=$2
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_NZ=0
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
ip=127.0.0.1
python -m vllm.entrypoints.openai.api_server \
        --model /mnt/nvme1n1/wy/models/Qwen3-30B-A3B  \
        --tensor-parallel-size 2 \
        --enable-expert-parallel \
        --data-parallel-size 4 \
        --gpu-memory-utilization 0.9 \
        --max-num-batched-tokens 32768 \
        --data-parallel-size-local 4 \
        --max-num-seqs=200 \
        --block-size 128 \
        --max-model-len 6656 \
        --trust-remote-code \
        --disable-log-requests \
        --served-model-name qwen \
        --no-enable-prefix-caching \
        --additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}, "enable_cpu_binding": true}' \
        --compilation-config '{"cudagraph_capture_sizes": [1, 16, 32, 48, 64, 100, 150, 200], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
        --async-scheduling \
        --host ${ip} \
        --port ${port} > ${log} 2>&1 &
```
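Once the server is up, it can be smoke-tested against the OpenAI-compatible chat endpoint. Below is a minimal standard-library sketch, not part of this PR: the host, port, and served model name (`qwen`) are taken from the launch script above, and `build_chat_request` is a helper name invented here.

```python
import json
import urllib.request


def build_chat_request(host: str, port: int, prompt: str) -> urllib.request.Request:
    """Build a request for the /v1/chat/completions endpoint.

    Assumes the server was launched with --served-model-name qwen,
    as in the config above.
    """
    body = json.dumps({
        "model": "qwen",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )


# Against a live server:
# with urllib.request.urlopen(build_chat_request("127.0.0.1", 8000, "hi"),
#                             timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```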
Test config:
```shell
vllm bench serve \
    --max-concurrency ${maxconcurrency} \
    --num-prompts ${num_prompts} \
    --host ${HOST} \
    --port ${PORT} \
    --model ${MODEL_NAME} \
    --dataset-name random \
    --backend openai-chat \
    --random-input-len 512 \
    --random-output-len 512  \
    --random-range-ratio 0.2 \
    --temperature 0.6 \
    --metric-percentiles "50,90,99" \
    --tokenizer ${TOKENIZER_PATH} \
    --endpoint /v1/chat/completions \
    --ignore-eos
``` 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.16.0
- vLLM main: c86cdcbcd2

Signed-off-by: uuzWY <Ethan.wangyuan@huawei.com>
Co-authored-by: uuzWY <Ethan.wangyuan@huawei.com>
Commit 82fdd40d49 (parent ba1c82e758), authored by 王远 and committed by
GitHub on 2026-03-09 17:53:35 +08:00; 5 changed files with 59 additions
and 19 deletions.

```diff
@@ -2169,6 +2169,20 @@ class NPUModelRunner(GPUModelRunner):
             spec_decode_common_attn_metadata = spec_decode_common_attn_metadata.unpadded(num_tokens, num_reqs)
         return attn_metadata, spec_decode_common_attn_metadata
 
+    def _should_build_dummy_attn_metadata(
+        self,
+        force_attention: bool = False,
+        is_profile: bool = False,
+        cudagraph_runtime_mode: CUDAGraphMode | None = None,
+    ) -> bool:
+        """
+        Determine whether attention metadata should be built during dummy_run.
+        Subclasses can override this to add custom conditions.
+        """
+        # If force_attention is True, we always capture attention. Otherwise,
+        # it only happens for cudagraph_runtime_mode=FULL.
+        return force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL
+
     @torch.inference_mode()
     def _dummy_run(
         self,
@@ -2272,9 +2286,8 @@ class NPUModelRunner(GPUModelRunner):
         # vllm-ascend does not support ubatch now
         ubatch_slices, ubatch_slices_padded = None, None
         attn_metadata: PerLayerAttnMetadata | None = None
-        # If force_attention is True, we always capture attention. Otherwise,
-        # it only happens for cudagraph_runtime_mode=FULL.
-        if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
+        # Build attention metadata for dummy_run
+        if self._should_build_dummy_attn_metadata(force_attention, is_profile, cudagraph_runtime_mode):
             if create_mixed_batch:
                 raise NotImplementedError(
                     "create_mixed_batch is used for warmup deepgemm, vllm-ascend does not need it"
```