[Feat] Xlite Qwen3 MoE Support Data Parallel (#6715)

### What this PR does / why we need it?
This patch adds data-parallel support for Qwen3-MoE in Xlite. For more
details about Xlite, please refer to
[https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md).

Online server config:
```shell
port=$1
log=$2
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_NZ=0
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
ip=127.0.0.1
python -m vllm.entrypoints.openai.api_server \
        --model /mnt/nvme1n1/wy/models/Qwen3-30B-A3B  \
        --tensor-parallel-size 2 \
        --enable-expert-parallel \
        --data-parallel-size 4 \
        --gpu-memory-utilization 0.9 \
        --max-num-batched-tokens 32768 \
        --data-parallel-size-local 4 \
        --max-num-seqs=200 \
        --block-size 128 \
        --max-model-len 6656 \
        --trust-remote-code \
        --disable-log-requests \
        --served-model-name qwen \
        --no-enable-prefix-caching \
        --additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}, "enable_cpu_binding": true}' \
        --compilation-config '{"cudagraph_capture_sizes": [1, 16, 32, 48, 64, 100, 150, 200], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
        --async-scheduling \
        --host ${ip} \
        --port ${port} > ${log} 2>&1 &
```
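Once the server is up, it can be smoke-tested against the OpenAI-compatible chat endpoint. Below is a minimal standard-library sketch, not part of this PR: the host, port, and served model name (`qwen`) are taken from the launch script above, and `build_chat_request` is a helper name invented here.

```python
import json
import urllib.request


def build_chat_request(host: str, port: int, prompt: str) -> urllib.request.Request:
    """Build a request for the /v1/chat/completions endpoint.

    Assumes the server was launched with --served-model-name qwen,
    as in the config above.
    """
    body = json.dumps({
        "model": "qwen",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )


# Against a live server:
# with urllib.request.urlopen(build_chat_request("127.0.0.1", 8000, "hi"),
#                             timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```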
Test config:
```shell
vllm bench serve \
    --max-concurrency ${maxconcurrency} \
    --num-prompts ${num_prompts} \
    --host ${HOST} \
    --port ${PORT} \
    --model ${MODEL_NAME} \
    --dataset-name random \
    --backend openai-chat \
    --random-input-len 512 \
    --random-output-len 512  \
    --random-range-ratio 0.2 \
    --temperature 0.6 \
    --metric-percentiles "50,90,99" \
    --tokenizer ${TOKENIZER_PATH} \
    --endpoint /v1/chat/completions \
    --ignore-eos
``` 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.16.0
- vLLM main: c86cdcbcd2

Signed-off-by: uuzWY <Ethan.wangyuan@huawei.com>
Co-authored-by: uuzWY <Ethan.wangyuan@huawei.com>
Commit 82fdd40d49 (parent ba1c82e758), authored by 王远 and committed by
GitHub on 2026-03-09 17:53:35 +08:00; 5 changed files with 59 additions
and 19 deletions.

```diff
@@ -2169,6 +2169,20 @@ class NPUModelRunner(GPUModelRunner):
             spec_decode_common_attn_metadata = spec_decode_common_attn_metadata.unpadded(num_tokens, num_reqs)
         return attn_metadata, spec_decode_common_attn_metadata
 
+    def _should_build_dummy_attn_metadata(
+        self,
+        force_attention: bool = False,
+        is_profile: bool = False,
+        cudagraph_runtime_mode: CUDAGraphMode | None = None,
+    ) -> bool:
+        """
+        Determine whether attention metadata should be built during dummy_run.
+        Subclasses can override this to add custom conditions.
+        """
+        # If force_attention is True, we always capture attention. Otherwise,
+        # it only happens for cudagraph_runtime_mode=FULL.
+        return force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL
+
     @torch.inference_mode()
     def _dummy_run(
         self,
@@ -2272,9 +2286,8 @@ class NPUModelRunner(GPUModelRunner):
         # vllm-ascend does not support ubatch now
         ubatch_slices, ubatch_slices_padded = None, None
         attn_metadata: PerLayerAttnMetadata | None = None
-        # If force_attention is True, we always capture attention. Otherwise,
-        # it only happens for cudagraph_runtime_mode=FULL.
-        if force_attention or cudagraph_runtime_mode == CUDAGraphMode.FULL:
+        # Build attention metadata for dummy_run
+        if self._should_build_dummy_attn_metadata(force_attention, is_profile, cudagraph_runtime_mode):
             if create_mixed_batch:
                 raise NotImplementedError(
                     "create_mixed_batch is used for warmup deepgemm, vllm-ascend does not need it"
```