From eab306b09c1fc6e5cb038c2d78192d9870cbbf58 Mon Sep 17 00:00:00 2001 From: Angazenn <92204292+Angazenn@users.noreply.github.com> Date: Sat, 27 Dec 2025 15:55:58 +0800 Subject: [PATCH] [doc] Update Qwen3-235B doc for reproducing latest performance (#5323) ### What this PR does / why we need it? This PR updates Qwen3-235B doc to give a simple recipe for repreducing our latest perfomance on Atlas A3 servers. - vLLM version: release/v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/5fbfa8d9ef15948599631baeb91e8220b2ee9bcc --------- Signed-off-by: Angazenn --- docs/source/tutorials/Qwen3-235B-A22B.md | 329 ++++++++++++++++++++++- 1 file changed, 323 insertions(+), 6 deletions(-) diff --git a/docs/source/tutorials/Qwen3-235B-A22B.md b/docs/source/tutorials/Qwen3-235B-A22B.md index 38b23111..9b534c16 100644 --- a/docs/source/tutorials/Qwen3-235B-A22B.md +++ b/docs/source/tutorials/Qwen3-235B-A22B.md @@ -99,7 +99,6 @@ export VLLM_USE_MODELSCOPE=true # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=512 -export HCCL_OP_EXPANSION_MODE="AIV" export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 @@ -119,14 +118,16 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ --enable-expert-parallel \ --trust-remote-code \ --gpu-memory-utilization 0.95 \ ---rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \ +--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \ --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \ --async-scheduling ``` **Notice:** -- for vllm version below `v0.12.0` use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \` -- for vllm version `v0.12.0` use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \` +- [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) originally only supports 40960 context(max_position_embeddings). If you want to use it and its related quantization weights to run long seqs (such as 128k context), it is required to use yarn rope-scaling technique. + - For vLLM version same as or new than `v0.12.0`, use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`. + - For vllm version below `v0.12.0`, use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`. + If you are using weights like [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) which originally supports long contexts, there is no need to add this parameter. The parameters are explained as follows: - `--data-parallel-size` 1 and `--tensor-parallel-size` 8 are common settings for data parallelism (DP) and tensor parallelism (TP) sizes. @@ -169,7 +170,6 @@ export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=1024 export TASK_QUEUE_ENABLE=1 -export HCCL_OP_EXPANSION_MODE="AIV" vllm serve vllm-ascend/Qwen3-235B-A22B \ --host 0.0.0.0 \ @@ -215,7 +215,6 @@ export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=1024 export TASK_QUEUE_ENABLE=1 -export HCCL_OP_EXPANSION_MODE="AIV" vllm serve vllm-ascend/Qwen3-235B-A22B \ --host 0.0.0.0 \ @@ -311,3 +310,321 @@ vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random ``` After about several minutes, you can get the performance evaluation result. + +## Reproducing Performance Results + +In this section, we provide simple scripts to re-produce our latest performance. It is also recommended to read instructions above to understand basic concepts or options in vLLM && vLLM-Ascend. + +### Environment + +- vLLM v0.13.0 +- vLLM-Ascend v0.13.0rc1 +- CANN 8.3.RC2 +- torch_npu 2.8.0 +- HDK/driver 25.3.RC1 +- triton_ascend 3.2.0.dev2025110717 + +**Notice:** +triton_ascend is required for reproducing best performance of Qwen3-235B in vLLM-Ascend. If it is not installed in your environment, please follow the instructions below: + +```bash +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl +pip install triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl +``` + +### Single Node A3 (64G*16) + +Example server scripts: + +```shell +#!/bin/sh +# Load model from ModelScope to speed up download +export VLLM_USE_MODELSCOPE=true +# To reduce memory fragmentation and avoid out of memory +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_BUFFSIZE=512 +export HCCL_OP_EXPANSION_MODE="AIV" +export OMP_PROC_BIND=false +export OMP_NUM_THREADS=1 +export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 +export VLLM_ASCEND_ENABLE_FUSED_MC2=1 +export TASK_QUEUE_ENABLE=1 + +vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ +--host 0.0.0.0 \ +--port 8000 \ +--tensor-parallel-size 4 \ +--data-parallel-size 4 \ +--seed 1024 \ +--quantization ascend \ +--served-model-name qwen3 \ +--max-num-seqs 128 \ +--max-model-len 40960 \ +--max-num-batched-tokens 16384 \ +--enable-expert-parallel \ +--trust-remote-code \ +--gpu-memory-utilization 0.9 \ +--no-enable-prefix-caching \ +--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \ +--async-scheduling +``` + +Benchmark scripts: + +```shell +vllm bench serve --model qwen3 \ +--tokenizer vllm-ascend/Qwen3-235B-A22B-w8a8 \ +--ignore-eos \ +--dataset-name random \ +--random-input-len 3584 \ +--random-output-len 1536 \ +--num-prompts 800 \ +--max-concurrency 160 \ +--request-rate 24 \ +--host 0.0.0.0 \ +--port 8000 \ +``` + +Reference test results: + +| num_requests | concurrency | mean TTFT(ms) | mean TPOT(ms) | output token throughput (tok/s) | +|----- | ----- | ----- | ----- | -----| +| 720 | 144 | 4717.45 | 48.69 | 2761.72 | + +Note: +1. Setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` enables MoE fused operators that reduce time consumption of MoE in both prefill and decode. This is an experimental feature which only supports W8A8 quantization on Atlas A3 servers now. If you encounter any problems when using this feature, you can disable it by setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` and update issues in vLLM-Ascend community. +2. Here we disable prefix cache because of random datasets. You can enable prefix cache if requests have long common prefix. + +### Three Node A3 -- PD disaggregation + +On three Atlas 800 A3(64G*16)server, we recommend to use one node as one prefill instance and two nodes as one decode instance. Example server scripts: +Prefill Node 1 + +```shell +#!/bin/sh +export HCCL_IF_IP=prefill_node_1_ip + +# Set ifname according to your network setting +ifname="" + +export GLOO_SOCKET_IFNAME=${ifname} +export TP_SOCKET_IFNAME=${ifname} +export HCCL_SOCKET_IFNAME=${ifname} + +# Load model from ModelScope to speed up download +export VLLM_USE_MODELSCOPE=true +# To reduce memory fragmentation and avoid out of memory +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_BUFFSIZE=512 +export HCCL_OP_EXPANSION_MODE="AIV" +export OMP_PROC_BIND=false +export OMP_NUM_THREADS=1 +export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 +export VLLM_ASCEND_ENABLE_FUSED_MC2=2 +export TASK_QUEUE_ENABLE=1 +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH + +vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ +--host 0.0.0.0 \ +--port 8000 \ +--tensor-parallel-size 8 \ +--data-parallel-size 2 \ +--data-parallel-size-local 2 \ +--data-parallel-start-rank 0 \ +--data-parallel-address prefill_node_1_ip \ +--data-parallel-rpc-port prefill_node_dp_port \ +--seed 1024 \ +--quantization ascend \ +--served-model-name qwen3 \ +--max-num-seqs 24 \ +--max-model-len 40960 \ +--max-num-batched-tokens 16384 \ +--enable-expert-parallel \ +--enforce-eager \ +--trust-remote-code \ +--gpu-memory-utilization 0.9 \ +--enforce-eager \ +--no-enable-prefix-caching \ +--kv-transfer-config \ +'{"kv_connector": "MooncakeConnectorV1", +"kv_role": "kv_producer", +"kv_port": "30000", +"engine_id": "0", +"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector", +"kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 2, + "tp_size": 8 + }, + "decode": { + "dp_size": 8, + "tp_size": 4 + } +} +}' +``` + +Decode Node 1 + +```shell +#!/bin/sh +export HCCL_IF_IP=decode_node_1_ip + +ifname="" + +export GLOO_SOCKET_IFNAME=${ifname} +export TP_SOCKET_IFNAME=${ifname} +export HCCL_SOCKET_IFNAME=${ifname} + +# Load model from ModelScope to speed up download +export VLLM_USE_MODELSCOPE=true +# To reduce memory fragmentation and avoid out of memory +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_BUFFSIZE=1024 +export HCCL_OP_EXPANSION_MODE="AIV" +export OMP_PROC_BIND=false +export OMP_NUM_THREADS=1 +export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 +export VLLM_ASCEND_ENABLE_FUSED_MC2=2 +export TASK_QUEUE_ENABLE=1 +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH + +vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ +--host 0.0.0.0 \ +--port 8000 \ +--tensor-parallel-size 4 \ +--data-parallel-size 8 \ +--data-parallel-size-local 4 \ +--data-parallel-start-rank 0 \ +--data-parallel-address decode_node_1_ip \ +--data-parallel-rpc-port decode_node_dp_port \ +--seed 1024 \ +--quantization ascend \ +--served-model-name qwen3 \ +--max-num-seqs 128 \ +--max-model-len 40960 \ +--max-num-batched-tokens 256 \ +--enable-expert-parallel \ +--trust-remote-code \ +--gpu-memory-utilization 0.9 \ +--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \ +--async-scheduling \ +--no-enable-prefix-caching \ +--kv-transfer-config \ +'{"kv_connector": "MooncakeConnectorV1", +"kv_role": "kv_consumer", +"kv_port": "30100", +"engine_id": "1", +"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector", +"kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 2, + "tp_size": 8 + }, + "decode": { + "dp_size": 8, + "tp_size": 4 + } +} +}' +``` + +Decode Node 2 + +```shell +#!/bin/sh +export HCCL_IF_IP=decode_node_2_ip + +ifname="" + +export GLOO_SOCKET_IFNAME=${ifname} +export TP_SOCKET_IFNAME=${ifname} +export HCCL_SOCKET_IFNAME=${ifname} + +# Load model from ModelScope to speed up download +export VLLM_USE_MODELSCOPE=true +# To reduce memory fragmentation and avoid out of memory +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export HCCL_BUFFSIZE=1024 +export HCCL_OP_EXPANSION_MODE="AIV" +export OMP_PROC_BIND=false +export OMP_NUM_THREADS=1 +export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 +export VLLM_ASCEND_ENABLE_FUSED_MC2=2 +export TASK_QUEUE_ENABLE=1 +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH + +vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ +--host 0.0.0.0 \ +--port 8000 \ +--headless \ +--tensor-parallel-size 4 \ +--data-parallel-size 8 \ +--data-parallel-size-local 4 \ +--data-parallel-start-rank 4 \ +--data-parallel-address decode_node_1_ip \ +--data-parallel-rpc-port decode_node_dp_port \ +--seed 1024 \ +--quantization ascend \ +--served-model-name qwen3 \ +--max-num-seqs 128 \ +--max-model-len 40960 \ +--max-num-batched-tokens 256 \ +--enable-expert-parallel \ +--trust-remote-code \ +--gpu-memory-utilization 0.9 \ +--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \ +--async-scheduling \ +--no-enable-prefix-caching \ +--kv-transfer-config \ +'{"kv_connector": "MooncakeConnectorV1", +"kv_role": "kv_consumer", +"kv_port": "30100", +"engine_id": "1", +"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector", +"kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 2, + "tp_size": 8 + }, + "decode": { + "dp_size": 8, + "tp_size": 4 + } +} +}' +``` + +PD proxy: + +``` +python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-port 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000 +``` + +Benchmark scripts: + +```shell +vllm bench serve --model qwen3 \ +--tokenizer vllm-ascend/Qwen3-235B-A22B-w8a8 \ +--ignore-eos \ +--dataset-name random \ +--random-input-len 3584 \ +--random-output-len 1536 \ +--num-prompts 2880 \ +--max-concurrency 576 \ +--request-rate 8 \ +--host 0.0.0.0 \ +--port 12347 \ +``` + +Reference test results: + +| num_requests | concurrency | mean TTFT(ms) | mean TPOT(ms) | output token throughput (tok/s) | +|----- | ----- | ----- | ----- | -----| +| 2880 | 576 | 3735.98 | 52.07 | 8593.44 | + +Note: +1. We recommend to set `export VLLM_ASCEND_ENABLE_FUSED_MC2=2` on this scenario (typically EP32 for Qwen3-235B). This enables a different MoE fusion operator.