[Doc]Refresh model tutorial examples and serving commands (#7426)
### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace the `SamplingParams` argument `max_completion_tokens` with `max_tokens` (**offline** inference currently **does not support** `max_completion_tokens`; a corrected sketch follows the traceback below)
``` bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
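For reference, a minimal offline-inference sketch with the corrected argument (the model path and prompt here are placeholders, not the tutorial's exact values):
``` python
from vllm import LLM, SamplingParams

# Offline inference accepts `max_tokens`, not `max_completion_tokens`.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)

llm = LLM(model="your/model-path")  # placeholder: any supported model path
outputs = llm.generate(["The future of AI is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```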
- refresh the recommended environment variables for **Qwen3-Omni-30B-A3B-Thinking**
``` bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
``` bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
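Plugging the values from this log into its own formula gives roughly 304 MB of required buffer, above the 200 MB default, which is why 512 MB is now recommended. A quick sanity check (values copied from the log above):
``` python
# Values reported in the EZ9999 log above.
max_bs, ep_world_size, local_moe_expert_num, shared_expert_num = 256, 2, 64, 0
token_need_size_dispatch, token_need_size_combine, k = 4608, 4096, 8

needed = ((max_bs * token_need_size_dispatch * ep_world_size * local_moe_expert_num)
          + (max_bs * token_need_size_combine * (k + shared_expert_num))) * 2
print(needed // (1024 * 1024))  # ~304 MB (the log rounds to 305 MB) > 200 MB default
```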
- fix **Qwen3-reranker** example usage to match the current **pooling
runner** interface and score output access
``` python
model = LLM(
model=model_name,
task="score", # need fix
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
--->
``` python
model = LLM(
model=model_name,
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
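A hedged sketch of how scores can then be read back through the pooling runner (the query/document strings are placeholders, and the instruction formatting used by the full tutorial example is omitted):
``` python
# Assumes `model` was built with runner="pooling" and the overrides shown above.
queries = ["What is the capital of China?"]         # placeholder query
documents = ["The capital of China is Beijing."]    # placeholder document

outputs = model.score(queries, documents)
print([output.outputs.score for output in outputs])  # one relevance score per pair
```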
- change the **PaddleOCR-VL** environment variable `TASK_QUEUE_ENABLE` from `2` to `1`
``` bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
These changes are needed because several documentation examples had
drifted from the current runtime behavior and recommended invocation
patterns, which could confuse users when following the tutorials
directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
````diff
@@ -8,11 +8,7 @@ Welcome to the tutorial on optimizing Qwen Dense models in the vLLM-Ascend envir
 This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, accuracy and performance evaluation.
 
-The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429)
-
-## **Node**
-
-This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
+The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429). This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
 
 ## Supported Features
@@ -115,12 +111,13 @@ The specific example scenario is as follows:
 
 ### Run docker container
 
-#### **Node**
+:::{note}
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
 - v0.11.0rc2-a3 is image tag, replace this with your actual tag.
 - replace this with your actual port: '-p 8113:8113'.
 - replace this with your actual card: '--device /dev/davinci0'.
+:::
 
 ```{code-block} bash
 :substitutions:
@@ -142,7 +139,7 @@ docker run --rm \
 -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
 -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
 -v /etc/ascend_install.info:/etc/ascend_install.info \
--v /model/Qwen3-32B-W8A8:/model/Qwen3-32B-W8A8 \
+-v /root/.cache:/root/.cache \
 -p 8113:8113 \
 -it $IMAGE bash
 ```
@@ -174,7 +171,7 @@ export HCCL_OP_EXPANSION_MODE="AIV"
 # Enable FlashComm_v1 optimization when tensor parallel is enabled.
 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
 
-vllm serve /model/Qwen3-32B-W8A8 \
+vllm serve vllm-ascend/Qwen3-32B-W8A8 \
 --served-model-name qwen3 \
 --trust-remote-code \
 --async-scheduling \
@@ -190,15 +187,16 @@ vllm serve /model/Qwen3-32B-W8A8 \
 --gpu-memory-utilization 0.9
 ```
 
-#### **Node**
+:::{note}
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
 
 - If the model is not a quantized model, remove the `--quantization ascend` parameter.
 
 - **[Optional]** `--additional-config '{"pa_shape_list":[48,64,72,80]}'`: `pa_shape_list` specifies the batch sizes where you want to switch to the PA operator. This is a temporary tuning knob. Currently, the attention operator dispatch defaults to the FIA operator. In some batch-size (concurrency) settings, FIA may have suboptimal performance. By setting `pa_shape_list`, when the runtime batch size matches one of the listed values, vLLM-Ascend will replace FIA with the PA operator to prevent performance degradation. In the future, FIA will be optimized for these scenarios and this parameter will be removed.
 
 - If the ultimate performance is desired, the cudagraph_capture_sizes parameter can be enabled, reference: [key-optimization-points](./Qwen3-Dense.md#key-optimization-points)、[optimization-highlights](./Qwen3-Dense.md#optimization-highlights). Here is an example of batchsize of 72: `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}'`.
+:::
 
 Once your server is started, you can query the model with input prompts
@@ -219,11 +217,12 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso
 
 Run the following script to execute offline inference on multi-NPU.
 
-#### **Node**
+:::{note}
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
 
 - If the model is not a quantized model,remove the `quantization="ascend"` parameter.
+:::
 
 ```python
 import gc
@@ -244,7 +243,7 @@ prompts = [
 "The future of AI is",
 ]
 sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
-llm = LLM(model="/model/Qwen3-32B-W8A8",
+llm = LLM(model="vllm-ascend/Qwen3-32B-W8A8",
 tensor_parallel_size=4,
 trust_remote_code=True,
 distributed_executor_backend="mp",
@@ -299,12 +298,13 @@ There are three `vllm bench` subcommands:
 
 Take the `serve` as an example. Run the code as follows.
 
-#### **Node**
+:::{note}
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
+:::
 
 ```shell
-vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model vllm-ascend/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 
 After about several minutes, you can get the performance evaluation result.
````