[Doc] Refresh model tutorial examples and serving commands (#7426)
### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace the `max_completion_tokens` argument of `SamplingParams` with
`max_tokens` (**offline** inference currently **does not support**
`max_completion_tokens` and fails with the traceback below)
```bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
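For reference, the updated examples pass `max_tokens` instead. A minimal sketch matching the tutorial's sampling settings:
```python
from vllm import SamplingParams

# Offline inference limits generation length via max_tokens;
# max_completion_tokens is only an OpenAI-style chat-completions field.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
```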
- refresh the recommended environment variables for
**Qwen3-Omni-30B-A3B-Thinking**
```bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
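Without the larger buffer, MoE dispatch tiling fails with an error like: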
```bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
- fix the **Qwen3-Reranker** example to match the current **pooling
runner** interface and score output access
```python
model = LLM(
    model=model_name,
    task="score",  # needs fixing: replaced by runner="pooling" below
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
```
--->
```python
model = LLM(
    model=model_name,
    runner="pooling",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
```
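With the pooling runner, each score is read from `output.outputs.score` rather than `output.outputs[0].score`. A minimal usage sketch based on the updated example, with `query` and `documents` standing in for the tutorial's actual inputs:
```python
# Score one query against a list of candidate documents; each output
# holds a single pooling result whose .score field is the relevance value.
outputs = model.score(query, documents)
print([output.outputs.score for output in outputs])
```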
- change the **PaddleOCR-VL** environment variable `TASK_QUEUE_ENABLE` from `2` to `1`
```bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
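The working setting, as used in the updated script:
```bash
# TASK_QUEUE_ENABLE=2 is rejected during NPU graph capture; 1 or 0 works.
export TASK_QUEUE_ENABLE=1
```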
These changes are needed because several documentation examples had
drifted from the current runtime behavior and recommended invocation
patterns, which could confuse users who follow the tutorials directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main: 4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
@@ -25,8 +25,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8`(Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
- `DeepSeek-V3.1-w8a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot).
- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- `Method of Quantify`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantify the model.

@@ -17,7 +17,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `DeepSeek-V3.2-Exp`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
- `DeepSeek-V3.2-Exp-W8A8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8)
- `DeepSeek-V3.2`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. Model weight in BF16 not found now.
- `DeepSeek-V3.2-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)

@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

- `GLM-5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
- `GLM-5-w4a8`: [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8).
- `GLM-5-w8a8`: [Download model weight](https://ai.gitcode.com/Eco-Tech/GLM-5-w8a8/tree/main).
- `GLM-5-w8a8`: [Download model weight](https://www.modelscope.cn/models/Eco-Tech/GLM-5-w8a8).
- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantify the model naively.

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`

@@ -1308,6 +1308,21 @@ python load_balance_proxy_server_example.py \
6721 6722 6723 6724
```

## Functional Verification

Once your server is started, you can query the model with input prompts:

```shell
curl http://<node0_ip>:<port>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5",
"prompt": "The future of AI is",
"max_completion_tokens": 50,
"temperature": 0
}'
```

## Accuracy Evaluation

Here are two accuracy evaluation methods.

@@ -113,12 +113,24 @@ Run the following script to start the vLLM server on Multi-NPU:
For an Atlas 800 A3 (64G*16) node, tensor-parallel-size should be at least 16.

```bash
vllm serve Kimi-K2-Thinking \
--served-model-name kimi-k2-thinking \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching
#!/bin/bash
export VLLM_USE_MODELSCOPE=True
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

vllm serve "moonshotai/Kimi-K2-Thinking" \
--tensor-parallel-size 16 \
--port 8000 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 12 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-expert-parallel \
--no-enable-prefix-caching
```

Once your server is started, you can query the model with input prompts.

@@ -299,7 +299,7 @@ print(resp.choices[0].message.content)
Or send a request using curl:

```{code-block} bash
curl http://127.0.0.1:8000/v1/chat/completions \
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMax-M2.5",

@@ -74,7 +74,7 @@ Run the following script to start the vLLM server on single 910B4:
#!/bin/sh
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
export TASK_QUEUE_ENABLE=2
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

@@ -142,7 +142,7 @@ llm = LLM(
)

sampling_params = SamplingParams(
max_completion_tokens=512
max_tokens=512
)

image_messages = [

@@ -122,7 +122,7 @@ Not supported yet.
After starting the service, verify functionality using a `curl` request:

```shell
curl http://<IP>:<Port>/v1/completions \
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b-instruct",

@@ -16,8 +16,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)
- `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
- `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-3B)
- `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B)

Following examples use the 7B version by default.

@@ -71,6 +71,8 @@ docker run --rm \
:::{note}
The env `LOCAL_MEDIA_PATH` which allowing API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk. Should only be enabled in trusted environments.

:::

```bash
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"

@@ -104,10 +106,10 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
```bash
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
export LOCAL_MEDIA_PATH=/local_path/to_media/
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
export DP_SIZE=8

vllm serve ${MODEL_PATH}\
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
--port 8000 \
--served-model-name Qwen-Omni \

@@ -137,7 +139,7 @@ INFO: Application startup complete.
Once your server is started, you can query the model with input prompts:

```bash
curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
"model": "Qwen-Omni",
"messages": [
{

@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)
- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

@@ -174,7 +174,7 @@ export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1

vllm serve vllm-ascend/Qwen3-235B-A22B \
vllm serve Qwen/Qwen3-235B-A22B \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \

@@ -219,7 +219,7 @@ export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1

vllm serve vllm-ascend/Qwen3-235B-A22B \
vllm serve Qwen/Qwen3-235B-A22B \
--host 0.0.0.0 \
--port 8000 \
--headless \

@@ -16,7 +16,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

`Qwen3-Coder-30B-A3B-Instruct`(BF16 version): requires 1 Atlas 800 A3 node (with 16x 64G NPUs) or 1 Atlas 800 A2 node (with 8x 64G/32G NPUs). [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Coder-30B-A3B-Instruct)
`Qwen3-Coder-30B-A3B-Instruct`(BF16 version): requires 1 Atlas 800 A3 node (with 16x 64G NPUs) or 1 Atlas 800 A2 node (with 8x 64G/32G NPUs). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Coder-30B-A3B-Instruct)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`

@@ -8,11 +8,7 @@ Welcome to the tutorial on optimizing Qwen Dense models in the vLLM-Ascend envir

This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, accuracy and performance evaluation.

The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429)

## **Node**

This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429). This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.

## Supported Features

@@ -115,12 +111,13 @@ The specific example scenario is as follows:

### Run docker container

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
- v0.11.0rc2-a3 is image tag, replace this with your actual tag.
- replace this with your actual port: '-p 8113:8113'.
- replace this with your actual card: '--device /dev/davinci0'.
:::

```{code-block} bash
:substitutions:

@@ -142,7 +139,7 @@ docker run --rm \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /model/Qwen3-32B-W8A8:/model/Qwen3-32B-W8A8 \
-v /root/.cache:/root/.cache \
-p 8113:8113 \
-it $IMAGE bash
```

@@ -174,7 +171,7 @@ export HCCL_OP_EXPANSION_MODE="AIV"
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /model/Qwen3-32B-W8A8 \
vllm serve vllm-ascend/Qwen3-32B-W8A8 \
--served-model-name qwen3 \
--trust-remote-code \
--async-scheduling \

@@ -190,15 +187,16 @@ vllm serve /model/Qwen3-32B-W8A8 \
--gpu-memory-utilization 0.9
```

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.

- If the model is not a quantized model, remove the `--quantization ascend` parameter.

- **[Optional]** `--additional-config '{"pa_shape_list":[48,64,72,80]}'`: `pa_shape_list` specifies the batch sizes where you want to switch to the PA operator. This is a temporary tuning knob. Currently, the attention operator dispatch defaults to the FIA operator. In some batch-size (concurrency) settings, FIA may have suboptimal performance. By setting `pa_shape_list`, when the runtime batch size matches one of the listed values, vLLM-Ascend will replace FIA with the PA operator to prevent performance degradation. In the future, FIA will be optimized for these scenarios and this parameter will be removed.

- If the ultimate performance is desired, the cudagraph_capture_sizes parameter can be enabled, reference: [key-optimization-points](./Qwen3-Dense.md#key-optimization-points)、[optimization-highlights](./Qwen3-Dense.md#optimization-highlights). Here is an example of batchsize of 72: `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}'`.
:::

Once your server is started, you can query the model with input prompts

@@ -219,11 +217,12 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso

Run the following script to execute offline inference on multi-NPU.

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.

- If the model is not a quantized model,remove the `quantization="ascend"` parameter.
:::

```python
import gc

@@ -244,7 +243,7 @@ prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/model/Qwen3-32B-W8A8",
llm = LLM(model="vllm-ascend/Qwen3-32B-W8A8",
tensor_parallel_size=4,
trust_remote_code=True,
distributed_executor_backend="mp",

@@ -299,12 +298,13 @@ There are three `vllm bench` subcommands:

Take the `serve` as an example. Run the code as follows.

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
:::

```shell
vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model vllm-ascend/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```

After about several minutes, you can get the performance evaluation result.

@@ -16,7 +16,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

## Weight Preparation

Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)
Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Next-80B-A3B-Instruct)

## Deployment

@@ -103,7 +103,7 @@ if __name__ == '__main__':
prompts = [
"Who are you?",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,

@@ -109,7 +109,7 @@ def clean_up():

def main():
MODEL_PATH = "Qwen3/Qwen3-Omni-30B-A3B-Thinking"
MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
llm = LLM(
model=MODEL_PATH,
tensor_parallel_size=2,

@@ -123,7 +123,7 @@ def main():
temperature=0.6,
top_p=0.95,
top_k=20,
max_completion_tokens=16384,
max_tokens=16384,
)

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

@@ -176,6 +176,11 @@ if __name__ == "__main__":
Run the following script to start the vLLM server on Multi-NPU:
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 1, and for 32 GB of memory, tensor-parallel-size should be at least 2.

```bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_expert_parallel
```

@@ -40,7 +40,7 @@ vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling
Once your server is started, you can query the model with input prompts.

```bash
curl http://127.0.0.1:8000/v1/embeddings -H "Content-Type: application/json" -d '{
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."

@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen3.5-27B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) nodes or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://huggingface.co/Qwen/Qwen3.5-27B/tree/main)
- `Qwen3.5-27B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp/files)
- `Qwen3.5-27B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) nodes or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3.5-27B)
- `Qwen3.5-27B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

@@ -145,7 +145,7 @@ The parameters are explained as follows:
Once your server is started, you can query the model with input prompts:

```shell
curl http://<node0_ip>:<port>/v1/completions \
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5",

@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main)
- `Qwen3.5-397B-A17B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp/files)
- `Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)
- `Qwen3.5-397B-A17B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

@@ -41,7 +41,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --runner pooling --host 127.0.0.1 --port 8888
Once your server is started, you can query the model with input prompts.

```bash
curl http://127.0.0.1:8888/v1/embeddings -H "Content-Type: application/json" -d '{
curl http://localhost:8888/v1/embeddings -H "Content-Type: application/json" -d '{
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."

@@ -111,7 +111,7 @@ model_name = "Qwen/Qwen3-Reranker-8B"

model = LLM(
model=model_name,
task="score",
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],

@@ -154,7 +154,7 @@ if __name__ == "__main__":

outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)

print([output.outputs[0].score for output in outputs])
print([output.outputs.score for output in outputs])
```

If you run this script successfully, you will see a list of scores printed to the console, similar to this: