[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
This commit is contained in:
herizhen
2026-04-09 15:37:57 +08:00
committed by GitHub
parent c40a387f63
commit 0d1424d81a
71 changed files with 1295 additions and 1296 deletions

View File

@@ -122,7 +122,7 @@ The parameters are explained as follows:
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT with the v1 scheduler, although TTFT may degrade in some scenarios. Enabling this feature is not recommended when prefill and decode (PD) are disaggregated.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
- `--max-model-len` specifies the maximum context length, that is, the sum of input and output tokens for a single request. For performance testing with an input length of 3.5K and an output length of 1.5K, a value of `16384` is sufficient; for precision testing, set it to at least `35000`.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
- If you use the w4a8 weights, more memory can be allocated to the KV cache, so you can raise the request concurrency to achieve greater throughput.
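As a minimal sketch of how these options combine, a single-node launch with the recommended `dp4tp4` layout might look like the following; the values are illustrative and should be adjusted to your workload:

```shell
# Illustrative single-node launch combining the options discussed above (dp4tp4 layout).
export VLLM_ASCEND_BALANCE_SCHEDULING=1   # optional: balance scheduling (may improve TPOT, may hurt TTFT)
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
    --data-parallel-size 4 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --quantization ascend \
    --no-enable-prefix-caching
```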
@@ -181,7 +181,7 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
```shell
#!/bin/sh
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
@@ -258,10 +258,10 @@ Here are two accuracy evaluation methods.
2. After execution, you can get the result. Here is the result of `DeepSeek-R1-W8A8` on `vllm-ascend:0.11.0rc2`, for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024dataset | - | accuracy | gen | 80.00 |
| gpqadataset | - | accuracy | gen | 72.22 |
### Using Language Model Evaluation Harness
@@ -271,13 +271,13 @@ As an example, take the `gsm8k` dataset as a test dataset, and run accuracy eval
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=path/DeepSeek-R1-W8A8,base_url=http://<node0_ip>:<port>/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the result.
@@ -291,7 +291,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `DeepSeek-R1-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
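As a minimal sketch of one of them, a `serve` benchmark run against an already-running endpoint might look like the following; the model name and flag values are illustrative, so check the vLLM benchmarking documentation for the full option list:

```shell
# Illustrative `vllm bench serve` invocation (values are placeholders).
vllm bench serve \
    --model vllm-ascend/DeepSeek-R1-W8A8 \
    --port 8000 \
    --endpoint /v1/completions \
    --max-concurrency 1 \
    --request-rate 1
```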

View File

@@ -27,7 +27,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `DeepSeek-V3.1` (BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8-mtp-QuaRot` (Quantized version with mixed quantization and MTP): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot).
- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot` (Quantized version with mixed quantization and MTP): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- `Quantization method`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use this method to quantize the model.
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
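If you download from ModelScope, the command below is one possible way to fetch the weights into that shared directory; the `modelscope` CLI usage shown here is an assumption, so verify it with `modelscope download --help` on your system:

```shell
# Assumed ModelScope CLI usage (verify locally): download the BF16 weights into the shared cache.
modelscope download --model deepseek-ai/DeepSeek-V3.1 --local_dir /root/.cache/DeepSeek-V3.1
```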
@@ -264,391 +264,391 @@ To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to depl
2. Prefill Node 0 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.1"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--additional-config '{"recompute_scheduler_enable":true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
3. Prefill Node 1 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.2"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--additional-config '{"recompute_scheduler_enable":true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
4. Decode Node 0 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.3"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1100
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 256 \
--max-num-seqs 28 \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--quantization ascend \
--no-enable-prefix-caching \
--async-scheduling \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
5. Decode Node 1 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.4"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1100
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 256 \
--max-num-seqs 28 \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--quantization ascend \
--no-enable-prefix-caching \
--async-scheduling \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
**Notice:**
The parameters are explained as follows:
- `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`: enables the communication optimization function on the prefill nodes.
- `VLLM_ASCEND_ENABLE_MLAPO=1`: enables the fusion operator, which can significantly improve performance but consumes more NPU memory. In the Prefill-Decode (PD) separation scenario, enable MLAPO only on decode nodes.
- `--async-scheduling`: enables the asynchronous scheduling function. When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency.
- `cudagraph_capture_sizes`: the recommended values are `n x (mtp + 1)`, with `n` ranging from 1 up to `max-num-seqs` (a worked example follows this list). Beyond these, it is recommended to set the capture sizes to the batch sizes that occur most frequently on the decode (D) node.
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously.
- `multistream_overlap_shared_expert: true`: When the Tensor Parallelism (TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream is enabled to overlap the computation process of shared experts for improved efficiency.
- `lmhead_tensor_parallel_size: 16`: When the Tensor Parallelism (TP) size of the decode node is 1, this parameter allows the TP size of the LMHead embedding layer to be greater than 1, which is used to reduce the computational load of each card on the LMHead embedding layer.
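As a concrete illustration of the `cudagraph_capture_sizes` rule, the decode scripts above use `mtp = 1` (so `mtp + 1 = 2`) and `--max-num-seqs 28`, giving `n x 2` for `n` in `{1, 2, 4, 8, 12, 16, 24, 28}`:

```shell
# Capture sizes derived from n x (mtp + 1) with mtp = 1 and max-num-seqs = 28 (as used above).
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}'
```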
6. Run the server on each node:
```shell
# p0
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.1 --dp-rpc-port 12321 --vllm-start-port 7100
# p1
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.2 --dp-rpc-port 12321 --vllm-start-port 7100
# d0
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
# d1
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
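The decode arguments above can be sanity-checked with a little arithmetic; the values below are taken directly from the commands in this step:

```shell
# Decode cluster layout implied by the launch commands above:
#   global data-parallel size : 32 (--dp-size)
#   local DP ranks per node   : 16 (--dp-size-local)
#   d0 hosts ranks 0..15      (--dp-rank-start 0)
#   d1 hosts ranks 16..31     (--dp-rank-start 16)
# Both decode nodes point --dp-address at the DP master node d0 (141.xx.xx.3).
```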
7. Run the `proxy.sh` script on the prefill master node
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
python load_balance_proxy_server_example.py \
--port 1999 \
--host 141.xx.xx.1 \
--prefiller-hosts \
141.xx.xx.1 \
141.xx.xx.1 \
141.xx.xx.2 \
141.xx.xx.2 \
--prefiller-ports \
7100 7101 7100 7101 \
--decoder-hosts \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
--decoder-ports \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115
```
```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash proxy.sh
```
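Once the proxy is up, a quick smoke test against it might look like the sketch below; the prompt is illustrative, the model name matches the `--served-model-name deepseek_v3` used in the node scripts above, and the Functional Verification section below covers verification in more detail:

```shell
# Illustrative smoke test against the proxy started above (port 1999 on the prefill master node).
curl http://141.xx.xx.1:1999/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek_v3", "prompt": "Hello, my name is", "max_tokens": 32}'
```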
## Functional Verification
@@ -704,7 +704,7 @@ The performance result is:
Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -806,31 +806,31 @@ Refer to [Distributed DP Server With Large-Scale Expert Parallelism](https://doc
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
### Request Forwarding
@@ -896,13 +896,13 @@ As an example, take the `gsm8k` dataset as a test dataset, and run accuracy eval
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the result.
@@ -926,7 +926,7 @@ The performance result is:
Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -668,31 +668,31 @@ Once the preparation is done, you can start the server with the following comman
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address $node_p0_ip --dp-rpc-port 12880 --vllm-start-port 9300
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address $node_p1_ip --dp-rpc-port 12880 --vllm-start-port 9300
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address $node_d0_ip --dp-rpc-port 12778 --vllm-start-port 9300
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address $node_d0_ip --dp-rpc-port 12778 --vllm-start-port 9300
```
### Request Forwarding
@@ -722,7 +722,7 @@ python load_balance_proxy_server_example.py \
$node_d1_ip \
--decoder-ports \
9300 9301 9302 9303 \
9300 9301 9302 9303
```
## Functional Verification
@@ -763,7 +763,7 @@ Here are two accuracy evaluation methods.
### Using Language Model Evaluation Harness
Not tested yet.
## Performance
@@ -775,7 +775,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `GLM-4.x` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
@@ -802,7 +802,7 @@ vllm bench serve \
--port 8000 \
--endpoint /v1/completions \
--max-concurrency 1 \
--request-rate 1
```
After several minutes, you can get the performance evaluation result.

View File

@@ -1351,7 +1351,7 @@ Here are two accuracy evaluation methods.
### Using Language Model Evaluation Harness
Not tested yet.
## Performance

View File

@@ -103,7 +103,7 @@ Your model files look like:
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
|-- tokenizer_config.json
```
## Online Inference on Multi-NPU

View File

@@ -215,46 +215,46 @@ The 910B4 device supports inference using the PaddlePaddle framework.
1. Pull the PaddlePaddle-compatible CANN image
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-aarch64-gcc84
```
Start the container using the following command:
```bash
docker run -it --name paddle-npu-dev -v $(pwd):/work \
--privileged --network=host --shm-size=128G -w=/work \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/dcmi:/usr/local/dcmi \
-e ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-$(uname -m)-gcc84 /bin/bash
```
2. Install [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
```bash
python -m pip install paddlepaddle==3.2.0
wget https://paddle-whl.bj.bcebos.com/stable/npu/paddle-custom-npu/paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
pip install paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
python -m pip install -U "paddleocr[doc-parser]"
pip install safetensors
```
:::{note}
The OpenCV component may be missing:
```bash
apt-get update
apt-get install -y libgl1 libglib2.0-0
```
CANN-8.0.0 does not support some versions of NumPy and OpenCV. It is recommended to install the specified versions.
```bash
python -m pip install numpy==1.26.4
python -m pip install opencv-python==3.4.18.65
```
::::
::::{tab-item} OM inference

View File

@@ -328,7 +328,7 @@ vllm serve Qwen/Qwen3-VL-8B-Instruct \
```
:::{note}
Add the `--max_model_len` option to avoid a ValueError reporting that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in the KV cache. The limit differs across NPU series based on the HBM size, so set a value that is suitable for your NPU.
:::
If your service starts successfully, you can see the info shown below:
@@ -474,8 +474,6 @@ The accuracy of some models is already within our CI monitoring scope, including
- `Qwen2.5-VL-7B-Instruct`
- `Qwen3-VL-8B-Instruct`
You can refer to the [monitoring configuration](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test_nightly_a2.yaml).
:::::{tab-set}
:sync-group: install
@@ -486,28 +484,28 @@ As an example, take the `mmmu_val` dataset as a test dataset, and run accuracy e
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more details on `lm_eval` installation.
```shell
pip install lm_eval
```
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model vllm-vlm \
--model_args pretrained=Qwen/Qwen3-VL-8B-Instruct,max_model_len=8192,gpu_memory_utilization=0.7 \
--tasks mmmu_val \
--batch_size 32 \
--apply_chat_template \
--trust_remote_code \
--output_path ./results
```
3. After execution, you can get the result. Here is the result of `Qwen3-VL-8B-Instruct` on `vllm-ascend:0.11.0rc3`, for reference only.
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val | 0|none | |acc |↑ |0.5389|± |0.0159|
::::
::::{tab-item} Qwen2.5-VL-32B-Instruct
@@ -517,27 +515,27 @@ As an example, take the `mmmu_val` dataset as a test dataset, and run accuracy e
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more details on `lm_eval` installation.
```shell
pip install lm_eval
```
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model vllm-vlm \
--model_args pretrained=Qwen/Qwen2.5-VL-32B-Instruct,max_model_len=8192,tensor_parallel_size=2 \
--tasks mmmu_val \
--apply_chat_template \
--trust_remote_code \
--output_path ./results
```
3. After execution, you can get the result. Here is the result of `Qwen2.5-VL-32B-Instruct` on `vllm-ascend:0.11.0rc3`, for reference only.
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val | 0|none | |acc |↑ |0.5744|± |0.0158|
::::
:::::
@@ -546,7 +544,7 @@ lm_eval \
### Using vLLM Benchmark
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -99,7 +99,7 @@ Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 pla
```shell
#!/bin/sh
export ASCEND_RT_VISIBLE_DEVICES=0
export MODEL_PATH="./Qwen2.5-7B-Instruct"
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
@@ -156,7 +156,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen2.5-7B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
@@ -177,4 +177,4 @@ vllm bench serve \
--result-dir ./perf_results/
```
After several minutes, you can get the performance evaluation result.

View File

@@ -69,7 +69,7 @@ docker run --rm \
#### Single NPU (Qwen2.5-Omni-7B)
:::{note}
The environment variable `LOCAL_MEDIA_PATH` allows API requests to read local images or videos from directories specified by the server file system. Note that this is a security risk and should only be enabled in trusted environments.
:::
@@ -99,7 +99,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
`--allowed-local-media-path` is optional; only set it if you need to run inference on local media files.
`--gpu-memory-utilization` should not be set manually unless you know what this parameter does.
#### Multiple NPU (Qwen2.5-Omni-7B)
@@ -128,7 +128,7 @@ Not supported yet.
## Functional Verification
If your service starts successfully, you can see the info shown below:
```bash
INFO: Started server process [2736]
@@ -195,7 +195,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen2.5-Omni-7B` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3-235B-A22B` (BF16 version): requires 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node, or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)
- `Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node, or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -191,7 +191,7 @@ vllm serve Qwen/Qwen3-235B-A22B \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9
```
Node1
@@ -298,7 +298,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
@@ -440,7 +440,6 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",

View File

@@ -8,7 +8,7 @@ Welcome to the tutorial on optimizing Qwen Dense models in the vLLM-Ascend envir
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, accuracy and performance evaluation.
The Qwen3 Dense models are first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429). This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
## Supported Features
@@ -288,7 +288,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-32B-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -152,7 +152,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-Next` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -16,7 +16,7 @@ Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user
### Model Weight
- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64G × 2). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
### Installation
@@ -239,32 +239,32 @@ As an example, take the `gsm8k` `omnibench` `bbh` dataset as a test dataset, and
1. Refer to [Using evalscope](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope` installation.
2. Run `evalscope` to execute the accuracy evaluation.
```bash
evalscope eval \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
--api-url http://localhost:8000/v1 \
--api-key EMPTY \
--eval-type server \
--datasets omni_bench, gsm8k, bbh \
--dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
--eval-batch-size 1 \
--generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
--limit 100
```
3. After execution, you can get the result. Here is the result of `Qwen3-Omni-30B-A3B-Thinking` on `vllm-ascend:0.13.0rc1`, for reference only.
```bash
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=============================+============+==========+==========+=======+=========+=========+
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default | 100 | 0.44 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | gsm8k | mean_acc | main | 100 | 0.98 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | bbh | mean_acc | OVERALL | 270 | 0.9148 | |
+-----------------------------+------------+----------+----------+-------+---------+---------+
```
## Performance
@@ -272,7 +272,7 @@ evalscope eval \
Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3-VL-235B-A22B-Instruct` (BF16 version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-235B-A22B-Instruct/)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
@@ -140,7 +140,7 @@ Node1
export VLLM_USE_MODELSCOPE=true
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
@@ -258,7 +258,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-VL-235B-A22B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -204,4 +204,4 @@ INFO 12-24 09:49:22 [loggers.py:257] Engine 000: Avg prompt throughput: 19.6 tok
### Offline Inference
The usage of offline inference with `Qwen3-VL-30B-A3B-Instruct` is the same as that of `Qwen3-VL-8B-Instruct`; find more details at [Qwen3-VL-8B-Instruct](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/models/Qwen-VL-Dense.html#offline-inference).

View File

@@ -212,7 +212,7 @@ Processed prompts: 100%|██████████████████
For more examples, refer to the vLLM official examples:
- [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_rerank_api_online.py)
## Performance

View File

@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3.5-27B`(BF16 version): requires 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3.5-27B)
- `Qwen3.5-27B-w8a8`(Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -87,7 +87,7 @@ If you want to deploy multi-node environment, you need to set up environment on
### Single-node Deployment
`Qwen3.5-27B` and `Qwen3.5-27B-w8a8` can both be deployed on 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. The quantized version needs to be started with the `--quantization ascend` parameter.
Run the following script to execute online 128k inference.