[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701)

### What this PR does / why we need it?
This PR fixes various documentation issues and improves code examples
throughout the project.

Signed-off-by: MrZ20 <2609716663@qq.com>
Authored by SILONG ZENG on 2026-04-28 09:01:25 +08:00, committed by GitHub.
Parent: 9a0b786f2b
Commit: 2e2aaa2fae
38 changed files with 205 additions and 188 deletions


@@ -32,7 +32,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
- Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
-- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
+- support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
### Dependencies


@@ -50,6 +50,8 @@ All environment variables must be defined in `vllm_ascend/envs.py` using the cen
**Example:**
```python
+import os
env_variables = {
    "VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)),
    # ...
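As a quick illustration of how a variable declared this way is consumed, the sketch below sets it from the shell before launching vLLM. This is a hedged example: the 0/1 integer convention is inferred from the `int(os.getenv(...))` lambda above, and the launch command is only a placeholder.

```bash
# Hedged sketch: export the variable before starting vLLM so the lambda in
# vllm_ascend/envs.py picks it up (0 disables, 1 is the default shown above).
export VLLM_ASCEND_ENABLE_NZ=0
vllm serve Qwen/Qwen2.5-0.5B-Instruct   # placeholder model; any serve command works
```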


@@ -18,7 +18,7 @@ enable_custom_op()
- Create a new operation folder under `csrc` directory.
- Create `op_host` and `op_kernel` directories for host and kernel source code.
-- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`.
+- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS="op1;op2;op3"`.
- Bind aclnn operators to torch.ops._C_ascend module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op to be captured into the aclgraph.
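To make the quoting change above concrete, here is a hedged sketch of the build-option line inside `csrc/build_aclnn.sh`. The op names and any surrounding script logic are assumptions; only the quoted, `;`-separated `CUSTOM_OPS` form comes from the step above.

```bash
# Hedged sketch: quote the value so the shell does not treat ';' as a command separator.
CUSTOM_OPS="my_op_a;my_op_b;my_op_c"   # op names are placeholders
```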


@@ -35,38 +35,36 @@ From the workflow perspective, we can see how the final test script is executed,
npu_per_node: 16
# All env vars you need should add it here
env_common:
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080
disaggregated_prefill:
  enabled: true
  # node index(a list) which meet all the conditions:
  # - prefiller
  # - no headless(have api server)
  prefiller_host_index: [0]
  # node index(a list) which meet all the conditions:
  # - decoder
  decoder_host_index: [1]
# Add each node's vllm serve cli command just like you run locally
# Add each node's individual envs like follow
deployment:
-  -
-    envs:
+  - envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
-  -
-    envs:
+  - envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
benchmarks:
  perf:
    # fill with performance test kwargs
  acc:
    # fill with accuracy test kwargs
```
@@ -74,38 +72,38 @@ From the workflow perspective, we can see how the final test script is executed,
Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)
```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node
          config_file_path: Qwen3-235B-W8A8.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node-eplb
          config_file_path: Qwen3-235B-W8A8-EPLB.yaml
          size: 2
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
The matrix above defines all the parameters required to add a multi-machine use case. The parameters worth noting (if you are adding a new use case) are `size` and the path to the yaml configuration file. The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2.
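Once a new matrix entry is merged, a run does not have to wait for the nightly schedule: the `if:` condition above also accepts `workflow_dispatch`. A hedged sketch with the GitHub CLI, assuming the workflow really declares `workflow_dispatch` as that condition suggests and that you have permission to dispatch it:

```bash
# Hedged sketch: manually dispatch the nightly multi-node workflow on main.
gh workflow run schedule_nightly_test_a3.yaml --ref main
```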
@@ -125,130 +123,130 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: test-server
  namespace: vllm-project
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: None
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
                cpu: 125
            ports:
              - containerPort: 8080
            # readinessProbe:
            #   tcpSocket:
            #     port: 8080
            #   initialDelaySeconds: 15
            #   periodSeconds: 10
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                ephemeral-storage: 100Gi
                cpu: 125
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
  namespace: vllm-project
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
```bash


@@ -40,6 +40,7 @@ export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirror
# src path
export SRC_WORKSPACE=/vllm-workspace
mkdir -p $SRC_WORKSPACE
+cd $SRC_WORKSPACE
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2


@@ -38,11 +38,11 @@ Run the vLLM server in the docker.
```{code-block} bash
:substitutions:
-vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
+vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 35000 &
```
:::{note}
-`--max_model_len` should be greater than `35000`, this will be suitable for most datasets. Otherwise the accuracy evaluation may be affected.
+`--max-model-len` should be greater than `35000`, this will be suitable for most datasets. Otherwise the accuracy evaluation may be affected.
:::
The vLLM server is started successfully, if you see logs as below:


@@ -29,7 +29,7 @@ docker run --rm \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
-    vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+    vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```
If the vLLM server is started successfully, you can see information shown below:


@@ -32,7 +32,7 @@ docker run --rm \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    /bin/bash
-vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
+vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 &
```
The vLLM server is started successfully, if you see logs as below:
@@ -48,28 +48,36 @@ INFO: Application startup complete.
You can query the result with input prompts:
```shell
+PROMPT='<|im_start|>system
+You are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>
+<|im_start|>user
+Question: A company'"'"'s balance sheet as of December 31, 2023 shows:
+ Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan
+ Non-current assets: Net fixed assets 12 million yuan
+ Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan
+ Non-current liabilities: Long-term loans 9 million yuan
+ Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ?
+Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places).
+Options:
+A. Asset-Liability Ratio=58.33%, Current Ratio=1.90
+B. Asset-Liability Ratio=62.50%, Current Ratio=2.17
+C. Asset-Liability Ratio=65.22%, Current Ratio=1.75
+D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>
+<|im_start|>assistant
+'
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
-    -d '{
-        "model": "Qwen/Qwen2.5-0.5B-Instruct",
-        "prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\
-        "<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
-        " Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
-        " Non-current assets: Net fixed assets 12 million yuan\n"\
-        " Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
-        " Non-current liabilities: Long-term loans 9 million yuan\n"\
-        " Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
-        "Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
-        "Options:\n"\
-        "A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
-        "B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
-        "C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
-        "D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
-        "<|im_start|>assistant\n"'",
-        "max_completion_tokens": 1,
-        "temperature": 0,
-        "stop": ["<|im_end|>"]
-    }' | python3 -m json.tool
+    -d "$(jq -n \
+        --arg model "Qwen/Qwen2.5-0.5B-Instruct" \
+        --arg prompt "$PROMPT" \
+        '{
+            model: $model,
+            prompt: $prompt,
+            max_completion_tokens: 1,
+            temperature: 0,
+            stop: ["<|im_end|>"]
+        }')" | python3 -m json.tool
```
The output format matches the following:


@@ -29,7 +29,7 @@ docker run --rm \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
-    vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+    vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```
The vLLM server is started successfully, if you see information as below:


@@ -158,6 +158,12 @@ Scheduling optimization:
:substitutions:
# Optimize operator delivery queue. This will affect the memory peak value, and may degrade if the memory is tight.
export TASK_QUEUE_ENABLE=2
+```
+or
+```{code-block} bash
+:substitutions:
# This will greatly improve the CPU bottleneck model and ensure the same performance for the NPU bottleneck model.
export CPU_AFFINITY_CONF=1


@@ -223,7 +223,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
```shell
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve \
    --model Qwen/Qwen3-Embedding-8B \
    --backend openai-embeddings \


@@ -284,7 +284,7 @@ python example.py
If you encounter a connection error with Hugging Face (e.g., `We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.`), run the following commands to use ModelScope as an alternative:
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
pip install modelscope
python example.py
```


@@ -12,6 +12,8 @@
## Setup environment using container
+Before using containers, make sure Docker is installed on your system. If Docker is not installed, please refer to the [Docker installation guide](https://docs.docker.com/get-docker/) for installation instructions.
:::::{tab-set}
::::{tab-item} Ubuntu
@@ -91,7 +93,7 @@ You can use ModelScope mirror to speed up download:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
```
There are two ways to start vLLM on Ascend NPU:


@@ -559,7 +559,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompts 20 --request-rate 0 --save-result --result-dir ./
```


@@ -72,7 +72,7 @@ Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -166,7 +166,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompts 1 --request-rate 1 --save-result --result-dir ./
```


@@ -96,7 +96,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --served_model_name qwen --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
-    --quantization ascend --max_model_len 16384
+    --quantization ascend --max-model-len 16384
# `--load_format` is required only for the W8A8SC quantized weight format.
#
```
@@ -134,7 +134,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --enforce-eager \
    --dtype float16 \
    --quantization ascend \
-    --max_model_len 10240
+    --max-model-len 10240
```
Argument notes: `--tensor-parallel-size`: `W8A8SC` quantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression. `--model` is the path to the input `w8a8s` weights, and `--output` is the output path for the compressed `w8a8sc` weights.
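For readers who want those three arguments in one place, here is a hedged sketch of a compression invocation. The entry point name `compress_w8a8sc.py` is purely a hypothetical placeholder (use the actual command from the preceding step); only the flags and their relationship come from the note above.

```bash
# Hedged sketch only; "compress_w8a8sc.py" is a hypothetical placeholder for the real compression command.
python compress_w8a8sc.py \
    --model /path/to/qwen3-w8a8s \          # input w8a8s weights (example path)
    --output /path/to/qwen3-w8a8sc-tp4 \    # output path for the compressed w8a8sc weights
    --tensor-parallel-size 4                # must match the TP size used at serving time
```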
@@ -159,7 +159,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
    --quantization ascend \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --load_format="sharded_state"
```
@@ -178,7 +178,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
    --quantization ascend \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --load_format="sharded_state"
```
@@ -199,7 +199,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
    --quantization ascend \
-    --max_model_len 20480 \
+    --max-model-len 20480 \
    --no-enable-prefix-caching \
    --load_format="sharded_state"
```


@@ -302,7 +302,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -943,7 +943,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -93,7 +93,7 @@ vllm serve /root/.cache/DeepSeek-OCR-2 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --port 1055 \
-    --max_model_len 8192 \
+    --max-model-len 8192 \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.8 \
    --allowed-local-media-path / \


@@ -784,7 +784,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Eco-Tech/Kimi-K2.5-w4a8 --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -72,7 +72,7 @@ Run the following script to start the vLLM server on single 910B4:
```shell
#!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
@@ -97,11 +97,11 @@ Run the following script to start the vLLM server on single Atlas 300 inference
```shell
#!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
vllm serve ${MODEL_PATH} \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --served-model-name PaddleOCR-VL-0.9B \
    --trust-remote-code \
    --no-enable-prefix-caching \
@@ -112,7 +112,7 @@ vllm serve ${MODEL_PATH} \
```
:::{note}
-The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
+The `--max-model-len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
:::
::::


@@ -323,12 +323,12 @@ Run docker container to start the vLLM server on single-NPU:
:substitutions:
vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --dtype bfloat16 \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --max-num-batched-tokens 16384
```
:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
+Add `--max-model-len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
:::
If your service start successfully, you can see the info shown below:
@@ -415,7 +415,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
```
:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
+Add `--max-model-len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
:::
If your service start successfully, you can see the info shown below:


@@ -74,7 +74,7 @@ The environment variable `LOCAL_MEDIA_PATH` which allows API requests to read lo
:::
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
@@ -104,7 +104,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
#### Multiple NPU (Qwen2.5-Omni-7B)
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
export DP_SIZE=8


@@ -95,7 +95,7 @@ Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -157,7 +157,7 @@ Node 0
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -199,7 +199,7 @@ Node1
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -309,7 +309,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
@@ -335,7 +335,7 @@ Example server scripts:
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -408,7 +408,7 @@ export TP_SOCKET_IFNAME=${ifname}
export HCCL_SOCKET_IFNAME=${ifname}
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -470,7 +470,7 @@ export TP_SOCKET_IFNAME=${ifname}
export HCCL_SOCKET_IFNAME=${ifname}
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1024
@@ -534,7 +534,7 @@ export TP_SOCKET_IFNAME=${ifname}
export HCCL_SOCKET_IFNAME=${ifname}
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1024


@@ -93,7 +93,7 @@ The converted model files look like:
Run the following script to start the vLLM server with the quantized model:
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH=vllm-ascend/Qwen3-8B-W4A8
vllm serve ${MODEL_PATH} --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```


@@ -64,7 +64,7 @@ For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at
```shell
#!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --served-model-name qwen3-coder --tensor-parallel-size 4 --enable_expert_parallel
```


@@ -163,7 +163,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -94,7 +94,7 @@ Node 0
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -137,7 +137,7 @@ Node1
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this is obtained through ifconfig
@@ -269,7 +269,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -94,7 +94,7 @@ Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -190,7 +190,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Eco-Tech/Qwen3.5-27B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -94,7 +94,7 @@ Run the following script to execute online 128k inference On 1 Atlas 800 A3(64G*
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
@@ -157,7 +157,7 @@ Node 0
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -203,7 +203,7 @@ Node1
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -595,7 +595,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -38,9 +38,9 @@ So far, dynamic batch performs better on several dense models including Qwen and
Dynamic batch is used in the online inference. A fully executable example is as follows:
```shell
-SLO_LITMIT=50
+SLO_LIMIT=50
vllm serve Qwen/Qwen2.5-14B-Instruct\
-    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LITMIT}'}' \
+    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LIMIT}'}' \
    --max-num-seqs 256 \
    --block-size 128 \
    --tensor_parallel_size 8 \


@@ -54,25 +54,25 @@ To enable Netloader, pass `--load-format=netloader` and provide configuration vi
### Server
```shell
-VLLM_SLEEP_WHEN_IDLE=1 vllm serve `<model_file>` \
+VLLM_SLEEP_WHEN_IDLE=1 vllm serve <model_file> \
    --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
    --enforce-eager \
-    --port `<port>` \
+    --port <port> \
    --load-format netloader
```
### Client
```shell
-export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["`<server_IP>`:`<server_Port>`"]}]}'
-VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=`<device_id_diff_from_server>` \
-vllm serve `<model_file>` \
+export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["<server_IP>:<server_Port>"]}]}'
+VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=<device_id_diff_from_server> \
+vllm serve <model_file> \
    --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
    --enforce-eager \
-    --port `<client_port>` \
+    --port <client_port> \
    --load-format netloader \
    --model-loader-extra-config="${NETLOADER_CONFIG}"
```


@@ -80,7 +80,7 @@ A simple planner implementation is provided at [`rfork_planner.py`](../../../../
```shell
python rfork_planner.py \
    --host 0.0.0.0 \
-    --port `<planner_port>`
+    --port <planner_port>
```
### 3. Start vLLM Instances
@@ -93,15 +93,15 @@ For later instances, if the planner can allocate a compatible seed, RFork will t
```shell
export RFORK_CONFIG='{
-  "model_url": "`<model_url>`",
-  "model_deploy_strategy_name": "`<deploy_strategy>`",
-  "rfork_scheduler_url": "http://`<planner_ip>`:`<planner_port>`"
+  "model_url": "<model_url>",
+  "model_deploy_strategy_name": "<deploy_strategy>",
+  "rfork_scheduler_url": "http://<planner_ip>:<planner_port>"
}'
-vllm serve `<model_path>` \
+vllm serve <model_path> \
    --tensor-parallel-size 1 \
-    --served-model-name `<served_model_name>` \
-    --port `<port>` \
+    --served-model-name <served_model_name> \
+    --port <port> \
    --load-format rfork \
    --model-loader-extra-config "${RFORK_CONFIG}"
```


@@ -381,7 +381,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
- Fix multimodal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
-- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
+- support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
### Dependencies


@@ -7,7 +7,7 @@ export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export ASCEND_LAUNCH_BLOCKING=0


@@ -80,7 +80,7 @@ async def test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb():
    port = get_open_port()
    compilation_config = json.dumps({"cudagraph_capture_sizes": [8]})
    server_args = [
-        "--max_model_len",
+        "--max-model-len",
        "8192",
        "--tensor_parallel_size",
        "2",


@@ -239,7 +239,7 @@ test_cases:
    <<: *envs
    server_cmd: *server_cmd
    benchmarks:
-      <<: *benchmarks_acc
+      <<: *benchmarks
```
#### EPD / Disaggregated Case


@@ -21,7 +21,7 @@ set -eo errexit
. $(dirname "$0")/common.sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_HUB_FILE_LOCK=false
export HF_HUB_OFFLINE=1