[v0.11.0][Doc] Update doc (#3852)

### What this PR does / why we need it?
Update doc

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-29 11:32:12 +08:00
parent 6188450269
commit 75de3fa172
49 changed files with 724 additions and 701 deletions
--- a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
+++ b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
@@ -2,7 +2,7 @@

 ## Overview

-Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Tokens Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
+Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.

 ## EPLB Effects

@@ -16,7 +16,7 @@ Expert balancing for MoE models in LLM serving is essential for optimal performa

 ### Dynamic EPLB

-We need to add environment variable `export DYNAMIC_EPLB="true"` to enable vllm eplb. Enable dynamic balancing with auto-tuned parameters. Adjust num_iterations_eplb_update and num_wait_worker_iterations based on workload patterns.
+Set the environment variable with `export PYTHONOPTIMIZE=1` to get the context of the vllm process. Then enable dynamic balancing with auto-tuned parameters, and adjust `num_iterations_eplb_update` and `num_wait_worker_iterations` based on workload patterns.

 ```shell
 vllm serve Qwen/Qwen3-235B-A22 \
@@ -32,7 +32,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
 ### Static EPLB
 #### Initial Setup (Record Expert Map)

-We need to add environment variable `export EXPERT_MAP_RECORD="true"` to record expert map.Generate the initial expert distribution map using expert_map_record_path. This creates a baseline configuration for future deployments.
+Generate the initial expert distribution map using expert_map_record_path. This creates a baseline configuration for future deployments.

 ```shell
 vllm serve Qwen/Qwen3-235B-A22 \
@@ -61,16 +61,16 @@ vllm serve Qwen/Qwen3-235B-A22 \
 ## Critical Considerations
 1. Parameter Tuning:
   - num_iterations_eplb_update: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
-   - num_wait_worker_iterations: Should be ≥30 to avoid premature balancing during startup.
+   - num_wait_worker_iterations: Should be ≥ 30 to avoid premature balancing during startup.
   - init_redundancy_expert: Must match tensor-parallel size (e.g., 16 for 16 GPUs) to ensure sufficient redundancy.

 2. Hardware Requirements:
-   - Ensure all GPUs have identical memory capacity and compute capabilities.
-   - Network bandwidth must support expert redistribution traffic (≥10Gbps recommended).
+   - Ensure that all GPUs have identical memory capacity and compute capabilities.
+   - Network bandwidth must support expert redistribution traffic (≥ 10 Gbps recommended).

 3. Model Compatibility:
   - Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22) are compatible.
-   - Verify model architecture supports dynamic expert routing via --enable-expert-parallel.
+   - Verify that the model architecture supports dynamic expert routing through `--enable-expert-parallel`.

 4. Gating Configuration:
   - When gate_eplb=true, validate that the gating mechanism can handle expert movement without routing errors.
@@ -83,7 +83,7 @@ vllm serve Qwen/Qwen3-235B-A22 \

 6. Startup Behavior:
   - Initial requests may experience higher latency during the first balancing cycle (typically 1-2 minutes).
-   - Avoid sudden traffic spikes during warm-up phase.
+   - Avoid sudden traffic spikes during the warm-up phase.

 7. Common Pitfalls:
   - Incorrect tensor-parallel-size vs. actual GPU count → causes resource underutilization.
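
The parameter-tuning guidance above can be turned into a quick pre-launch check. The following is a minimal sketch, not part of vllm-ascend: `check_eplb_settings` is a hypothetical helper, and it only encodes the two hard recommendations stated in the list (`num_wait_worker_iterations` ≥ 30, and `init_redundancy_expert` matching the tensor-parallel size).

```python
def check_eplb_settings(num_wait_worker_iterations, init_redundancy_expert,
                        tensor_parallel_size):
    """Return a list of warnings for settings that contradict the guidance above.

    Hypothetical helper for illustration only; it is not a vllm-ascend API.
    """
    warnings = []
    # Startup guidance: wait at least 30 iterations before the first balancing.
    if num_wait_worker_iterations < 30:
        warnings.append("num_wait_worker_iterations should be >= 30")
    # Redundancy guidance: init_redundancy_expert must match the tensor-parallel size.
    if init_redundancy_expert != tensor_parallel_size:
        warnings.append("init_redundancy_expert must match tensor-parallel size")
    return warnings


# Example: a 16-device deployment that follows the guidance produces no warnings.
print(check_eplb_settings(30, 16, 16))
# Example: a too-short wait and mismatched redundancy are both reported.
print(check_eplb_settings(10, 8, 16))
```

Running such a check before `vllm serve` is only a convenience; the values themselves still need to be tuned against the actual workload.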
--- a/docs/source/user_guide/feature_guide/graph_mode.md
+++ b/docs/source/user_guide/feature_guide/graph_mode.md
@@ -4,11 +4,11 @@
 This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance improvement.
 ```

-This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on V1 Engine. And only Qwen, DeepSeek series models are well tested from 0.9.0rc1. We'll make it stable and generalize in the next release.
+This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Note that graph mode is only available on the V1 Engine, and only Qwen and DeepSeek series models have been well tested since 0.9.0rc1. We will make it stable and generalized in the next release.

 ## Getting Started

-From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to eager mode temporarily by set `enforce_eager=True` when initializing the model.
+From v0.9.1rc1 with V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and temporarily fall back to the eager mode by setting `enforce_eager=True` when initializing the model.

 There are two kinds of graph mode supported by vLLM Ascend:
 - **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
@@ -17,7 +17,7 @@ There are two kinds of graph mode supported by vLLM Ascend:
 ## Using ACLGraph
 ACLGraph is enabled by default. Taking Qwen series models as an example, using the V1 Engine is enough.

-offline example:
+Offline example:

 ```python
 import os
@@ -28,7 +28,7 @@ model = LLM(model="Qwen/Qwen2-7B-Instruct")
 outputs = model.generate("Hello, how are you?")
 ```

-online example:
+Online example:

 ```shell
 vllm serve Qwen/Qwen2-7B-Instruct
@@ -36,9 +36,9 @@ vllm serve Qwen/Qwen2-7B-Instruct

 ## Using TorchAirGraph

-If you want to run DeepSeek series models with graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional config is required.
+If you want to run DeepSeek series models with the graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional configuration is required.

-offline example:
+Offline example:

 ```python
 import os
@@ -49,19 +49,19 @@ model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_g
 outputs = model.generate("Hello, how are you?")
 ```

-online example:
+Online example:

 ```shell
 vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true}, "ascend_scheduler_config": {"enabled": true}}'
 ```
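
The value passed to `--additional-config` must be valid JSON, and hand-written JSON is easy to get wrong (for example, a trailing comma inside an object is rejected by strict parsers). As a minimal sketch, the argument can be generated from a Python dict instead; the keys below are the ones used in this guide, and the model name is only a placeholder taken from the example above.

```python
import json
import shlex

# Same options as in the online example, expressed as a Python dict.
additional_config = {
    "torchair_graph_config": {"enabled": True},
    "ascend_scheduler_config": {"enabled": True},
}

# json.dumps always emits strictly valid JSON (no trailing commas).
config_arg = json.dumps(additional_config)

# shlex.quote wraps the JSON so the shell passes it through as one argument.
command = "vllm serve Qwen/Qwen2-7B-Instruct --additional-config=" + shlex.quote(config_arg)
print(command)
```

The printed command can be pasted into a shell as-is.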

-You can find more detail about additional config [here](../configuration/additional_config.md).
+You can find more details about additional configuration [here](../configuration/additional_config.md).

-## Fallback to Eager Mode
+## Fallback to the Eager Mode

-If both `ACLGraph` and `TorchAirGraph` fail to run, you should fallback to eager mode.
+If both `ACLGraph` and `TorchAirGraph` fail to run, you should fall back to the eager mode.

-offline example:
+Offline example:

 ```python
 import os
@@ -71,7 +71,7 @@ model = LLM(model="someother_model_weight", enforce_eager=True)
 outputs = model.generate("Hello, how are you?")
 ```

-online example:
+Online example:

 ```shell
 vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
--- a/docs/source/user_guide/feature_guide/lora.md
+++ b/docs/source/user_guide/feature_guide/lora.md
@@ -8,7 +8,7 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor
 You can run LoRA with ACLGraph mode now. Please refer to [Graph Mode Guide](./graph_mode.md) for better LoRA performance.

 ## Example
-We show a simple LoRA example here, which enables the ACLGraph mode as default.
+We provide a simple LoRA example here, which enables the ACLGraph mode by default.

 ```shell
 vllm serve meta-llama/Llama-2-7b \
@@ -20,4 +20,4 @@ vllm serve meta-llama/Llama-2-7b \

 We have implemented LoRA-related AscendC operators, such as bgmv_shrink, bgmv_expand, sgmv_shrink and sgmv_expand. You can find them under the "csrc/kernels" directory of [vllm-ascend repo](https://github.com/vllm-project/vllm-ascend.git).

-When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you don't want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. To require more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
+When you install vllm and vllm-ascend, the operators mentioned above are compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. For more instructions on installation and compilation, refer to the [installation guide](../../installation.md).
--- a/docs/source/user_guide/feature_guide/quantization.md
+++ b/docs/source/user_guide/feature_guide/quantization.md
@@ -2,13 +2,13 @@

 Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.

-Since 0.9.0rc2 version, quantization feature is experimentally supported in vLLM Ascend. Users can enable quantization feature by specifying `--quantization ascend`. Currently, only Qwen, DeepSeek series models are well tested. We’ll support more quantization algorithm and models in the future.
+Since version 0.9.0rc2, the quantization feature has been experimentally supported by vLLM Ascend. Users can enable the quantization feature by specifying `--quantization ascend`. Currently, only Qwen and DeepSeek series models are well tested. We will support more quantization algorithms and models in the future.

-## Install modelslim
+## Install ModelSlim

-To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
+To quantize a model, you should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.

-Install modelslim:
+Install ModelSlim:

 ```bash
 # The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
@@ -23,16 +23,16 @@ pip install accelerate
 ## Quantize model

 :::{note}
-You can choose to convert the model yourself or use the quantized model we uploaded,
-see https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
-This conversion process will require a larger CPU memory, please ensure that the RAM size is greater than 2TB
+You can choose to convert the model yourself or use the quantized model we uploaded.
+See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
+The conversion process requires a large amount of CPU memory. Ensure that the RAM size is greater than 2 TB.
 :::

-### Adapts and change
+### Adapt to changes
 1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
-2. The current version of transformers does not support loading weights in FP8 quantization format. you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization related fields from `config.json` in the weights folder
+2. The current version of transformers does not support loading weights in FP8 quantization format. You need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder.

-### Generate the w8a8 weights
+### Generate the W8A8 weights

 ```bash
 cd example/DeepSeek
@@ -63,7 +63,7 @@ Here is the full converted model files except safetensors:

 ## Run the model

-Now, you can run the quantized models with vLLM Ascend. Here is the example for online and offline inference.
+Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:

 ### Offline inference

@@ -93,26 +93,25 @@ for output in outputs:

 ### Online inference

-Enable quantization by specifying `--quantization ascend`, for more details, see DeepSeek-V3-W8A8 [tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html)
+Enable quantization by specifying `--quantization ascend`. For more details, see the [DeepSeek-V3-W8A8 Tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).

 ## FAQs

-### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?
+### 1. How to solve the KeyError "xxx.layers.0.self_attn.q_proj.weight"?

-First, make sure you specify `ascend` quantization method. Second, check if your model is converted by this `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim version. Finally, if it still doesn't work, please
-submit a issue, maybe some new models need to be adapted.
+First, make sure you specify `ascend` as the quantization method. Second, check whether your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version. Finally, if it still does not work, submit an issue, because some new models may need to be adapted.

 ### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

-Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim, this version has fixed the missing configuration_deepseek.py error.
+Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim, where the missing configuration_deepseek.py error has been fixed.

-### 3. When converting deepseek series models with modelslim, what should you pay attention?
+### 3. What should be considered when converting DeepSeek series models with ModelSlim?

-When the mla portion of the weights used `W8A8_DYNAMIC` quantization, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
+When the MLA portion of the weights uses `W8A8_DYNAMIC` quantization and the torchair graph mode is enabled, modify the configuration file in the CANN package to prevent incorrect inference results.

 The operation steps are as follows:

-1. Search in the CANN package directory used, for example:
+1. Search in the CANN package directory, for example:
 find /usr/local/Ascend/ -name fusion_config.json

 2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the `fusion_config.json` file you found. The location is as follows:
--- a/docs/source/user_guide/feature_guide/sleep_mode.md
+++ b/docs/source/user_guide/feature_guide/sleep_mode.md
@@ -8,9 +8,9 @@ Since the generation and training phases may employ different model parallelism

 ## Getting started

-With `enable_sleep_mode=True`, the way we manage memory(malloc, free) in vllm will under a specific memory pool, during loading model and initialize kv_caches, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
+With `enable_sleep_mode=True`, memory management (malloc, free) in vllm goes through a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.

-The engine(v0/v1) supports two sleep levels to manage memory during idle periods:
+The engine (v0/v1) supports two sleep levels to manage memory during idle periods:

 - Level 1 Sleep
    - Action: Offloads model weights and discards the KV cache.
@@ -20,16 +20,16 @@ The engine(v0/v1) supports two sleep levels to manage memory during idle periods

 - Level 2 Sleep
    - Action: Discards both model weights and KV cache.
-    - Memory: The content of both the model weights and kv cache is forgotten.
+    - Memory: The content of both the model weights and KV cache is forgotten.
    - Use Case: Ideal when switching to a different model or updating the current one.

-Since this feature uses the low-level API [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html), in order to use sleep mode, you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and building from source, if you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`, for the latest version(v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` will be set 1 by default while building from source.
+Since this feature uses the low-level API [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html), in order to use sleep mode, you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`. For the latest version (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` will be set to 1 by default while building from source.

 ## Usage

 The following is a simple example of how to use sleep mode.

- offline inference:
+- Offline inference:

    ```python
    import os
@@ -68,9 +68,9 @@ The following is a simple example of how to use sleep mode.
        assert output[0].outputs[0].text == output2[0].outputs[0].text
    ```

- online serving:
+- Online serving:
    :::{note}
-    Considering there may be a risk of malicious access, please make sure you are under a dev-mode, and explicit specify the develop env: `VLLM_SERVER_DEV_MODE` to expose these endpoints(sleep/wake up).
+    Because there may be a risk of malicious access, make sure you are in dev mode and explicitly set the environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
    :::

    ```bash
--- a/docs/source/user_guide/feature_guide/structured_output.md
+++ b/docs/source/user_guide/feature_guide/structured_output.md
@@ -2,34 +2,34 @@

 ## Overview

-### What is Structured Output?
+### What is structured output?

-LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON - without guidance, it might produce valid text that breaks JSON specification. **Structured Output (also called Guided Decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.
+LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON: without guidance, it might produce valid text that breaks the JSON specification. **Structured Output (also known as Guided Decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.

-In simple terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model’s output, ensuring compliance with the desired structure.
+In simple terms, structured decoding gives LLMs a "template" to follow. Users provide a schema that "influences" the model output, ensuring compliance with the desired structure.

 ![structured decoding](./images/structured_output_1.png)

-### Structured Output in vllm-ascend
+### Structured output in vllm-ascend

-Currently, vllm-ascend supports **xgrammar** and **guidance** backend for structured output with vllm v1 engine.
+Currently, vllm-ascend supports the **xgrammar** and **guidance** backends for structured output with the vLLM V1 engine.

-XGrammar introduces a new technique that batch constrained decoding via pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, and each FSM represents a context-free grammar (CFG).” One significant advantage of PDA is its recursive nature, allowing us to execute multiple state transitions. They also include additional optimisation (for those who are interested) to reduce grammar compilation overhead. Besides, you can also find more details about guidance by yourself.
+XGrammar introduces a new technique that batches constrained decoding through a pushdown automaton (PDA). You can think of a PDA as a "collection of FSMs, and each FSM represents a context-free grammar (CFG)." One significant advantage of a PDA is its recursive nature, which allows us to execute multiple state transitions. XGrammar also includes additional optimizations (for those who are interested) to reduce grammar compilation overhead. You can find more details about guidance on your own.

-## How to Use Structured Output?
+## How to use structured output?

-### Online Inference
+### Online inference

-You can also generate structured outputs using the OpenAI's Completions and Chat API. The following parameters are supported, which must be added as extra parameters:
+You can also generate structured outputs using the Completions and Chat API of OpenAI. The following parameters are supported, which must be added as extra parameters:

 - `guided_choice`: the output will be exactly one of the choices.
 - `guided_regex`: the output will follow the regex pattern.
 - `guided_json`: the output will follow the JSON schema.
 - `guided_grammar`: the output will follow the context-free grammar.

-Structured outputs are supported by default in the OpenAI-Compatible Server. You can choose to specify the backend to use by setting the `--guided-decoding-backend` flag to vllm serve. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.
+Structured outputs are supported by default in the OpenAI-compatible server. You can specify the backend by passing the `--guided-decoding-backend` flag to `vllm serve`. The default backend is `auto`, which tries to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.

-Now let´s see an example for each of the cases, starting with the guided_choice, as it´s the easiest one:
+The following are examples for each of the cases, starting with the guided_choice, as it's the easiest one:

 ```python
 from openai import OpenAI
@@ -64,7 +64,7 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```

-One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. For this we can use the guided_json parameter in two different ways:
+One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. To achieve this, we can use the guided_json parameter in two different ways:

 - Using a JSON Schema.
 - Defining a Pydantic model and then extracting the JSON Schema from it.
@@ -101,7 +101,7 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```

-Finally we have the guided_grammar option, which is probably the most difficult to use, but it´s really powerful. It allows us to define complete languages like SQL queries. It works by using a context free EBNF grammar. As an example, we can use to define a specific format of simplified SQL queries:
+Finally, we have the guided_grammar option, which is probably the most difficult to use, but it's really powerful. It allows us to define complete languages like SQL queries. It works by using a context-free EBNF grammar. As an example, we can define a specific format of simplified SQL queries:

 ```python
 simplified_sql_grammar = """
@@ -133,16 +133,16 @@ print(completion.choices[0].message.content)

 Find more examples [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).

-### Offline Inference
+### Offline inference

-To use Structured Output, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:
+To use structured output, we need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:

 - json
 - regex
 - choice
 - grammar

-One example for the usage of the choice parameter is shown below:
+One example for using the choice parameter is shown below:

 ```python
 from vllm import LLM, SamplingParams