[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
This commit is contained in:
herizhen
2026-04-09 15:37:57 +08:00
committed by GitHub
parent c40a387f63
commit 0d1424d81a
71 changed files with 1295 additions and 1296 deletions

View File

@@ -13,7 +13,7 @@ This capability supports heterogeneous parallelism strategies within a single mo
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
- **Reduced Per-Device Memory Footprint**:
-Fine-grained TP shards large weight matrices( LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
+Fine-grained TP shards large weight matrices(e.g., LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
- **Faster Memory Access in GEMMs**:
In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
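For intuition, here is a rough sizing sketch with hypothetical dimensions (vocabulary 152064, hidden size 8192, bf16 weights, TP degree 8, all illustrative placeholders): sharding the LM Head across the TP group divides its per-device footprint by the TP degree.
```bash
# Hypothetical sizes; 2 bytes per bf16 element.
VOCAB=152064; HIDDEN=8192; BYTES=2; TP=8
echo "LM Head, unsharded:  $(( VOCAB * HIDDEN * BYTES / 1024 / 1024 )) MiB per device"
echo "LM Head, TP-sharded: $(( VOCAB * HIDDEN * BYTES / TP / 1024 / 1024 )) MiB per device"
```
The same division applies to o_proj or any other matrix that fine-grained TP shards.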

View File

@@ -18,9 +18,8 @@ Batch invariance is crucial for several use cases:
## Hardware Requirements
-Batch invariance currently requires Ascend NPUs for 910B,
-because only 910B supports batch invariance with HCCL communication for now,
-we will support other NPUs in the future.
+Batch invariance currently requires Ascend 910B NPUs, because only the 910B supports batch invariance with HCCL communication for now.
+We will support other NPUs in the future.
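A minimal sketch of turning it on, assuming batch invariance is toggled through the upstream vLLM environment variable `VLLM_BATCH_INVARIANT` (model name and parallel size are placeholders; check the Software Requirements below for the supported stack):
```bash
# Assumption: the upstream vLLM toggle is used unchanged on Ascend 910B.
export VLLM_BATCH_INVARIANT=1
vllm serve Qwen/Qwen3-32B --tensor-parallel-size 4
```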
## Software Requirements

View File

@@ -11,7 +11,7 @@ Context parallel mainly solves the problem of serving long context requests. As
- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batch size (and hence the throughput).
-To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).
+To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/Design_Documents/context_parallel.md).
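As a sketch of the decode side, assuming the upstream vLLM flag `--decode-context-parallel-size` is also the entry point on Ascend (model name and sizes are placeholders; the linked guide is authoritative):
```bash
# Assumption: decode context parallel is enabled via the upstream vLLM flag.
vllm serve Qwen/Qwen3-32B \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 2
```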
## Supported Scenarios

View File

@@ -11,15 +11,16 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites
-1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
+1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should be first downloaded from [A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
2. `Pandas` is needed to load the lookup table; install it if it is not already present:
```bash
pip install pandas
```
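Putting step 1 into a command, the download and rename can be done in one line (assuming you run it from the repository root so the relative path resolves):
```bash
# Fetch the offline cost model and save it where the scheduler expects it.
wget -O vllm_ascend/core/profile_table.csv \
  https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv
```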
### Tuning Parameters
`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature. Larger values impose stricter constraints on latency, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
```python

View File

@@ -6,30 +6,28 @@ A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in
1. **Independent, fine-grained scaling**
   * Vision encoders are lightweight, while language models are orders of magnitude larger.
   * The language model can be parallelised without affecting the encoder fleet.
   * Encoder nodes can be added or removed independently.
2. **Lower time-to-first-token (TTFT)**
   * Language-only requests bypass the vision encoder entirely.
   * Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
3. **Cross-process reuse and caching of encoder outputs**
   * In-process encoders confine reuse to a single worker.
   * A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
-Design doc: <
-<https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
->
+Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
---
## Usage
The current reference pathway is **ExampleConnector**.
-Below ready-to-run scripts shows the workflow:
+The ready-to-run scripts below show the workflow:
1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`
@@ -45,7 +43,7 @@ Below ready-to-run scripts shows the workflow:
Disaggregated encoding is implemented by running two parts:
-* **Encoder instance** a vLLM instance to performs vision encoding.
+* **Encoder instance** a vLLM instance to perform vision encoding.
* **Prefill/Decode (PD) instance(s)** runs language pre-fill and decode.
* PD can be in either a single normal instance with (E + PD) or in disaggregated instances with (E + P + D)
@@ -62,11 +60,11 @@ All related code is under `vllm/distributed/ec_transfer`.
* *Multi-Path Scheduling Strategy* - dynamically diverts the multimodal request or text requests to the corresponding inference path
* *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
-We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and referred to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
-[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)
+We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
+[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
For the PD disaggregation part, when using MooncakeLayerwiseConnector: The request first enters the Decoder instance, and the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes KV Cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation.
-`docs/source/developer_guide/feature_guide/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
+`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
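For reference, a producer-side `--kv-transfer-config` sketch for the layerwise connector, assuming it accepts the same fields as the `MooncakeConnectorV1` examples shown elsewhere in these docs (the proxy example script above remains the authoritative reference):
```bash
# Assumption: MooncakeLayerwiseConnector accepts the same kv-transfer fields as MooncakeConnectorV1.
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```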
## Limitations

View File

@@ -83,7 +83,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
3. Model Compatibility:
- Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE models) are compatible.
-- Verify model architecture supports dynamic expert routing through --enable-expert-parallel.
+- Verify model architecture supports dynamic expert routing through `--enable-expert-parallel`.
4. Monitoring & Validation:
- Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.
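A minimal launch sketch tying item 3 to a command line (model name and tensor-parallel size are placeholders):
```bash
# --enable-expert-parallel is the switch referenced in item 3 above.
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 8 \
  --enable-expert-parallel
```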

View File

@@ -65,7 +65,7 @@ Online example:
vllm serve path/to/Qwen3-32B --tensor-parallel-size 8 --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'
```
-You can find more details about Xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
+You can find more details about [Xlite](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
## Fallback to the Eager Mode

View File

@@ -19,41 +19,41 @@ Taking the DeepSeek model as an example, using 8 Atlas 800T A3 servers to deploy
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
3. Get superpodid and SDID
```bash
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
::::
@@ -61,35 +61,35 @@ for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
3. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
::::
@@ -426,59 +426,59 @@ In the PD separation scenario, we provide an optimized configuration.
2. add '--enforce-eager' command to 'vllm serve'
3. Take '--kv-transfer-config' as follows:
```shell
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
4. Take '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
```
- **decoder node**
1. set HCCL_BUFFSIZE=1024
2. Take '--kv-transfer-config' as follows:
```shell
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
3. Take '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true}'
```
### Parameters Description
1. '--additional-config' Parameter Introduction:
- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
<br>
- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
<br>
2. Enable MTP
Add the following command to your configurations.
```shell
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
```
### Recommended Configuration Example

View File

@@ -56,16 +56,16 @@ Assuming your working directory is ```/workspace``` and vllm/vllm-ascend have al
1. Install LMCache Repo
```bash
NO_CUDA_EXT=1 pip install lmcache==0.3.12
```
2. Install LMCache-Ascend Repo
```bash
cd /workspace/LMCache-Ascend
python3 -m pip install -v --no-build-isolation -e .
```
### Usage

View File

@@ -32,4 +32,4 @@ vllm serve Qwen/Qwen2-7B-Instruct
--additional-config '{"ascend_compilation_config":{"enable_npugraph_ex":true, "enable_static_kernel":false}}'
```
-You can find more details about npugraph_ex [here](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)
+You can find more details about [npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)

View File

@@ -14,11 +14,11 @@ Currently, vllm-ascend has implemented Sequence Parallelism for VL-class models
```bash
vllm serve Qwen/Qwen3-VL-2B-Instruct \
--tensor-parallel-size 2 \
--compilation-config '{"pass_config": {"enable_sp": true, , "sp_min_token_num": 1000}}'
--compilation-config '{"pass_config": {"enable_sp": true , "sp_min_token_num": 1000}}'
```
- `"enable_sp"`: This is the switch for SP. Since SP relies on graph mode, it is not supported in eager mode.
-- `sp_min_token_num` (from upstream vllm's `pass_config`): Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually bring negative benefits. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. SP will only take effect when `num_tokens >= sp_min_token_num`. **The default value is 1000 on Ascend, which generally does not need to be modified.** To customize, use `--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'`. The value will be appended into `compile_ranges_split_points`, which splits the graph compilation range and checks whether the pass is applicable per range.
+- `sp_min_token_num` (from upstream vllm's `pass_config`): Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually bring negative impact. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. SP will only take effect when `num_tokens >= sp_min_token_num`. **The default value is 1000 on Ascend, which generally does not need to be modified.** To customize, use `--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'`. The value will be appended into `compile_ranges_split_points`, which splits the graph compilation range and checks whether the pass is applicable per range.
Without modifying `sp_min_token_num`, the simplest and recommended way to enable SP is:
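A minimal sketch of that form, assuming only the switch itself needs to be set and the Ascend default of 1000 for `sp_min_token_num` is kept:
```bash
# Minimal form: enable SP only; sp_min_token_num keeps its Ascend default of 1000.
vllm serve Qwen/Qwen3-VL-2B-Instruct \
  --tensor-parallel-size 2 \
  --compilation-config '{"pass_config": {"enable_sp": true}}'
```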

View File

@@ -91,7 +91,7 @@ A few important things to consider when using the EAGLE based draft models:
## Speculating using MTP speculators
-The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html)
+The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/Multi_Token_Prediction.html)
- Online inference
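A sketch of such an online launch (model path and parallel size are placeholders; the `--speculative-config` shape follows the DeepSeek MTP example used elsewhere on this page):
```bash
# Placeholder model path; MTP proposals come from the model's own MTP head.
vllm serve path/to/DeepSeek-V3 \
  --tensor-parallel-size 16 \
  --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}'
```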

View File

@@ -2,7 +2,7 @@
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution. Linear layers sometimes exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as quantize, MoE gating top_k, RMSNorm and SwiGlu. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the linear layer computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
-Since we use vector computations to hide the weight prefetching pipeline, it has effect on computation, if you prioritize low latency over high throughput, then it is best not to enable prefetching.
+Since we use vector computations to hide the weight prefetching pipeline, this has an effect on computation. If you prioritize low latency over high throughput, it is best not to enable prefetching.
## Quick Start
@@ -35,39 +35,39 @@ Notes:
1) For MoE model:
```shell
--additional-config \
'{
    "weight_prefetch_config": {
        "enabled": true,
        "prefetch_ratio": {
            "attn": {
                "qkv": 1.0,
                "o": 1.0
            },
            "moe": {
                "gate_up": 0.8
            }
        }
    }
}'
```
2) For dense model:
The following default configuration gives good performance with `--max-num-seqs` set to 72 for Qwen3-32B-W8A8:
```shell
--additional-config \
'{
    "weight_prefetch_config": {
        "enabled": true,
        "prefetch_ratio": {
            "mlp": {
                "gate_up": 1.0,
                "down": 1.0
            }
        }
    }
}'
```