[main][Docs] Fix typos across documentation (#6728)

## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main: 9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Commit 6de207de88 (parent b6bc3d2f9d)
Authored by Cao Yi on 2026-02-13 15:50:05 +08:00, committed by GitHub
50 changed files with 273 additions and 272 deletions

View File

@@ -2,7 +2,7 @@
## Overview
Fine-Grained Tensor Parallelism (Finegrained TP) extends standard tensor parallelism by enabling **independent tensor parallel sizes for different model components**. Instead of applying a single global `tensor_parallel_size` to all layers, Finegrained TP allows users to configure separate TP size for key modules—such as embedding, language model head (lm_head), attention output projection (oproj), and MLP blocks—via the `finegrained_tp_config` parameter.
Fine-Grained Tensor Parallelism (Fine-grained TP) extends standard tensor parallelism by enabling **independent tensor-parallel sizes for different model components**. Instead of applying a single global `tensor_parallel_size` to all layers, Fine-grained TP allows users to configure separate TP sizes for key modules—such as embedding, language model head (lm_head), attention output projection (o_proj), and MLP blocks—via the `finegrained_tp_config` parameter.
This capability supports heterogeneous parallelism strategies within a single model, providing finer control over weight distribution, memory layout, and communication patterns across devices. The feature is compatible with standard dense transformer architectures and integrates seamlessly into vLLM's serving pipeline.
@@ -12,10 +12,10 @@ This capability supports heterogeneous parallelism strategies within a single mo
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
- **Reduced Per-Device Memory Footprint**
Finegrained TP shards large weight matrices（如 LM Head、o_proj）across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
- **Reduced Per-Device Memory Footprint**:
Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
- **Faster Memory Access in GEMMs**
- **Faster Memory Access in GEMMs**:
In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
Together, these effects allow practitioners to better balance memory, communication, and compute—particularly in high-concurrency serving scenarios—while maintaining compatibility with standard dense transformer models.
@@ -26,7 +26,7 @@ Together, these effects allow practitioners to better balance memory, communicat
### Models
Finegrained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.
Fine-grained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.
### Component & Execution Mode Support
@@ -57,7 +57,7 @@ The Fine-Grained TP size for any component must:
### Configuration Format
Finegrained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.
Fine-grained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.
```bash
--additional-config '{
@@ -90,7 +90,7 @@ vllm serve deepseek-ai/DeepSeek-R1 \
## Experimental Results
To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD-separated decode instances in the environment of 32 cards Ascend 910B*64G (A2), with parallel configuration as DP32+EP32, and fine-grained TP size of 8, the performance data is as follows.
To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8** and deploy PD-separated decode instances on 32 Ascend 910B 64 GB cards (A2), with a parallel configuration of DP32+EP32 and a fine-grained TP size of 8; the performance data is as follows.
| Module | Memory Savings | TPOT Impact (batch=24) |
| ---------------- | -------------- | ------------------------- |

View File

@@ -10,7 +10,7 @@ To enable MTP for DeepSeek-V3 models, add the following parameter when starting
--speculative_config ' {"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False} '
- `num_speculative_tokens`: The number of speculative tokens which enable model to predict multiple tokens at once, if provided. It will default to the number in the draft model config if present, otherwise, it is required.
- `num_speculative_tokens`: The number of speculative tokens, which enables the model to predict multiple tokens at once, if provided. It defaults to the number in the draft model config if present; otherwise, it is required.
- `disable_padded_drafter_batch`: Disable input padding for speculative decoding. If set to True, speculative input batches can contain sequences of different lengths, which may only be supported by certain attention backends. This currently only affects the MTP method of speculation; the default is False.
## How It Works
@@ -28,7 +28,7 @@ vllm_ascend
**1. sample**
- *rejection_sample.py*: During decoding, the main model processes the previous rounds output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token—referred to as the **bonus token**—is uncertain since it is derived from speculative prediction, thus We employ **Greedy Strategy** and **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.
- *rejection_sample.py*: During decoding, the main model processes the previous round's output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token—referred to as the **bonus token**—is uncertain since it is derived from speculative prediction; thus we employ **Greedy Strategy** and **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.
```shell
rejection_sample.py
@@ -38,9 +38,9 @@ rejection_sample.py
**2. spec_decode**
This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token ids. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.
This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token IDs. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.
- *mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding where proposals are generated by deepseek mtp layer.
- *mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding where proposals are generated by the DeepSeek MTP layer.
```shell
mtp_proposer.py
@@ -54,7 +54,7 @@ mtp_proposer.py
### Algorithm
**1. Reject_Sample**
**1. Rejection Sampling**
- *Greedy Strategy*
@@ -68,17 +68,17 @@ For each draft token, acceptance is determined by verifying whether the inequali
The decision logic for each draft token is as follows: if the inequality `P_target / P_draft ≥ U` holds, the draft token is accepted as output; conversely, if `P_target / P_draft < U`, the draft token is rejected.
When a draft token is rejected, a recovery sampling process is triggered where a "recovered token" is resampled from the adjusted probability distribution defined as `Q = max(P_target - P_draft, 0)`. In the current MTP implementation, since `P_draft` is not provided and defaults to 1, the formulas simplify such that token acceptance occurs when `P_target ≥ U,` and the recovery distribution becomes `Q = max(P_target - 1, 0)`.
When a draft token is rejected, a recovery sampling process is triggered where a "recovered token" is resampled from the adjusted probability distribution defined as `Q = max(P_target - P_draft, 0)`. In the current MTP implementation, since `P_draft` is not provided and defaults to 1, the formulas simplify such that token acceptance occurs when `P_target ≥ U` and the recovery distribution becomes `Q = max(P_target - 1, 0)`.
**2. Performance**
If the bonus token is accepted, the MTP model performs inference for (num_speculative +1) tokens, including original main model output token and bonus token. If rejected, inference is performed for less token, determining on how many tokens accepted.
If the bonus token is accepted, the MTP model performs inference for (num_speculative + 1) tokens, including the original main model output token and the bonus token. If rejected, inference is performed for fewer tokens, depending on how many tokens are accepted.
## DFX
### Method Validation
- Currently, the spec_decode scenario only supports methods such as ngram, eagle, eagle3, and mtp. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.
- Currently, the spec_decode scenario only supports methods such as n-gram, EAGLE, EAGLE3, and MTP. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.
```python
def get_spec_decode_method(method,
@@ -112,4 +112,4 @@ if self.speculative_config:
## Limitations
- Because only a single layer of weights is exposed in DeepSeek's MTP, the accuracy and performance are not effectively guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15.
- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).
- In the fullgraph mode with MTP > 1, the capture size of each ACLGraph must be an integer multiple of (num_speculative_tokens + 1).

View File

@@ -9,7 +9,7 @@ This guide shows how to use Context Parallel, a long sequence inference optimiza
Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite different characteristics and have quite different SLO (service level objectives), we need to implement context parallel separately for them. The major considerations are:
- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batchsize (and hence the throughput).
- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batch size (and hence the throughput).
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).
@@ -54,19 +54,19 @@ You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and `decode_co
--prefill-context-parallel-size 2 \
```
The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`, so the examples above need 4 NPUs for each.
The total world size is `tensor_parallel_size` * `prefill_context_parallel_size`, so the examples above need 4 NPUs for each.
## Constraints
- While using DCP, the following constraints must be met:
- For MLA based model, such as Deepseek-R1:
- For MLA-based models, such as DeepSeek-R1:
- `tensor_parallel_size >= decode_context_parallel_size`
- `tensor_parallel_size % decode_context_parallel_size == 0`
- For GQA based model, such as Qwen3-235B:
- For GQA-based models, such as Qwen3-235B:
- `(tensor_parallel_size // num_key_value_heads) >= decode_context_parallel_size`
- `(tensor_parallel_size // num_key_value_heads) % decode_context_parallel_size == 0`
- While using Context Parallel in KV cache transfer needed scenario (e.g. KV pooling, PD-disaggregation), to simplify KV cache transmission, `cp_kv_cache_interleave_size` must be set to the same value of KV cache `block_size`(default: 128), which specify cp to split KV cache in a block-interleave style. For example:
- While using Context Parallel in scenarios that require KV cache transfer (e.g. KV pooling, PD disaggregation), to simplify KV cache transmission, `cp_kv_cache_interleave_size` must be set to the same value as the KV cache `block_size` (default: 128), which makes CP split the KV cache in a block-interleaved style. For example:
```shell
vllm serve deepseek-ai/DeepSeek-V2-Lite \
@@ -79,7 +79,7 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`
## Experimental Results
To evaluate the effectiveness of Context Parallel in in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, deploy PD-disaggregate instances in the environment of 64 cards Ascend 910C*64G (A3), the configuration and performance data are as follows.
To evaluate the effectiveness of Context Parallel in long-sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B** and deploy PD-disaggregated instances on 64 Ascend 910C 64 GB cards (A3); the configuration and performance data are as follows.
- DeepSeek-R1-W8A8:

View File

@@ -3,7 +3,7 @@
Dynamic batch is a technique that dynamically adjusts the chunk size during each inference iteration within the chunked prefill strategy according to the resources and SLO targets, thereby improving the effective throughput and decreasing the TBT.
Dynamic batch is controlled by the value of the `--SLO_limits_for_dynamic_batch`.
Notably, only 910 B3 is supported with decode token numbers scales below 2048 so far.
Notably, only the 910B3 is supported so far, with decode token counts below 2048.
The improvements are especially noticeable on Qwen and Llama models.
We are working on further improvements, and this feature will support more XPUs in the future.
@@ -11,16 +11,16 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites
1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
2. `Pandas` is needed to load the lookup table, in case `pandas` is not installed.
2. `pandas` is needed to load the lookup table. If it is not already installed, run:
```bash
pip install pandas
```
### Tuning Parameter
`--SLO_limits_for_dynamic_batch` is the tuning parameters (integer type) for the dynamic batch feature, greater values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
### Tuning Parameters
`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature; larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
```python
--SLO_limits_for_dynamic_batch =-1 # default value, dynamic batch disabled.
@@ -45,7 +45,7 @@ vllm serve Qwen/Qwen2.5-14B-Instruct\
--tensor_parallel_size 8 \
--load_format dummy \
--max_num_batched_tokens 1024 \
--max_model_len 9000 \
--max-model-len 9000 \
--host localhost \
--port 12091 \
--gpu-memory-utilization 0.9 \

View File

@@ -16,17 +16,17 @@ Expert balancing for MoE models in LLM serving is essential for optimal performa
### Models
DeepseekV3/V3.1/R1、Qwen3-MOE
DeepSeekV3/V3.1/R1, Qwen3-MoE
### MOE QuantType
W8A8-dynamic
W8A8-Dynamic
## How to Use EPLB
### Dynamic EPLB
We need to add environment variable `export DYNAMIC_EPLB="true"` to enable vllm eplb. Enable dynamic balancing with auto-tuned parameters. Adjust expert_heat_collection_interval and algorithm_execution_interval based on workload patterns.
We need to add the environment variable `export DYNAMIC_EPLB="true"` to enable vLLM EPLB. Enable dynamic balancing with auto-tuned parameters. Adjust `expert_heat_collection_interval` and `algorithm_execution_interval` based on workload patterns.
```shell
vllm serve Qwen/Qwen3-235B-A22 \
@@ -87,7 +87,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
4. Monitoring & Validation:
- Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.
- Use vllm monitor to detect imbalances during runtime.
- Use vLLM monitor to detect imbalances during runtime.
- Always verify expert map JSON structure before loading (validate with jq or similar tools).
5. Startup Behavior:

View File

@@ -1,6 +1,6 @@
# External DP
For larger scale deployments especially, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.
For larger-scale deployments especially, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.
In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
@@ -8,8 +8,8 @@ In this case, it's more convenient to treat each DP rank like a separate vLLM de
The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) is already natively supported by vLLM. In vllm-ascend we provide two enhanced functionalities:
1. A launch script which helps to launch multi vllm instances in one command.
2. A request-length-aware load balance proxy for external dp.
1. A launch script that helps to launch multiple vLLM instances in one command.
2. A request-length-aware load-balance proxy for external DP.
This tutorial introduces how to use them.
@@ -24,9 +24,9 @@ pip install fastapi httpx uvicorn
## Starting External DP Servers
First you need to have at least two vLLM servers running in data parallel. These can be mock servers or actual vLLM servers. Note that this proxy also works with only one vLLM server running, but will fall back to direct request forwarding which is meaningless.
First, you need to have at least two vLLM servers running in data parallel. These can be mock servers or actual vLLM servers. Note that this proxy also works with only one vLLM server running, but it will fall back to direct request forwarding, which provides no load-balancing benefit.
You can start external vLLM dp servers one-by-one manually or using the launch script in `examples/external_online_dp`. For scenarios of large dp size across multi nodes, we recommend using our launch script for convenience.
You can start external vLLM DP servers one-by-one manually or using the launch script in `examples/external_online_dp`. For scenarios of large DP size across multiple nodes, we recommend using our launch script for convenience.
### Manually Launch
@@ -38,7 +38,7 @@ vllm serve --host 0.0.0.0 --port 8101 --data-parallel-size 2 --data-parallel-ran
### Use Launch Script
Firstly, you need to modify the `examples/external_online_dp/run_dp_template.sh` according to your vLLM configuration. Then you can use `examples/external_online_dp/launch_online_dp.py` to launch multiple vLLM instances in one command each node. It will internally call `examples/external_online_dp/run_dp_template.sh` for each DP rank with proper DP-related parameters.
Firstly, you need to modify the `examples/external_online_dp/run_dp_template.sh` according to your vLLM configuration. Then you can use `examples/external_online_dp/launch_online_dp.py` to launch multiple vLLM instances in one command on each node. It will internally call `examples/external_online_dp/run_dp_template.sh` for each DP rank with proper DP-related parameters.
An example of running external DP on a single node:
@@ -65,9 +65,9 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 2 --dp-rank-s
## Starting Load-balance Proxy Server
After all vLLM DP instances are launched, you can now launch the load-balance proxy server which serves as entrypoint for coming requests and load balance them between vLLM DP instances.
After all vLLM DP instances are launched, you can now launch the load-balance proxy server, which serves as an entrypoint for incoming requests and load-balances them between vLLM DP instances.
The proxy server has following features:
The proxy server has the following features:
- Load balances requests to multiple vLLM servers based on request length.
- Supports OpenAI-compatible `/v1/completions` and `/v1/chat/completions` endpoints.
@@ -88,4 +88,4 @@ python dp_load_balance_proxy_server.py \
--dp-ports 9000 9001 \
```
After this, you can directly send requests to the proxy server and run DP with external load-balance.
After this, you can directly send requests to the proxy server and run DP with external load balancing.

View File

@@ -8,16 +8,16 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
## Getting Started
From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.
From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are two kinds for graph mode supported by vLLM Ascend:
There are two kinds of graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
- **XliteGraph**: This is the openeuler xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, Qwen MoE series models, and Qwen3-vl are supported.
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and DeepSeek series models are well tested.
- **XliteGraph**: This is the OpenEuler Xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, Qwen MoE series models, and Qwen3-VL are supported.
## Using ACLGraph
ACLGraph is enabled by default. Take Qwen series models as an example, just set to use V1 Engine is enough.
ACLGraph is enabled by default. Taking Qwen series models as an example, simply using the V1 Engine is enough.
Offline example:
@@ -38,7 +38,7 @@ vllm serve Qwen/Qwen2-7B-Instruct
## Using XliteGraph
If you want to run Llama, Qwen dense series models, Qwen MoE series models, or Qwen3-vl with xlite graph mode, please install xlite, and set xlite_graph_config.
If you want to run Llama, Qwen dense series models, Qwen MoE series models, or Qwen3-VL with Xlite graph mode, please install xlite and set `xlite_graph_config`.
```bash
pip install xlite
@@ -61,11 +61,11 @@ Online example:
vllm serve path/to/Qwen3-32B --tensor-parallel-size 8 --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'
```
You can find more details about xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
You can find more details about Xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md).
## Fallback to the Eager Mode
If `ACLGraph` and `XliteGraph` all fail to run, you should fallback to the eager mode.
If `ACLGraph` and `XliteGraph` all fail to run, you should fall back to the eager mode.
Offline example:

View File

@@ -1,9 +1,9 @@
# Distributed DP Server With Large Scale Expert Parallelism
# Distributed DP Server With Large-Scale Expert Parallelism
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large scale **Expert Parallelism (EP)** scenario. To achieve better performance，the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of PD nodes, thereby enabling more flexible model deployment. \
Take the deepseek model as an example, use 8 Atlas 800T A3 servers to deploy the model. Assume the ip of the servers start from 192.0.0.1, and end by 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. And the prefiller nodes deployed as master node independently, the decoder nodes set 192.0.0.5 node to be the master node.
vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of PD nodes, thereby enabling more flexible model deployment. \
Taking the DeepSeek model as an example, we use 8 Atlas 800T A3 servers to deploy the model. Assume the IPs of the servers range from 192.0.0.1 to 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. Each prefiller node is deployed as a master node independently, while the decoder nodes use the 192.0.0.5 node as the master node.
## Verify Multi-Node Communication Environment
@@ -51,7 +51,7 @@ for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-i
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
@@ -87,7 +87,7 @@ for i in {0..7}; do hccn_tool -i $i -ip -g;done
3. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
@@ -95,11 +95,11 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
:::::
## Large Scale EP model deployment
## Large-Scale EP Model Deployment
### Generate script with configurations
In the PD separation scenario, we provide a optimized configuration. You can use the following shell script for configuring the prefiller and decoder nodes respectively.
In the PD separation scenario, we provide an optimized configuration. You can use the following shell script for configuring the prefiller and decoder nodes respectively.
:::::{tab-set}
@@ -140,7 +140,7 @@ export VLLM_USE_MODELSCOPE="True"
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
@@ -207,7 +207,7 @@ export VLLM_USE_MODELSCOPE="True"
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
@@ -254,7 +254,7 @@ dp_size = 2 # total number of DP engines for decode/prefill
dp_size_local = 2 # number of DP engines on the current node
dp_rank_start = 0 # starting DP rank for the current node
# dp_ip is different on prefiller nodes in this example
dp_ip = "192.0.0.1" # master node ip for DP communication
dp_ip = "192.0.0.1" # master node IP for DP communication
dp_port = 13395 # port used for DP communication
engine_port = 9000 # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"
@@ -288,7 +288,7 @@ dp_size = 64 # total number of DP engines for decode/prefill
dp_size_local = 16 # number of DP engines on the current node
dp_rank_start = 0 # starting DP rank for the current node. e.g. 0/16/32/48
# dp_ip is the same on decoder nodes in this example
dp_ip = "192.0.0.5" # master node ip for DP communication.
dp_ip = "192.0.0.5" # master node IP for DP communication.
dp_port = 13395 # port used for DP communication
engine_port = 9000 # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"
@@ -314,7 +314,7 @@ for process in processes:
:::::
Note that the prefiller nodes and the decoder nodes may have different configurations. In this example, each prefiller node deployed as master node independently, but all decoder nodes take the first node as the master node. So it leads to difference in 'dp_size_local' and 'dp_rank_start'
Note that the prefiller nodes and the decoder nodes may have different configurations. In this example, each prefiller node is deployed as a master node independently, while the decoder nodes use the 192.0.0.5 node as the master node. This leads to differences in `dp_size_local` and `dp_rank_start`.
## Example proxy for Distributed DP Server
@@ -365,7 +365,7 @@ You can get the proxy program in the repository's examples, [load\_balance\_prox
## Benchmark
We recommend use aisbench tool to assess performance. [aisbench](https://gitee.com/aisbench/benchmark) Execute the following commands to install aisbench
We recommend using the [aisbench](https://gitee.com/aisbench/benchmark) tool to assess performance. Execute the following commands to install aisbench:
```shell
git clone https://gitee.com/aisbench/benchmark.git
@@ -373,7 +373,7 @@ cd benchmark/
pip3 install -e ./
```
You need to canncel the http proxy before assessing performance, as following
You need to unset the HTTP proxy before assessing performance, as follows:
```shell
# unset proxy
@@ -381,8 +381,8 @@ unset http_proxy
unset https_proxy
```
- You can place your datasets in the dir: `benchmark/ais_bench/datasets`
- You can change the configuration in the dir :`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take the ``vllm_api_stream_chat.py`` for examples
- You can place your datasets in the directory: `benchmark/ais_bench/datasets`
- You can change the configuration in the directory `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take `vllm_api_stream_chat.py` as an example:
```python
models = [
@@ -408,23 +408,23 @@ models = [
]
```
- Take gsm8k dataset for example, execute the following commands to assess performance.
- Taking the gsm8k dataset as an example, execute the following commands to assess performance.
```shell
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --mode perf
```
- For more details for commands and parameters for aisbench, refer to [aisbench](https://gitee.com/aisbench/benchmark)
- For more details on commands and parameters for aisbench, refer to [aisbench](https://gitee.com/aisbench/benchmark)
## Prefill & Decode Configuration Details
In the PD separation scenario, we provide a optimized configuration.
In the PD separation scenario, we provide an optimized configuration.
- **prefiller node**
1. set HCCL_BUFFSIZE=256
2. Add the '--enforce-eager' flag to 'vllm serve'
3. Take '--kv-transfer-config' as follow
3. Take '--kv-transfer-config' as follows:
```shell
--kv-transfer-config \
@@ -437,7 +437,7 @@ In the PD separation scenario, we provide a optimized configuration.
}'
```
4. Take '--additional-config' as follow
4. Take '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
@@ -446,7 +446,7 @@ In the PD separation scenario, we provide a optimized configuration.
- **decoder node**
1. set HCCL_BUFFSIZE=1024
2. Take '--kv-transfer-config' as follow
2. Take '--kv-transfer-config' as follows:
```shell
--kv-transfer-config
@@ -459,7 +459,7 @@ In the PD separation scenario, we provide a optimized configuration.
}'
```
3. Take '--additional-config' as follow
3. Take '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true}'
@@ -467,13 +467,13 @@ In the PD separation scenario, we provide a optimized configuration.
### Parameters Description
1.'--additional-config' Parameter Introduction:
1. '--additional-config' Parameter Introduction:
- **"enable_weight_nz_layout"** Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"** Whether to enable DeepSeek models' prefill optimizations.
- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
<br>
3.enable MTP
2. Enable MTP
Add the following command to your configurations.
```shell
@@ -482,7 +482,7 @@ Add the following command to your configurations.
### Recommended Configuration Example
For example，if the average input length is 3.5k, and the output length is 1.1k, the context length is 16k, the max length of the input dataset is 7K. In this scenario, we give a recommended configuration for distributed DP server with high EP. Here we use 4 nodes for prefill and 4 nodes for decode.
For example, suppose the average input length is 3.5k, the output length is 1.1k, the context length is 16k, and the max length of the input dataset is 7k. In this scenario, we give a recommended configuration for the distributed DP server with high EP. Here we use 4 nodes for prefill and 4 nodes for decode.
| node | DP | TP | EP | max-model-len | max-num-batched-tokens | max-num-seqs | gpu-memory-utilization |
|----------|----|----|----|---------------|------------------------|--------------|-----------|
@@ -495,6 +495,6 @@ Note that these configurations are not related to optimization. You need to adju
## FAQ
### 1. Prefiller nodes need to warmup
### 1. Prefiller nodes need to warm up
Since the computation of some NPU operators requires several rounds of warm-up to achieve best performance, we recommend preheating the service with some requests before conducting performance tests to achieve the best end-to-end throughput.

View File

@@ -7,7 +7,7 @@
Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand via asynchronous broadcast** during the forward pass.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand** via asynchronous broadcast during the forward pass.
As illustrated in the figure below, this design enables broadcasts to prefetch weights: while the current layer (e.g., MLA or MOE) is being computed, the system **asynchronously broadcasts the next layer's weight** in the background. Because the attention computation in the MLA module is sufficiently latency-bound, the weight transfer for `o_proj` is **fully overlapped with computation**, making the communication **latency-free from the perspective of end-to-end inference**.
@@ -23,7 +23,7 @@ This approach **preserves exact computational semantics** while **significantly
![layer shard](./images/layer_sharding.png)
> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast pre-fetches the next layer's weight while the current layer computes—enabling zero-overhead weight loading.
> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes—enabling **zero-overhead** weight loading.
---

View File

@@ -24,7 +24,7 @@ The server runs alongside normal inference tasks via sub-threads and via `statel
### Application Scenarios
- **Reduce startup latency**: By reusing already loaded weights and transferring them directly between NPU cards, Netloader cuts down model loading time vs conventional remote/local pull strategies.
- **Reduce startup latency**: By reusing already loaded weights and transferring them directly between NPU cards, Netloader cuts down model loading time versus conventional remote/local pull strategies.
- **Relieve network & storage load**: Avoid repeated downloads of weight files from remote repositories, thus reducing pressure on central storage and network traffic.
- **Improve resource utilization & lower cost**: Faster loading allows less reliance on standby compute nodes; resources can be scaled up/down more flexibly.
- **Enhance business continuity & high availability**: In failure recovery, new instances can quickly take over without long downtime, improving system reliability and user experience.
@@ -37,7 +37,7 @@ To enable Netloader, pass `--load-format=netloader` and provide configuration vi
| Field Name | Type | Description | Allowed Values / Notes |
|--------------------|---------|------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| **SOURCE** | List | Weighted data sources. Each item is a map with `device_id` and `sources`, specifying the rank and its endpoints (IP:port). <br>Example: `{"SOURCE": [{"device_id": 0, "sources": ["10.170.22.152:19374"]}, {"device_id": 1, "sources": ["10.170.22.152:11228"]}]}` <br>If omitted or empty, fallback to default loader. The SOURCE here is second priority. | A list of objects with keys `device_id: int` and `sources: List[str]` |
| **SOURCE** | List | Weight data sources. Each item is a map with `device_id` and `sources`, specifying the rank and its endpoints (IP:port). <br>Example: `{"SOURCE": [{"device_id": 0, "sources": ["10.170.22.152:19374"]}, {"device_id": 1, "sources": ["10.170.22.152:11228"]}]}` <br>If omitted or empty, fallback to default loader. The SOURCE here is second priority. | A list of objects with keys `device_id: int` and `sources: List[str]` |
| **MODEL** | String | The model name, used to verify consistency between client and server. | Defaults to the `--model` argument if not specified. |
| **LISTEN_PORT** | Integer | Base port for the server listener. | The actual port = `LISTEN_PORT + RANK`. If omitted, a random valid port is chosen. Valid range: 1024–65535. If out of range, that server instance won't open a listener. |
| **INT8_CACHE** | String | Behavior for handling int8 parameters in quantized models. | One of `["hbm", "dram", "no"]`. <br> - `hbm`: copy original int8 parameters to high-bandwidth memory (HBM) (may cost a lot of HBM). <br> - `dram`: copy to DRAM. <br> - `no`: no special handling (may lead to divergence or unpredictable behavior). Default: `"no"`. |

View File

@@ -8,7 +8,7 @@ Model quantization is a technique that reduces model size and computational over
>
> You can choose to convert the model yourself or use the quantized model we uploaded.
> See <https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8>.
> Before you quantize a model, ensure that the RAM size is enough.
> Before you quantize a model, ensure sufficient RAM is available.
## Quantization Tools

View File

@@ -2,7 +2,7 @@
## Overview
Sleep Mode is an API designed to offload model weights and discard KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.
Sleep Mode is an API designed to offload model weights and discard KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs autoregressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.
Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free KV cache and even offload model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.
@@ -71,7 +71,7 @@ The following is a simple example of how to use sleep mode.
- Online serving:
:::{note}
Considering there may be a risk of malicious access, please make sure you are under a dev-mode, and explicit specify the dev environment `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
Considering there may be a risk of malicious access, please make sure you are in dev mode, and explicitly set the dev environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::
```bash

View File

@@ -150,3 +150,4 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

View File

@@ -7,7 +7,7 @@ Unified Cache Management (UCM) provides an external KV-cache storage layer desig
## Prerequisites
* OS: Linux
* A hardware with Ascend NPU. Its usually the Atlas 800 A2 series.
* Hardware with Ascend NPUs. It's usually the Atlas 800 A2 series.
* **vLLM: main branch**
* **vLLM Ascend: main branch**
@@ -17,7 +17,7 @@ Unified Cache Management (UCM) provides an external KV-cache storage layer desig
## Configure UCM for Prefix Caching
Modify the UCM configuration file to specify which UCM connector to use and where KV blocks should be stored.
Modify the UCM configuration file to specify which UCM connector to use and where KV blocks should be stored.
You may directly edit the example file at:
`unified-cache-management/examples/ucm_config_example.yaml`
@@ -78,7 +78,7 @@ vllm serve Qwen/Qwen2.5-14B-Instruct \
**⚠️ Make sure to replace `"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**
If you see log as below:
If you see the log below:
```bash
INFO: Started server process [1049932]

View File

@@ -2,7 +2,7 @@
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution. Linear layers sometimes exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as quantize, MoE gating top_k, RMSNorm and SwiGlu. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the linear layer computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
Since we use vector computations to hide the weight prefetching pipeline, it has effect on computation, if you prioritize low latency over high throughput, then it it best not to enable prefetching.
Since we use vector computations to hide the weight prefetching pipeline, it has an effect on computation; if you prioritize low latency over high throughput, it is best not to enable prefetching.
## Quick Start
@@ -10,25 +10,25 @@ With `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to op
## Fine-tune Prefetch Ratio
Since weight prefetch use vector computations to hide the weight prefetching pipeline, the setting of the prefetch size is crucial. If the size is too small, the optimization benefits will not be fully realized, while a larger size may lead to resource contention, resulting in performance degradation. To accommodate different scenarios, we have add `prefetch_ratio` to allow for flexible size configuration based on the specific workload, detail as following:
Since weight prefetch uses vector computations to hide the weight prefetching pipeline, the setting of the prefetch size is crucial. If the size is too small, the optimization benefits will not be fully realized, while a larger size may lead to resource contention, resulting in performance degradation. To accommodate different scenarios, we have added `prefetch_ratio` to allow for flexible size configuration based on the specific workload, details as follows:
With `prefetch_ratio` in `"weight_prefetch_config"` to custom the weight prefetch ratio for specify linear layers.
Use `prefetch_ratio` in `"weight_prefetch_config"` to customize the weight prefetch ratio for specific linear layers.
The “attn” and “moe” configuration options are used for MoE model, detail as following:
The “attn” and “moe” configuration options are used for MoE models, details as follows:
`"attn": { "qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}`
The “mlp” configuration option is used to optimize the performance of the Dense model, detail as following:
The “mlp” configuration option is used to optimize the performance of dense models, details as follows:
`"mlp": {"gate_up": 1.0, "down": 1.0}`
Above value are the default config, the default value has a good performance for Qwen3-235B-A22B-W8A8 when `--max-num-seqs`is 144, for Qwen3-32B-W8A8 when `--max-num-seqs`is 72.
The values above are the default config; the defaults give good performance for Qwen3-235B-A22B-W8A8 when `--max-num-seqs` is 144, and for Qwen3-32B-W8A8 when `--max-num-seqs` is 72.
However, this may not be the optimal configuration for your scenario. For higher concurrency, you can try increasing the prefetch size. For lower concurrency, prefetching may not offer any advantages, so you can decrease the size or disable prefetching. Determine if the prefetch size is appropriate by collecting profiling data. Specifically, check if the time required for the prefetch operation (e.g., MLP Down Proj weight prefetching) overlaps with the time required for parallel vector computation operators (e.g., SwiGlu computation), and whether the prefetch operation is no later than the completion time of the vector computation operator. In the profiling timeline, a prefetch operation appears as a CMO operation on a single stream; this CMO operation is the prefetch operation.
Notices:
Notes:
1) Weight prefetch of MLP `down` project prefetch dependence sequence parallel, if you want open for mlp `down` please also enable sequence parallel.
1) Weight prefetch for the MLP `down` projection depends on sequence parallelism; if you want to enable it for the MLP `down` projection, please also enable sequence parallelism.
2) Due to the current size of the L2 cache, the maximum prefetch cannot exceed 18MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024` bytes, the backend will only prefetch 18MB.
## Example
@@ -55,7 +55,7 @@ Notices:
2) For dense model:
Following is the default configuration that can get a good performance for `--max-num-seqs`is 72 for Qwen3-32B-W8A8
The following is the default configuration, which gives good performance with `--max-num-seqs` 72 for Qwen3-32B-W8A8:
```shell
--additional-config \