[main][Docs] Fix typos across documentation (#6728)

## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main: 9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Cao Yi
2026-02-13 15:50:05 +08:00
committed by GitHub
parent b6bc3d2f9d
commit 6de207de88
50 changed files with 273 additions and 272 deletions

View File

@@ -1,6 +1,6 @@
# Long-Sequence Context Parallel (Deepseek)
## Getting Start
## Getting Started
:::{note}
The context parallel feature is currently only supported on Atlas A3 devices, and will be supported on Atlas A2 in the future.
@@ -8,13 +8,13 @@ Context parallel feature currently is only supported on Atlas A3 device, and wil
vLLM-Ascend now supports long sequence with context parallel options. This guide takes one-by-one steps to verify these features with constrained resources.
Take the Deepseek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 servers to deploy the “1P1D” architecture. Node p is deployed across multiple machines, while node d is deployed on a single machine. Assume the ip of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1). On each server, use 8 NPUs 16 chips to deploy one service instance.In the current example, we will enable the context parallel feature on node p to improve TTFT. Although enabling the DCP feature on node d can reduce memory usage, it would introduce additional communication and small operator overhead. Therefore, we will not enable the DCP feature on node d.
Take the Deepseek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 servers to deploy the “1P1D” architecture. Node p is deployed across multiple machines, while node d is deployed on a single machine. Assume the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1). On each server, use 8 NPUs (16 chips) to deploy one service instance. In the current example, we will enable the context parallel feature on node p to improve TTFT. Although enabling the DCP feature on node d can reduce memory usage, it would introduce additional communication and small operator overhead. Therefore, we will not enable the DCP feature on node d.
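For reference, the node layout assumed in this example can be written down as shell variables (a convenience sketch, not part of the guide's deployment scripts; adjust the addresses to your environment):

```bash
# Assumed 1P1D node layout (placeholder addresses from the example above).
export PREFILL_1_IP=192.0.0.1   # prefill node 1
export PREFILL_2_IP=192.0.0.2   # prefill node 2
export DECODE_1_IP=192.0.0.3    # decode node 1
```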
## Environment Preparation
### Model Weight
- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
- `DeepSeek-V3.1_w8a8mix_mtp` (Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
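One possible way to fetch the weights into that shared directory is the ModelScope CLI (an illustrative assumption, not a step from this guide; any download method works):

```bash
# Sketch: download the quantized weights into the shared cache directory.
pip install modelscope
modelscope download --model Eco-Tech/DeepSeek-V3.1-w8a8 \
    --local_dir /root/.cache/DeepSeek-V3.1-w8a8
```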
@@ -24,9 +24,9 @@ Refer to [verify multi-node communication environment](../../installation.md#ver
### Installation
You can use our official docker image to run `DeepSeek-V3.1` directly.
You can use our official Docker image to run `DeepSeek-V3.1` directly.
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
Select an image based on your machine type and start the Docker image on your node, refer to [using Docker](../../installation.md#set-up-using-docker).
```{code-block} bash
:substitutions:
@@ -35,7 +35,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running bridge network with Docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -273,7 +273,7 @@ vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
:::::
2. Prefill master node `proxy.sh` scripts
2. Prefill master node `proxy.sh` script
```shell
python load_balance_proxy_server_example.py \
@@ -289,7 +289,7 @@ python load_balance_proxy_server_example.py \
8004
```
3. run proxy
3. Run proxy
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
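If your node has network access, one way (an assumption, not a step from this guide) to fetch that example script is its raw GitHub URL:

```bash
# Sketch: download the example proxy script referenced above.
wget https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
```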
@@ -318,12 +318,12 @@ The parameters are explained as follows:
"cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
- "cudagraph_capture_sizes": represents different levels of graph modes. The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. In the graph mode, the input for graphs at different levels is fixed, and inputs between levels are automatically padded to the next level. Currently, the default setting is recommended. Only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tensor-parallel-size > 1.
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the pd co-locate deployment scenario. It will be removed in the future.
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the PD co-locate deployment scenario. It will be removed in the future.
**Notice:**
- tensor-parallel-size needs to be divisible by decode-context-parallel-size.
- decode-context-parallel-size must less than or equal to tensor-parallel-size.
- decode-context-parallel-size must be less than or equal to tensor-parallel-size.
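As a minimal sketch of how the parameters and constraints above fit together (assumed values only, not the full serve command from this guide, and assuming the graph options are passed through vLLM's `--compilation-config` JSON): with tensor-parallel-size 8 and decode-context-parallel-size 2, 8 is divisible by 2 and 2 <= 8, so both notices are satisfied.

```bash
# Sketch: combine the optimizations discussed above (values are assumptions).
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 2 \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```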
## Accuracy Evaluation
@@ -361,7 +361,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompt 20 --request-rate 0 --save-result --result-dir ./
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompts 20 --request-rate 0 --save-result --result-dir ./
```
After several minutes, you can get the performance evaluation result.

View File

@@ -1,16 +1,16 @@
# Long-Sequence Context Parallel (Qwen3-235B-A22B)
## Getting Start
## Getting Started
vLLM-Ascend now supports long-sequence context parallel. This guide takes one-by-one steps to verify these features with constrained resources.
Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single node "pd co-locate" architecture.
Using the `Qwen3-235B-A22B-w8a8` (Quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single node "pd co-locate" architecture.
## Environment Preparation
### Model Weight
- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
- `Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -25,7 +25,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running bridge network with Docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -65,7 +65,7 @@ docker run --rm \
### Single-node Deployment
`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16).
Quantized version need to start with parameter `--quantization ascend`.
Quantized version needs to start with parameter `--quantization ascend`.
Run the following script to execute online 128k inference.
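A hedged sketch of what the script's core command might look like (the full script is elided in this excerpt; every flag below comes from the parameter list that follows, except `--max-model-len 131072`, which is assumed for the 128k case):

```bash
# Sketch only: serve command assembled from the parameters explained below.
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
    --quantization ascend \
    --tensor-parallel-size 8 \
    --prefill-context-parallel-size 2 \
    --decode-context-parallel-size 2 \
    --max-model-len 131072
```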
@@ -111,8 +111,8 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
The parameters are explained as follows:
- `--tensor-parallel-size 8` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size` 2 are common settings for prefill context parallelism PCP) sizes.
- `--decode-context-parallel-size` 2 are common settings for decode context parallelism DCP) sizes.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 2` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
@@ -131,7 +131,7 @@ The parameters are explained as follows:
**Notice:**
- tp_size needs to be divisible by dcp_size
- decode context parallel size must less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.
- decode context parallel size must be less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.
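As a worked example: assuming the model has 4 total KV heads (an illustrative assumption), with tensor_parallel_size = 8 this gives max_dcp_size = 8 // 4 = 2, consistent with the `--decode-context-parallel-size 2` setting above.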
## Accuracy Evaluation
@@ -169,7 +169,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompt 1 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompts 1 --request-rate 1 --save-result --result-dir ./
```
After several minutes, you can get the performance evaluation result.

View File

@@ -70,7 +70,7 @@ such as IP addresses according to your actual environment.
5. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
# The TLS settings should be consistent across all nodes.
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -267,7 +267,7 @@ configuration.
## Benchmark
We recommend using the **aisbench** tool to assess performance. The test uses
We recommend using the **AISBench** tool to assess performance. The test uses
**Dataset A**, consisting of fully random data, with the following
configuration:
@@ -328,7 +328,7 @@ models = [
]
```
**Performance Benchmarking Commands**
**Performance Benchmarking Commands**:
```shell
ais_bench --models vllm_api_stream_chat \

View File

@@ -1,10 +1,10 @@
# Prefill-Decode Disaggregation (Deepseek)
## Getting Start
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide take one-by-one steps to verify these features with constrained resources.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide takes one-by-one steps to verify these features with constrained resources.
Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the ip of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs 16 chips to deploy one service instance.
Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
## Verify Multi-Node Communication Environment
@@ -48,7 +48,7 @@ cat /etc/hccn.conf
3. Get NPU IP Addresses
```bash
# Get virtual npu ip
# Get virtual NPU IP.
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
@@ -61,14 +61,14 @@ for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-i
5. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with virtual npu ip address)
# Execute on the target node (replace 'x.x.x.x' with virtual NPU IP address).
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
6. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
# The TLS settings should be consistent across all nodes
for i in {0..15}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -117,7 +117,7 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
5. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
# The TLS settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -136,7 +136,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running bridge network with Docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -173,7 +173,7 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
@@ -886,7 +886,7 @@ cd benchmark/
pip3 install -e ./
```
You need to canncel the http proxy before assessing performance, as following
You need to cancel the HTTP proxy before assessing performance, as follows:
```shell
# unset proxy
@@ -895,7 +895,7 @@ unset https_proxy
```
- You can place your datasets in the dir: `benchmark/ais_bench/datasets`
- You can change the configurationin the dir :`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take the ``vllm_api_stream_chat.py`` for examples
- You can change the configuration in the dir: `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take the `vllm_api_stream_chat.py` for example:
```python
models = [
@@ -943,7 +943,7 @@ Check service health using the proxy server endpoint.
curl http://192.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-moe",
"model": "ds_r1",
"prompt": "Who are you?",
"max_completion_tokens": 100,
"temperature": 0

View File

@@ -44,7 +44,7 @@ for i in {0..7}; do hccn_tool -i $i -ip -g;done
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address).
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
@@ -94,21 +94,21 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```
(Optional) Replace go install url if the network is poor
(Optional) Replace the Go install URL if the network is poor.
```shell
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
```
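To confirm the substitution took effect, a quick check (not a step from the original guide):

```bash
# The Go download URL in dependencies.sh should now point at the mirror.
grep -n 'golang.google.cn' dependencies.sh
```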
Install mpi
Install MPI.
```shell
apt-get install mpich libmpich-dev -y
@@ -120,7 +120,7 @@ Install the relevant dependencies. The installation of Go is not required.
bash dependencies.sh -y
```
Compile and install
Compile and install.
```shell
mkdir build
@@ -130,7 +130,7 @@ make -j
make install
```
Set environment variables
Set environment variables.
**Note:**
@@ -268,7 +268,7 @@ curl http://192.0.0.1:8080/v1/chat/completions \
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
{"type": "text", "text": "What is the text in the illustration?"}
]}
],
"max_completion_tokens": 100,

View File

@@ -64,7 +64,7 @@ export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
# Note: if you are running bridge network with Docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
--name $NAME \
--net=host \

View File

@@ -15,7 +15,7 @@ This document provides step-by-step guidance on how to deploy and benchmark the
| Comprehensive Examination | agieval |
| Multi-turn Dialogue | sharegpt |
The benchmarking tool used in this tutorial is AISbench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
The benchmarking tool used in this tutorial is AISBench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
## **Download vllm-ascend Image**
@@ -71,14 +71,14 @@ pip install arctic-inference
## **vLLM Instance Deployment**
Use the following command to start the container service instance. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to`suffix`. For this test, `num_speculative_tokens` is uniformly set to`3`.
Use the following command to start the container service instance. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to `suffix`. For this test, `num_speculative_tokens` is uniformly set to `3`.
```bash
# set the NPU device number
# set the NPU device number:
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# Enable the AIVector core to directly schedule ROCE communication
# Enable the AIVector core to directly schedule ROCE communication.
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable MLP prefetch for better performance.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
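# A hedged sketch (assumption; the original serve command is truncated in
# this excerpt): the speculative settings described above would be passed
# roughly like this, with the model path as a placeholder.
vllm serve Qwen/Qwen3-32B \
    --speculative-config '{"method": "suffix", "num_speculative_tokens": 3}'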