[main][Docs] Fix typos across documentation (#6728)

## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling and word-choice corrections (e.g., "Facotory" → "Factory", "certainty" → "determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main: 9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Cao Yi
2026-02-13 15:50:05 +08:00
committed by GitHub
parent b6bc3d2f9d
commit 6de207de88
50 changed files with 273 additions and 272 deletions

View File

@@ -1,6 +1,6 @@
# Long-Sequence Context Parallel (Deepseek)
## Getting Start
## Getting Started
:::{note}
The context parallel feature is currently only supported on the Atlas A3 device, and will be supported on Atlas A2 in the future.
@@ -8,13 +8,13 @@ Context parallel feature currently is only supported on Atlas A3 device, and wil
vLLM-Ascend now supports long sequences with context parallel options. This guide walks through the steps to verify these features with constrained resources.
Take the Deepseek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 servers to deploy the “1P1D” architecture. Node p is deployed across multiple machines, while node d is deployed on a single machine. Assume the ip of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1). On each server, use 8 NPUs 16 chips to deploy one service instance.In the current example, we will enable the context parallel feature on node p to improve TTFT. Although enabling the DCP feature on node d can reduce memory usage, it would introduce additional communication and small operator overhead. Therefore, we will not enable the DCP feature on node d.
Take the Deepseek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 servers to deploy the “1P1D” architecture. Node p is deployed across multiple machines, while node d is deployed on a single machine. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder server is 192.0.0.3 (decoder 1). On each server, use 8 NPUs (16 chips) to deploy one service instance. In the current example, we will enable the context parallel feature on node p to improve TTFT. Although enabling the DCP feature on node d can reduce memory usage, it would introduce additional communication and small-operator overhead. Therefore, we will not enable the DCP feature on node d.
## Environment Preparation
### Model Weight
- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
- `DeepSeek-V3.1_w8a8mix_mtp` (Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -24,9 +24,9 @@ Refer to [verify multi-node communication environment](../../installation.md#ver
### Installation
You can use our official docker image to run `DeepSeek-V3.1` directly.
You can use our official Docker image to run `DeepSeek-V3.1` directly.
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
Select an image based on your machine type and start the Docker image on your node, refer to [using Docker](../../installation.md#set-up-using-docker).
```{code-block} bash
:substitutions:
@@ -35,7 +35,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the available ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -273,7 +273,7 @@ vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
:::::
2. Prefill master node `proxy.sh` scripts
2. Prefill master node `proxy.sh` script
```shell
python load_balance_proxy_server_example.py \
@@ -289,7 +289,7 @@ python load_balance_proxy_server_example.py \
8004
```
3. run proxy
3. Run proxy
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
@@ -318,12 +318,12 @@ The parameters are explained as follows:
"cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
- "cudagraph_capture_sizes": represents different levels of graph modes. The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. In the graph mode, the input for graphs at different levels is fixed, and inputs between levels are automatically padded to the next level. Currently, the default setting is recommended. Only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tensor-parallel-size > 1.
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the pd co-locate deployment scenario. It will be removed in the future.
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the PD co-locate deployment scenario. It will be removed in the future.
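To see how these pieces fit together, here is a minimal sketch combining the two environment variables above with the graph-mode setting. It assumes `cudagraph_mode` is passed through vLLM's `--compilation-config` JSON; verify the exact flag against your vLLM version.

```bash
# Sketch only: wiring the options described above into one launch.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1        # Flashcomm1 (MoE, tensor-parallel-size > 1)
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1  # required in the PD architecture
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```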
**Notice:**
- tensor-parallel-size needs to be divisible by decode-context-parallel-size.
- decode-context-parallel-size must less than or equal to tensor-parallel-size.
- decode-context-parallel-size must be less than or equal to tensor-parallel-size.
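A minimal shell check for the two constraints above, with illustrative values:

```bash
# Validate: TP divisible by DCP, and DCP <= TP (example values).
TP=16; DCP=8
if (( TP % DCP == 0 )) && (( DCP <= TP )); then
  echo "OK: tensor-parallel-size=$TP, decode-context-parallel-size=$DCP"
else
  echo "Invalid: TP must be divisible by DCP, and DCP must not exceed TP" >&2
fi
```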
## Accuracy Evaluation
@@ -361,7 +361,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompt 20 --request-rate 0 --save-result --result-dir ./
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompts 20 --request-rate 0 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.
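If you are unsure which spelling of the flag your installed version accepts, the CLI help output settles it (assuming the standard argparse-style `--help`):

```bash
vllm bench serve --help | grep -- '--num-prompts'
```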

View File

@@ -1,16 +1,16 @@
# Long-Sequence Context Parallel (Qwen3-235B-A22B)
## Getting Start
## Getting Started
vLLM-Ascend now supports long-sequence context parallel. This guide walks through the steps to verify these features with constrained resources.
Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single node "pd co-locate" architecture.
Using the `Qwen3-235B-A22B-w8a8` (Quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single-node "PD co-locate" architecture.
## Environment Preparation
### Model Weight
- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
- `Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -25,7 +25,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the available ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -65,7 +65,7 @@ docker run --rm \
### Single-node Deployment
`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16).
Quantized version need to start with parameter `--quantization ascend`.
Quantized version needs to start with parameter `--quantization ascend`.
Run the following script to execute online 128k inference.
@@ -111,8 +111,8 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
The parameters are explained as follows:
- `--tensor-parallel-size 8` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size` 2 are common settings for prefill context parallelism PCP) sizes.
- `--decode-context-parallel-size` 2 are common settings for decode context parallelism DCP) sizes.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 2` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
@@ -131,7 +131,7 @@ The parameters are explained as follows:
**Notice:**
- tp_size needs to be divisible by dcp_size.
- decode context parallel size must less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.
- decode context parallel size must be less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.
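The `max_dcp_size` bound is easy to check with shell arithmetic; the values below are examples (read `total_num_kv_heads` from the model's `config.json`):

```bash
# max_dcp_size = tensor_parallel_size // total_num_kv_heads (example values).
TP=8; KV_HEADS=4; DCP=2
MAX_DCP=$(( TP / KV_HEADS ))
if (( TP % DCP == 0 && DCP <= MAX_DCP )); then
  echo "OK: dcp_size=$DCP (max allowed: $MAX_DCP)"
else
  echo "Invalid: need tp_size % dcp_size == 0 and dcp_size <= $MAX_DCP" >&2
fi
```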
## Accuracy Evaluation
@@ -169,7 +169,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompt 1 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompts 1 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.

View File

@@ -70,7 +70,7 @@ such as IP addresses according to your actual environment.
5. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
# The TLS settings should be consistent across all nodes.
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
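To compare the setting across machines in one pass, the same loop can be run over SSH; the hostnames below are placeholders:

```bash
# Collect the TLS switch value from every node (replace hostnames with yours).
for host in node0 node1 node2; do
  echo "== $host =="
  ssh "$host" 'for i in {0..7}; do hccn_tool -i $i -tls -g; done | grep switch'
done
```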
@@ -267,7 +267,7 @@ configuration.
## Benchmark
We recommend using the **aisbench** tool to assess performance. The test uses
We recommend using the **AISBench** tool to assess performance. The test uses
**Dataset A**, consisting of fully random data, with the following
configuration:
@@ -328,7 +328,7 @@ models = [
]
```
**Performance Benchmarking Commands**
**Performance Benchmarking Commands**:
```shell
ais_bench --models vllm_api_stream_chat \

View File

@@ -1,10 +1,10 @@
# Prefill-Decode Disaggregation (Deepseek)
## Getting Start
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide take one-by-one steps to verify these features with constrained resources.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks through the steps to verify these features with constrained resources.
Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the ip of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs 16 chips to deploy one service instance.
Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
## Verify Multi-Node Communication Environment
@@ -48,7 +48,7 @@ cat /etc/hccn.conf
3. Get NPU IP Addresses
```bash
# Get virtual npu ip
# Get virtual NPU IP.
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
@@ -61,14 +61,14 @@ for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-i
5. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with virtual npu ip address)
# Execute on the target node (replace 'x.x.x.x' with virtual NPU IP address).
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
6. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
# The TLS settings should be consistent across all nodes
for i in {0..15}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -117,7 +117,7 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
5. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
# The TLS settings should be consistent across all nodes.
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -136,7 +136,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running bridge network with Docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -173,7 +173,7 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
@@ -886,7 +886,7 @@ cd benchmark/
pip3 install -e ./
```
You need to canncel the http proxy before assessing performance, as following
You need to unset the HTTP proxy before assessing performance, as follows:
```shell
# unset proxy
@@ -895,7 +895,7 @@ unset https_proxy
```
- You can place your datasets in the dir: `benchmark/ais_bench/datasets`
- You can change the configurationin the dir :`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take the ``vllm_api_stream_chat.py`` for examples
- You can change the configuration in the dir `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take ``vllm_api_stream_chat.py`` for example:
```python
models = [
@@ -943,7 +943,7 @@ Check service health using the proxy server endpoint.
curl http://192.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-moe",
"model": "ds_r1",
"prompt": "Who are you?",
"max_completion_tokens": 100,
"temperature": 0

View File

@@ -44,7 +44,7 @@ for i in {0..7}; do hccn_tool -i $i -ip -g;done
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address).
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
@@ -94,21 +94,21 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```
(Optional) Replace go install url if the network is poor
(Optional) Replace the Go download URL if the network is poor.
```shell
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
```
Install mpi
Install MPI.
```shell
apt-get install mpich libmpich-dev -y
@@ -120,7 +120,7 @@ Install the relevant dependencies. The installation of Go is not required.
bash dependencies.sh -y
```
Compile and install
Compile and install.
```shell
mkdir build
@@ -130,7 +130,7 @@ make -j
make install
```
Set environment variables
Set environment variables.
**Note:**
@@ -268,7 +268,7 @@ curl http://192.0.0.1:8080/v1/chat/completions \
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
{"type": "text", "text": "What is the text in the illustration?"}
]}
],
"max_completion_tokens": 100,

View File

@@ -64,7 +64,7 @@ export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the available ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \

View File

@@ -15,7 +15,7 @@ This document provides step-by-step guidance on how to deploy and benchmark the
| Comprehensive Examination | agieval |
| Multi-turn Dialogue | sharegpt |
The benchmarking tool used in this tutorial is AISbench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
The benchmarking tool used in this tutorial is AISBench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
## **Download vllm-ascend Image**
@@ -71,14 +71,14 @@ pip install arctic-inference
## **vLLM Instance Deployment**
Use the following command to start the container service instance. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to`suffix`. For this test, `num_speculative_tokens` is uniformly set to`3`.
Use the following command to start the container service instance. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to `suffix`. For this test, `num_speculative_tokens` is uniformly set to `3`.
```bash
# set the NPU device number
# Set the NPU device number:
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# Enable the AIVector core to directly schedule ROCE communication
# Enable the AIVector core to directly schedule ROCE communication.
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable MLP prefetch for better performance.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
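# Building on the environment settings above, a serve command with Suffix
# Decoding enabled might look like this sketch. The model path and TP size
# are placeholders; the --speculative-config keys come from the text above.
vllm serve /path/to/Qwen3-32B \
  --tensor-parallel-size 4 \
  --speculative-config '{"method": "suffix", "num_speculative_tokens": 3}'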

View File

@@ -69,7 +69,7 @@ vllm serve Qwen/Qwen3-0.6B \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
@@ -99,7 +99,7 @@ vllm serve Qwen/Qwen2.5-7B-Instruct \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
@@ -129,7 +129,7 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
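# For reference, a complete request looks like this; the model name matches
# the serve command above, and the prompt/limits are illustrative only:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-VL-3B-Instruct", "prompt": "Hello, my name is", "max_tokens": 32, "temperature": 0}'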

View File

@@ -39,7 +39,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the available ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -70,7 +70,7 @@ If you want to deploy multi-node environment, you need to set up environment on
## Deployment
### Service-oriented Deployment
### Service-oriented Deployment
- `DeepSeek-R1-W8A8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes.
@@ -303,7 +303,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.

View File

@@ -10,7 +10,7 @@ DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinkin
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.
The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`
The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`.
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
@@ -30,7 +30,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- `Quantization method`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantize the model.
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Verify Multi-node Communication (Optional)
@@ -52,7 +52,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the available ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -580,7 +580,7 @@ The parameters are explained as follows:
- `multistream_overlap_shared_expert: true`: When the Tensor Parallelism (TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream is enabled to overlap the computation process of shared experts for improved efficiency.
- `lmhead_tensor_parallel_size: 16`: When the Tensor Parallelism (TP) size of the decode node is 1, this parameter allows the TP size of the LMHead embedding layer to be greater than 1, which is used to reduce the computational load of each card on the LMHead embedding layer.
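For reference, these two options usually travel in the Ascend `--additional-config` JSON; the sketch below assumes that placement and reuses the weight path from this guide, so verify the flag against your actual launch scripts:

```bash
# Assumed placement of the two options above inside --additional-config.
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
  --additional-config '{"multistream_overlap_shared_expert": true, "lmhead_tensor_parallel_size": 16}'
```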
6. run server for each node
6. Run the server on each node:
```shell
# p0
@@ -593,7 +593,7 @@ python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
7. Run proxy `proxy.sh` scripts on the prefill master node
7. Run the `proxy.sh` script on the prefill master node
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
@@ -716,7 +716,7 @@ There are three `vllm bench` subcommands:
Take the `serve` subcommand as an example. Run the code as follows.
```shell
vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.

View File

@@ -21,7 +21,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `DeepSeek-V3.2` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. A BF16 model weight is not yet available.
- `DeepSeek-V3.2-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Verify Multi-node Communication (Optional)
@@ -390,7 +390,7 @@ We'd like to show the deployment guide of `DeepSeek-V3.2` on multi-node environm
Before you start, please
1. prepare the script `launch_online_dp.py` on each node.
1. prepare the script `launch_online_dp.py` on each node:
```python
import argparse
@@ -905,7 +905,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
## Function Call

View File

@@ -2,9 +2,9 @@
## Introduction
GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundational models specifically designed for agent applications
GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundational models specifically designed for agent applications.
The `GLM-4.5` model is first supported in `vllm-ascend:v0.10.0rc1`
The `GLM-4.5` model is first supported in `vllm-ascend:v0.10.0rc1`.
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
@@ -25,7 +25,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `GLM-4.6-w8a8` (Quantized version without MTP): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vLLM did not support GLM-4.6 MTP in October, we do not provide an MTP version. Support was added the following month, so you can use the following quantization scheme to add MTP weights to the quantized weights.
- `Quantization method`: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can use these methods to quantize the model.
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Installation
@@ -43,7 +43,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the available ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \

View File

@@ -78,7 +78,7 @@ Single-node deployment is recommended.
### Prefill-Decode Disaggregation
Not supported yet
Not supported yet.
## Functional Verification
@@ -190,7 +190,7 @@ pip install safetensors
```
:::{note}
The OpenCV component may be missing
The OpenCV component may be missing:
```bash
apt-get update
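# What to install depends on the error message; both lines below are common
# fixes but are assumptions, not part of the original guide:
apt-get install -y libgl1            # missing system library (libGL.so.1)
pip install opencv-python-headless   # missing Python binding (cv2)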

View File

@@ -563,7 +563,7 @@ The performance evaluation must be conducted in an online mode. Take the `serve`
:sync: single
```shell
vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
::::
@@ -571,7 +571,7 @@ vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --rand
:sync: multi
```shell
vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
::::

View File

@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen2.5-7B-Instruct`(BF16 version): require 1 910B4 cards(32G × 1). [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
- `Qwen2.5-7B-Instruct` (BF16 version): requires 1 910B4 card (32G × 1). [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
@@ -171,7 +171,7 @@ vllm bench serve \
--model ./Qwen2.5-7B-Instruct/ \
--dataset-name random \
--random-input 200 \
--num-prompt 200 \
--num-prompts 200 \
--request-rate 1 \
--save-result \
--result-dir ./perf_results/

View File

@@ -95,7 +95,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
:::
`--allowed-local-media-path` is optional, only set it if you need infer model with local media file
`--allowed-local-media-path` is optional; only set it if you need to run inference on local media files.
`--gpu-memory-utilization` should not be set manually unless you know what this parameter does.
@@ -118,11 +118,11 @@ vllm serve ${MODEL_PATH}\
--no-enable-prefix-caching
```
`--tensor_parallel_size` no need to set for this 7B model, but if you really need tensor parallel, tp size can be one of `1\2\4`
`--tensor_parallel_size` does not need to be set for this 7B model, but if you really need tensor parallelism, the TP size can be `1`, `2`, or `4`.
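Putting these notes together, a typical launch looks something like the sketch below; the media path is a placeholder and the flags are the ones discussed above:

```bash
# Sketch combining the flags discussed above (adjust paths to your setup).
vllm serve ${MODEL_PATH} \
  --allowed-local-media-path /path/to/local/media \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching
```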
### Prefill-Decode Disaggregation
Not supported yet
Not supported yet.
## Functional Verification
@@ -145,7 +145,7 @@ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/j
"content": [
{
"type": "text",
"text": "What is the text in the illustrate?"
"text": "What is the text in the illustration?"
},
{
"type": "image_url",
@@ -170,7 +170,7 @@ If you query the server successfully, you can see the info shown below (client):
## Accuracy Evaluation
Qwen2.5-Omni on vllm-ascend has been test on AISBench.
Qwen2.5-Omni on vllm-ascend has been tested on AISBench.
### Using AISBench
@@ -204,7 +204,7 @@ There are three `vllm bench` subcommands:
Take the `serve` subcommand as an example. Run the code as follows.
```shell
vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.

View File

@@ -18,10 +18,10 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A232G * 8nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A232G * 8nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
- `Qwen3-235B-A22B` (BF16 version): requires 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node, or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
- `Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node, or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Verify Multi-node Communication (Optional)
@@ -46,7 +46,7 @@ Select an image based on your machine type and start the docker image on your no
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the available ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -87,7 +87,7 @@ If you want to deploy multi-node environment, you need to set up environment on
### Single-node Deployment
`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 Atlas 800 A364G*16)、 1 Atlas 800 A264G*8.
`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 Atlas 800 A3 (64G × 16) or 1 Atlas 800 A2 (64G × 8).
Quantized version needs to start with parameter `--quantization ascend`.
Run the following script to execute online 128k inference.
@@ -310,7 +310,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.
@@ -328,7 +328,7 @@ In this section, we provide simple scripts to re-produce our latest performance.
- HDK/driver 25.3.RC1
- triton_ascend 3.2.0
### Single Node A3 64G*16
### Single Node A3 (64G*16)
Example server scripts:
@@ -394,7 +394,7 @@ Note:
### Three Node A3 -- PD disaggregation
On three Atlas 800 A364G*16server, we recommend to use one node as one prefill instance and two nodes as one decode instance. Example server scripts:
On three Atlas 800 A3 (64G × 16) servers, we recommend using one node as one prefill instance and two nodes as one decode instance. Example server scripts:
Prefill Node 1
```shell

View File

@@ -304,7 +304,7 @@ Take the `serve` as an example. Run the code as follows.
- `/model/Qwen3-32B-W8A8` is the model path; replace this with your actual path.
```shell
vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.
@@ -389,4 +389,4 @@ If this list is not manually specified, it will be filled with a series of evenl
Therefore, as in the real-world scenario above, when adjusting the benchmark request concurrency, we always ensure that the concurrency is actually included in the cudagraph_capture_sizes list. This way, padding operations are essentially avoided during the decode phase, ensuring the reliability of the experimental data.
Its important to note that if you enable FlashComm_v1, the values in this list must be integer multiples of the TP size. Any values that do not meet this condition will be automatically filtered out. Therefore, I recommend incrementally adding concurrency based on the TP size after enabling FlashComm_v1.
It's important to note that if you enable FlashComm_v1, the values in this list must be integer multiples of the TP size. Any values that do not meet this condition will be automatically filtered out. Therefore, we recommend incrementally adding concurrency based on the TP size after enabling FlashComm_v1.
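A quick way to see which capture sizes survive that filter, with an example TP value:

```bash
# Keep only cudagraph capture sizes that are integer multiples of TP.
TP=4
for s in 1 2 4 8 16 24 32 40; do
  (( s % TP == 0 )) && echo "$s"    # with TP=4, prints 4 8 16 24 32 40
done
```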

View File

@@ -164,7 +164,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.

View File

@@ -285,7 +285,7 @@ python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-s
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt
vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After execution, you can get the result. Here is the result of `Qwen3-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1, for reference only.

View File

@@ -270,7 +270,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.