[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main: bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
@@ -69,6 +69,7 @@ docker run --rm \
If you want to deploy a multi-node environment, you need to set up the environment on each node.

## Deployment

### Service-oriented Deployment
- `DeepSeek-R1-W8A8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes.
@@ -119,6 +120,7 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
**Notice:**
The parameters are explained as follows:
- Setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO=1` enables a fusion operator that can significantly improve performance, though it requires more NPU memory. It is therefore recommended to enable this option when sufficient NPU memory is available.
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in scenarios where PD is separated.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
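Both settings are plain environment variables exported before launching the server. A minimal sketch (the model name is taken from the command above; all other serve flags are omitted):

```shell
# illustrative only: enable the optional optimizations before starting vLLM
export VLLM_ASCEND_ENABLE_MLAPO=1           # fusion operator, needs extra NPU memory
export VLLM_ASCEND_BALANCE_SCHEDULING=1     # balance scheduling, may affect TTFT
vllm serve vllm-ascend/DeepSeek-R1-W8A8
```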
@@ -284,6 +286,7 @@ lm_eval \
3. After execution, you can get the result.

## Performance

### Using AISBench
Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
@@ -295,6 +298,7 @@ Run performance evaluation of `DeepSeek-R1-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
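As a quick illustration (the model name and dataset options below are placeholders to adapt to your own deployment), a `serve` benchmark against an already running server looks roughly like this:

```shell
# hypothetical example: benchmark the online server with random prompts
vllm bench serve \
  --model vllm-ascend/DeepSeek-R1-W8A8 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 200
```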
@@ -23,6 +23,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
## Environment Preparation

### Model Weight
- `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8`(Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
@@ -131,6 +132,7 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
**Notice:**
The parameters are explained as follows:
- Setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO=1` enables a fusion operator that can significantly improve performance, though it requires more NPU memory. It is therefore recommended to enable this option when sufficient NPU memory is available.
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in scenarios where PD is separated.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
@@ -257,7 +259,8 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).
Take Atlas 800 A3 (64G × 16) as an example: we recommend deploying 2P1D (4 nodes) rather than 1P1D (2 nodes), because there is not enough NPU memory to serve high concurrency in the 1P1D case.
- `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` requires 4 Atlas 800 A3 (64G × 16).
To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests.
@@ -659,6 +662,7 @@ curl http://<node0_ip>:<port>/v1/completions \
Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result. Here is the result of `DeepSeek-V3.1-w8a8-mtp-QuaRot` in `vllm-ascend:0.11.0rc1`, for reference only.
@@ -669,6 +673,7 @@ Here are two accuracy evaluation methods.
| gsm8k | - | accuracy | gen | 96.28 | 1 Atlas 800 A3 (64G × 16) |
### Using Language Model Evaluation Harness

Not tested yet.

## Performance
@@ -684,6 +689,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -153,9 +153,10 @@ In this tutorial, we suppose you downloaded the model weight to `/root/.cache/`.
We'd like to show the deployment guide for `DeepSeek-V3.2` in a multi-node environment with 1P1D for better performance.

Before you start, please

1. prepare the script `launch_online_dp.py` on each node.

```python
import argparse
import multiprocessing
import os
@@ -260,7 +261,7 @@ Before you start, please
1. Prefill node 0

```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.105 # change to your own ip
@@ -333,7 +334,7 @@ Before you start, please
2. Prefill node 1

```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.113 # change to your own ip
@@ -406,7 +407,7 @@ Before you start, please
3. Decode node 0

```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.117 # change to your own ip
@@ -484,7 +485,7 @@ Before you start, please
4. Decode node 1

```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.181 # change to your own ip
@@ -564,28 +565,28 @@ Once the preparation is done, you can start the server with the following comman
1. Prefill node 0

```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```

2. Prefill node 1

```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```

3. Decode node 0

```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```

4. Decode node 1

```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
@@ -656,6 +657,7 @@ Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -17,6 +17,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
## Environment Preparation

### Model Weight

- `GLM-4.5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5).
- `GLM-4.6`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6).
- `GLM-4.7`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7).
@@ -102,6 +103,7 @@ vllm serve /weight/glm4.5_w8a8_with_float_mtp \
**Notice:**
The parameters are explained as follows:

- For single-node deployment, we recommend using `dp1tp16` and turning off expert parallel in low-latency scenarios.
- `--async-scheduling` Asynchronous scheduling is a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.
@@ -118,6 +120,7 @@ Not test yet.
Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result. Here is the result of `GLM4.6` in `vllm-ascend:main` (after `vllm-ascend:0.13.0rc1`), for reference only.
@@ -144,6 +147,7 @@ Run performance evaluation of `GLM-4.x` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
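As a sketch, an offline `throughput` run could look like the following (the model path is the one used in the serve command above; prompt lengths and request count are placeholders):

```shell
# hypothetical example: offline throughput benchmark with random prompts
vllm bench throughput \
  --model /weight/glm4.5_w8a8_with_float_mtp \
  --dataset-name random \
  --input-len 1024 \
  --output-len 512 \
  --num-prompts 100
```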
@@ -44,6 +44,7 @@ docker run --rm \
```

## Verify the Quantized Model
Please be advised to edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` into `["MoE"]` in `config.json` of original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking).
```json
@@ -19,6 +19,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
### Model Weight

These models require 1 Atlas 800I A2 (64G × 8) node or 1 Atlas 800 A3 (64G × 16) node:

- `Qwen2.5-VL-3B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)
- `Qwen2.5-VL-7B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)
- `Qwen2.5-VL-32B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)
@@ -188,7 +189,7 @@ print(generated_text)
If you run this script successfully, you can see the info shown below:

```shell
**Visual Components:**

1. **Abstract Geometric Icon (Left Side):**
@@ -284,7 +285,7 @@ print(generated_text)
If you run this script successfully, you can see the info shown below:

```shell
The image displays a logo and text related to the Qwen model, which is an artificial intelligence (AI) language model developed by Alibaba Cloud. Here is a detailed description of the elements in the image:
### **1. Logo:**
@@ -469,6 +470,7 @@ INFO 12-05 08:51:20 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 toke
### Using Language Model Evaluation Harness

The accuracy of some models is already within our CI monitoring scope, including:

- `Qwen2.5-VL-7B-Instruct`
- `Qwen3-VL-8B-Instruct`
@@ -547,6 +549,7 @@ lm_eval \
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -153,11 +153,13 @@ Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy re
Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

### Using vLLM Benchmark

Run performance evaluation of `Qwen2.5-7B-Instruct` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -196,6 +196,7 @@ Run performance evaluation of `Qwen2.5-Omni-7B` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -124,19 +124,21 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
```

**Notice:**
- [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) originally only supports 40960 context(max_position_embeddings). If you want to use it and its related quantization weights to run long seqs (such as 128k context), it is required to use yarn rope-scaling technique.
- For vLLM versions the same as or newer than `v0.12.0`, use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`.
- For vLLM versions below `v0.12.0`, use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`.
If you are using weights like [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) which originally supports long contexts, there is no need to add this parameter.
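For instance, on vLLM v0.12.0 or newer, the yarn override from the note above attaches to the serve command roughly as follows (the model name is the one used in this guide; all other flags are omitted):

```shell
# illustrative only: extend the context of the original Qwen3-235B-A22B weights with yarn
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
  --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'
```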
The parameters are explained as follows:
- `--data-parallel-size` 1 and `--tensor-parallel-size` 8 are common settings for data parallelism (DP) and tensor parallelism (TP) sizes.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
  - (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
  - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
  - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
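As a rough illustration of how these limits interact (the numbers are hypothetical): with `--max-num-seqs 16` and `--data-parallel-size 4`, about 64 concurrent requests can be scheduled at once, and on a 64 GB NPU with `--gpu-memory-utilization 0.9` and a profile-run peak of, say, 40 GB, roughly 0.9 × 64 GB − 40 GB ≈ 17.6 GB would remain for the kv_cache.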
@@ -147,7 +149,8 @@ The parameters are explained as follows:
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.
### Multi-node Deployment with MP (Recommended)

Assume you have Atlas 800 A3 (64G*16) nodes (or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.

Node 0
@@ -239,7 +242,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B \
If the service starts successfully, the following information will be displayed on node 0:

```shell
INFO: Started server process [44610]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -298,6 +301,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -392,6 +396,7 @@ Reference test results:
| 720 | 144 | 4717.45 | 48.69 | 2761.72 |

Note:
1. Setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` enables MoE fused operators that reduce time consumption of MoE in both prefill and decode. This is an experimental feature which only supports W8A8 quantization on Atlas A3 servers now. If you encounter any problems when using this feature, you can disable it by setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` and update issues in vLLM-Ascend community.
2. Here we disable prefix cache because of random datasets. You can enable prefix cache if requests have long common prefix.
@@ -597,7 +602,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
PD proxy:

```shell
python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-port 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000
```
@@ -624,4 +629,5 @@ Reference test results:
| 2880 | 576 | 3735.98 | 52.07 | 8593.44 |

Note:
1. We recommend setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=2` in this scenario (typically EP32 for Qwen3-235B). This enables a different MoE fusion operator.
@@ -36,7 +36,7 @@ docker run --rm \
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see <https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W4A4>
:::

```bash
@@ -1,6 +1,7 @@
# Qwen3-8B-W4A8

## Run Docker Container

:::{note}
w4a8 quantization feature is supported by v0.9.1rc2 and later.
:::
@@ -27,9 +28,10 @@ docker run --rm \
```

## Install modelslim and Convert Model

:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see <https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8>
:::

```bash
@@ -69,6 +71,7 @@ python quant_qwen.py \
```

## Verify the Quantized Model

The converted model files look like:

```bash
@@ -11,6 +11,7 @@ This document will show the main verification steps of the model, including supp
The Qwen3 Dense models are first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429)

## **Node**

This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.

## Supported Features
@@ -105,15 +106,17 @@ If you want to deploy multi-node environment, you need to set up environment on
In this section, we will demonstrate best practices for adjusting hyperparameters in vLLM-Ascend to maximize inference throughput performance. By tailoring service-level configurations to fit different use cases, you can ensure that your system performs optimally across various scenarios. We will guide you through how to fine-tune hyperparameters based on observed phenomena, such as max_model_len, max_num_batched_tokens, and cudagraph_capture_sizes, to achieve the best performance.
The specific example scenario is as follows:

- The machine environment is an Atlas 800 A3 (64G*16)
- The LLM is Qwen3-32B-W8A8
- The data scenario is a fixed-length input of 3.5K and an output of 1.5K.
- The parallel configuration requirement is DP=1&TP=4
- If the machine environment is an **Atlas 800I A2(64G*8)**, the deployment approach stays identical.

### Run docker container

#### **Node**

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- v0.11.0rc2-a3 is the image tag, replace this with your actual tag.
- Replace this with your actual port: '-p 8113:8113'.
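A container start command matching these notes could look roughly as follows (the image repository and device mounts follow the usual vllm-ascend examples and are assumptions to adapt to your environment):

```shell
# illustrative sketch only: start the container with NPU devices and the model path mounted
docker run --rm -it --name vllm-ascend-qwen3-32b \
  --device /dev/davinci0 --device /dev/davinci1 --device /dev/davinci2 --device /dev/davinci3 \
  --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /model/Qwen3-32B-W8A8:/model/Qwen3-32B-W8A8 \
  -p 8113:8113 \
  quay.io/ascend/vllm-ascend:v0.11.0rc2-a3 bash
```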
@@ -191,6 +194,7 @@ vllm serve /model/Qwen3-32B-W8A8 \
```

#### **Node**

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.

- If the model is not a quantized model, remove the `--quantization ascend` parameter.
@@ -219,6 +223,7 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso
Run the following script to execute offline inference on multi-NPU.

#### **Node**

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.

- If the model is not a quantized model, remove the `quantization="ascend"` parameter.
@@ -290,6 +295,7 @@ Run performance evaluation of `Qwen3-32B-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -297,6 +303,7 @@ There are three `vllm bench` subcommand:
Take the `serve` as an example. Run the code as follows.

#### **Node**

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.

```shell
@@ -306,19 +313,23 @@ vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port
After about several minutes, you can get the performance evaluation result.

## Key Optimization Points

In this section, we will cover the key optimization points that can significantly improve the performance of Qwen Dense models. These techniques are designed to enhance throughput and efficiency across various scenarios.

### 1. Rope Optimization
Rope optimization enhances the model's efficiency by modifying the position encoding process. Specifically, it ensures that the cos_sin_cache and the associated index selection operation are only performed during the first layer of the forward pass. For subsequent layers, the position encoding is directly reused, eliminating redundant calculations and significantly speeding up inference in decode phase.
This optimization is enabled by default and does not require any additional environment variables to be set.

### 2. AddRMSNormQuant Fusion
AddRMSNormQuant fusion merges the residual Add, RMSNorm, and quantization operations into a single kernel, allowing for more efficient memory access and computation, thereby enhancing throughput.

This optimization is enabled by default and does not require any additional environment variables to be set.

### 3. FlashComm_v1
FlashComm_v1 significantly improves performance in large-batch scenarios by decomposing the traditional allreduce collective communication into reduce-scatter and all-gather. This breakdown helps reduce the computation of the RMSNorm token dimensions, leading to more efficient processing. In quantization scenarios, FlashComm_v1 also reduces the communication overhead by decreasing the bit-level data transfer, which further minimizes the end-to-end latency during the prefill phase.
It is important to note that the decomposition of the allreduce communication into reduce-scatter and all-gather operations only provides benefits in high-concurrency scenarios, where there is no significant communication degradation. In other cases, this decomposition may result in noticeable performance degradation. To mitigate this, the current implementation uses a threshold-based approach, where FlashComm_v1 is only enabled if the actual token count for each inference schedule exceeds the threshold. This ensures that the feature is only activated in scenarios where it improves performance, avoiding potential degradation in lower-concurrency situations.
@@ -326,11 +337,13 @@ It is important to note that the decomposition of the allreduce communication in
This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` to be enabled.

### 4. Matmul and ReduceScatter Fusion
Once FlashComm_v1 is enabled, an additional optimization can be applied. This optimization fuses matrix multiplication and ReduceScatter operations, along with tiling optimization. The Matmul computation is treated as one pipeline, while the ReduceScatter and dequant operations are handled in a separate pipeline. This approach significantly reduces communication steps, improves computational efficiency, and allows for better resource utilization, resulting in enhanced throughput, especially in large-scale distributed environments.
This optimization is automatically enabled once FlashComm_v1 is activated. However, due to an issue with performance degradation in small-concurrency scenarios after this fusion, a threshold-based approach is currently used to mitigate this problem. The optimization is only applied when the token count exceeds the threshold, ensuring that it is not enabled in cases where it could negatively impact performance.
### 5. Weight Prefetching
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution.
In dense model scenarios, the MLP's gate_up_proj and down_proj linear layers often exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as RMSNorm and SiLU, before the MLP. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the MLP computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
@@ -340,16 +353,19 @@ It is important to emphasize that, since we use vector computations to hide the
This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` to be enabled.
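This switch (like `VLLM_ASCEND_ENABLE_FLASHCOMM1` from item 3) is a plain environment variable set before launching the server, for example:

```shell
# illustrative: enable FlashComm_v1 and MLP weight prefetching for this shell session
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
```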
### 6. Zerolike Elimination
This elimination removes unnecessary operations related to zero-like tensors in Attention forward, improving the efficiency of matrix operations and reducing memory usage.
This optimization is enabled by default and does not require any additional environment variables to be set.

### 7. FullGraph Optimization
ACLGraph offers several key optimizations to improve model execution efficiency. By replaying the entire model execution graph at once, we significantly reduce dispatch latency compared to multiple smaller replays. This approach also stabilizes multi-device performance, as capturing the model as a single static graph mitigates dispatch fluctuations across devices. Additionally, consolidating graph captures frees up streams, allowing for the capture of more graphs and optimizing resource usage, ultimately leading to improved system efficiency and reduced overhead.
The configuration `compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}` is used when starting the service. This setup is necessary to enable the ACL graph's full decode-only mode.
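When launching via the CLI, this can be passed as a JSON value, as in the following sketch (the model path is the example used throughout this guide):

```shell
# illustrative: enable the full decode-only graph mode at startup
vllm serve /model/Qwen3-32B-W8A8 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```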
### 8. Asynchronous Scheduling
Asynchronous scheduling is a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.
This optimization is enabled by setting `--async-scheduling`.
@@ -359,16 +375,19 @@ This optimization is enabled by setting `--async-scheduling`.
Building on the specific example scenarios outlined earlier, this section highlights the key tuning points that played a crucial role in achieving optimal performance. By focusing on the most impactful adjustments to hyperparameters and optimizations, we’ll emphasize the strategies that can be leveraged to maximize throughput, minimize latency, and ensure efficient resource utilization in various environments. These insights will help guide you in fine-tuning your own configurations for the best possible results.
### 1. Prefetch Buffer Size
Setting the right prefetch buffer size is essential for optimizing weight loading and the size of this prefetch buffer is directly related to the time that can be hidden by vector computations. To achieve a near-perfect overlap between the prefetch and computation streams, you can flexibly adjust the buffer size by profiling and observing the degree of overlap at different buffer sizes.
For example, in the real-world scenario mentioned above, I set the prefetch buffer size for the gate_up_proj and down_proj in the MLP to 18MB. The reason for this is that, at this value, the vector computations of RMSNorm and SiLU can effectively hide the prefetch stream, thereby accelerating the Matmul computations of the two linear layers.
### 2. Max-num-batched-tokens
The max-num-batched-tokens parameter determines the maximum number of tokens that can be processed in a single batch. Adjusting this value helps to balance throughput and memory usage. Setting this value too small can negatively impact end-to-end performance, as fewer tokens are processed per batch, potentially leading to inefficiencies. Conversely, setting it too large increases the risk of Out of Memory (OOM) errors due to excessive memory consumption.
In the above real-world scenario, we not only conducted extensive testing to determine the most cost-effective value, but also took into account the accumulation of decode tokens when enabling chunked prefill. If the value is set too small, a single request may be chunked multiple times, and during the early stages of inference, a batch may contain only a small number of decode tokens. This can result in the end-to-end throughput falling short of expectations.
### 3. Cudagraph_capture_sizes
The cudagraph_capture_sizes parameter controls the granularity of graph captures during the inference process. Adjusting this value determines how much of the computation graph is captured at once, which can significantly impact both performance and memory usage.
If this list is not manually specified, it will be filled with a series of evenly distributed values, which typically ensures good performance. However, if you want to fine-tune it further, manually specifying the values will yield better results. This is because if the batch size falls between two sizes, the framework will automatically pad the token count to the larger size. This often leads to actual performance deviating from the expected or even degrading.
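As a sketch of pinning the sizes manually (the particular values below are placeholders chosen to bracket the expected decode batch sizes):

```shell
# illustrative: specify capture sizes explicitly instead of the evenly spaced defaults
vllm serve /model/Qwen3-32B-W8A8 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [4, 8, 16, 32, 64, 128, 192, 256]}'
```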
@@ -174,6 +174,7 @@ Run performance evaluation of `Qwen3-Next` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -7,11 +7,13 @@ Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation models.
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, accuracy and performance evaluation.
## Supported Features

Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) to get the model's supported feature matrix.

Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) to get the feature's configuration.

## Environment Preparation

### Model Weight

- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64G × 2). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)
@@ -77,7 +79,9 @@ ffmpeg -version
```

## Deployment

### Single-node Deployment

#### Offline Inference on Multi-NPU

Run the following script to execute offline inference on multi-NPU:
@@ -177,6 +181,7 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_ex
```

## Functional Verification

Once your server is started, you can query the model with input prompts.

```bash
@@ -225,7 +230,8 @@ Here are accuracy evaluation methods.
### Using EvalScope

As an example, take the `gsm8k`, `omnibench` and `bbh` datasets as test datasets, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in online mode.

1. Refer to [Using evalscope](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope` installation.
2. Run `evalscope` to execute the accuracy evaluation.

```bash
@@ -258,11 +264,13 @@ evalscope eval \
## Performance

### Using vLLM Benchmark

Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -86,7 +86,8 @@ If you want to deploy multi-node environment, you need to set up environment on
## Deployment

### Multi-node Deployment with MP (Recommended)

Assume you have Atlas 800 A3 (64G*16) nodes (or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.

Node 0
@@ -179,12 +180,13 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
```

The parameters are explained as follows:
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
  - (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
  - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
  - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
@@ -196,7 +198,7 @@ The parameters are explained as follows:
If the service starts successfully, the following information will be displayed on node 0:

```shell
INFO: Started server process [44610]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -259,6 +261,7 @@ Run performance evaluation of `Qwen3-VL-235B-A22B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -1,6 +1,7 @@
# Qwen3-Embedding

## Introduction
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Supported Features
@@ -16,11 +17,15 @@ Refer to [supported features](../user_guide/support_matrix/supported_models.md)
- `Qwen3-Embedding-0.6B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

### Installation

You can use our official docker image to run `Qwen3-Embedding` series models.

- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).

If you don't want to use the docker image as above, you can also build all from source:

- Install `vllm-ascend` from source, refer to [installation](../installation.md).

## Deployment
@@ -1,6 +1,7 @@
# Qwen3-Reranker

## Introduction
The Qwen3 Reranker model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Supported Features
@@ -18,10 +19,13 @@ Refer to [supported features](../user_guide/support_matrix/supported_models.md)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

### Installation

You can use our official docker image to run `Qwen3-Reranker` series models.

- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).

If you don't want to use the docker image as above, you can also build all from source:

- Install `vllm-ascend` from source, refer to [installation](../installation.md).

## Deployment
@@ -301,15 +301,16 @@ bash proxy.sh
**Notice:**
The parameters are explained as follows:

- `--tensor-parallel-size 16` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 8` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
  - (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
  - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
  - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
@@ -321,6 +322,7 @@ The parameters are explained as follows:
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the pd co-locate deployment scenario. It will be removed in the future.
**Notice:**

- tensor-parallel-size needs to be divisible by decode-context-parallel-size.
- decode-context-parallel-size must be less than or equal to tensor-parallel-size.
@@ -351,6 +353,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -15,6 +15,7 @@ Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use 1 A
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

### Run with Docker

Start a Docker container on each node.

```{code-block} bash
@@ -103,19 +104,21 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
```

**Notice:**

- For vLLM versions below `v0.12.0`, use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`
- For vLLM version `v0.12.0`, use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`

The parameters are explained as follows:
- `--tensor-parallel-size 8` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 2` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
  - (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
  - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
  - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
@@ -126,6 +129,7 @@ The parameters are explained as follows:
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.
**Notice:**

- tp_size needs to be divisible by dcp_size.
- decode context parallel size must be less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.
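For example, assuming `Qwen3-235B-A22B` uses 4 KV heads (an assumption about the model config), `--tensor-parallel-size 8` gives max_dcp_size = 8 // 4 = 2, which is consistent with the `--decode-context-parallel-size 2` setting shown above.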
@@ -156,6 +160,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -126,6 +126,7 @@ for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
:::::

## Run with Docker

Start a Docker container on each node.

```{code-block} bash
@@ -172,7 +173,7 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
@@ -224,10 +225,12 @@ export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LI
We can run the following scripts to launch a server on the prefiller/decoder node, respectively. Please note that each P/D node will occupy ports ranging from kv_port to kv_port + num_chips to initialize socket listeners. To avoid any issues, port conflicts should be prevented. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.
### launch_online_dp.py

Use `launch_online_dp.py` to launch external dp vllm servers.
[launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)

### run_dp_template.sh

Modify `run_dp_template.sh` on each node.
[run\_dp\_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)
@@ -56,6 +56,7 @@ for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```

## Run with Docker

Start a Docker container.

```{code-block} bash
@@ -93,7 +94,7 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
@@ -8,12 +8,12 @@ Multi-node inference is suitable for scenarios where the model cannot be deploye
## Verify Multi-Node Communication Environment

### Physical Layer Requirements

* The physical machines must be located on the same LAN, with network connectivity.
* All NPUs are connected with optical modules, and the connection status must be normal.

### Verification Process

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
@@ -32,7 +32,8 @@ Execute the following commands on each node in sequence. The results must all be
cat /etc/hccn.conf
```

### NPU Interconnect Verification

#### 1. Get NPU IP Addresses

```bash
@@ -47,7 +48,9 @@ hccn_tool -i 0 -ping -g address 10.20.0.20
```

## Set Up and Start the Ray Cluster

### Setting Up the Basic Container

To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is advised to use Docker images.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the primary and secondary nodes, with the `--net=host` option to enable proper network connectivity.
@@ -88,6 +91,7 @@ docker run --rm \
```

### Start Ray Cluster

After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
Choose one machine as the primary node and the others as secondary nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
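A minimal sketch of the corresponding commands (the head-node address, port, and device count are placeholders; vllm-ascend examples typically also export `GLOO_SOCKET_IFNAME` and `TP_SOCKET_IFNAME` to the chosen `nic_name`):

```shell
# on the primary (head) node
ray start --head --num-gpus=16

# on each secondary node, pointing at the head node's address
ray start --address='<head_node_ip>:6379' --num-gpus=16
```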
@@ -133,9 +137,10 @@ Once the cluster is started on multiple nodes, execute `ray status` and `ray lis
After Ray is successfully started, the following content will appear:\
A local Ray instance has started successfully.\
Dashboard URL: The access address for the Ray Dashboard (default: <http://localhost:8265>); Node status (CPU/memory resources, number of healthy nodes); Cluster connection address (used for adding multiple nodes).
## Start the Online Inference Service on Multi-node scenario

In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.

**You only need to run the vllm command on one node.**