[Doc][Misc] Restructure tutorial documentation (#6501)
### What this PR does / why we need it?

This PR refactors the tutorial documentation by restructuring it into three categories: Models, Features, and Hardware. This improves the organization and navigation of the tutorials, making it easier for users to find relevant information.

- The single `tutorials/index.md` is split into three separate index files:
  - `docs/source/tutorials/models/index.md`
  - `docs/source/tutorials/features/index.md`
  - `docs/source/tutorials/hardwares/index.md`
- Existing tutorial markdown files have been moved into their respective new subdirectories (`models/`, `features/`, `hardwares/`).
- The main `index.md` has been updated to link to these new tutorial sections.

This change makes the documentation structure more logical and scalable for future additions.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the structure and URLs of the tutorial documentation pages. Users following old links to tutorials will encounter broken links. It is recommended to set up redirects if the documentation framework supports them.

### How was this patch tested?

These are documentation-only changes. The documentation should be built and reviewed locally to ensure all links are correct and the pages render as expected.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
docs/source/tutorials/models/DeepSeek-R1.md (new file, 309 lines)
|
||||
# DeepSeek-R1
|
||||
|
||||
## Introduction
|
||||
|
||||
DeepSeek-R1 is a high-performance Mixture-of-Experts (MoE) large language model developed by DeepSeek. It excels in complex logical reasoning, mathematical problem-solving, and code generation. By dynamically activating its expert networks, it delivers exceptional performance while maintaining computational efficiency. Building upon R1, DeepSeek-R1-W8A8 is a fully quantized version of the model. It employs 8-bit integer (INT8) quantization for both weights and activations, which significantly reduces the model's memory footprint and computational requirements, enabling more efficient deployment and application in resource-constrained environments.

This article uses the `DeepSeek-R1-W8A8` version as an example to introduce the deployment of the R1 series models.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `DeepSeek-R1-W8A8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8)
|
||||
|
||||
It is recommended to download the model weight to a directory shared by all nodes.
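As a minimal, hedged sketch (assuming the ModelScope CLI is available in your environment), the weight can be pre-downloaded into the ModelScope cache that the serve commands below rely on:

```shell
# Pre-download the quantized weights so that `vllm serve vllm-ascend/DeepSeek-R1-W8A8`
# (with VLLM_USE_MODELSCOPE=True) can load them without downloading at startup.
pip install modelscope
modelscope download --model vllm-ascend/DeepSeek-R1-W8A8
```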
|
||||
|
||||
### Verify Multi-node Communication (Optional)

If you want to deploy a multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
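For a quick sanity check, here is a hedged sketch using `hccn_tool` (shipped with the Ascend driver; device index `0` is only an example) to confirm that the NPU NICs are up and have addresses configured. The linked installation guide remains the authoritative procedure:

```shell
# Check the link status and configured IP of NPU NIC 0 (repeat for the other devices).
hccn_tool -i 0 -link -g
hccn_tool -i 0 -ip -g
# Check the network health status of NPU NIC 0.
hccn_tool -i 0 -net_health -g
```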
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `DeepSeek-R1-W8A8` directly.
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node; refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /etc/hccn.conf:/etc/hccn.conf \
|
||||
-v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-it $IMAGE bash
|
||||
```
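Once inside the container, a quick check (using the host `npu-smi` tool mounted by the command above) confirms that all NPUs are visible before you continue:

```shell
# List the NPUs visible inside the container; you should see every device passed via --device.
npu-smi info
```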
|
||||
|
||||
If you want to deploy a multi-node environment, you need to set up the environment on each node.
|
||||
|
||||
## Deployment
|
||||
|
||||
### Service-oriented Deployment
|
||||
|
||||
- `DeepSeek-R1-W8A8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes.
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} DeepSeek-R1-W8A8 A3 series
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
# AIV
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export VLLM_ASCEND_BALANCE_SCHEDULING=1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--data-parallel-size 4 \
|
||||
--tensor-parallel-size 4 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_r1 \
|
||||
--enable-expert-parallel \
|
||||
--async-scheduling \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--speculative-config '{"num_speculative_tokens":3,"method":"mtp"}' \
|
||||
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
|
||||
```
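The server needs some time to load the weights and capture graphs before it can accept requests. As a hedged convenience, you can poll vLLM's `/health` endpoint on the serving port until it responds:

```shell
# Wait until the server reports healthy (HTTP 200) before sending requests.
until curl -sf http://127.0.0.1:8000/health > /dev/null; do
    echo "waiting for the vLLM server..."
    sleep 10
done
echo "server is ready"
```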
|
||||
|
||||
**Notice:**
|
||||
The parameters are explained as follows:
|
||||
|
||||
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT in the v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in prefill-decode (PD) disaggregation scenarios.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
- `--max-model-len` specifies the maximum context length, that is, the sum of input and output tokens for a single request. For performance testing with an input length of 3.5K and an output length of 1.5K, a value of `16384` is sufficient; for precision testing, however, set it to at least `35000`.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable prefix caching, remove this option.
- If you use the w4a8 weights, more memory is left for the KV cache, so you can try increasing the concurrency to achieve greater throughput.
|
||||
|
||||
::::
|
||||
::::{tab-item} DeepSeek-R1-W8A8 A2 series
|
||||
|
||||
Run the following scripts on two nodes respectively.
|
||||
|
||||
**Node 0**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_ASCEND_BALANCE_SCHEDULING=1
|
||||
export HCCL_INTRA_PCIE_ENABLE=1
|
||||
export HCCL_INTRA_ROCE_ENABLE=0
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-address $local_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 4 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_r1 \
|
||||
--enable-expert-parallel \
|
||||
--async-scheduling \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--speculative-config '{"num_speculative_tokens":3,"method":"mtp"}' \
|
||||
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
|
||||
```
|
||||
|
||||
**Node 1**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
node0_ip="xxxx" # same as the local_IP address in node 0
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_ASCEND_BALANCE_SCHEDULING=1
|
||||
export HCCL_INTRA_PCIE_ENABLE=1
|
||||
export HCCL_INTRA_ROCE_ENABLE=0
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--headless \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-start-rank 2 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 4 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_r1 \
|
||||
--enable-expert-parallel \
|
||||
--async-scheduling \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--speculative-config '{"num_speculative_tokens":3,"method":"mtp"}' \
|
||||
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
For this scenario, we recommend following the [DeepSeek-V3.1](./DeepSeek-V3.1.md) deployment guide.

This solution has been tested and demonstrates excellent performance.
|
||||
|
||||
## Functional Verification
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```shell
|
||||
curl http://<node0_ip>:<port>/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "deepseek_r1",
|
||||
"prompt": "The future of AI is",
|
||||
"max_completion_tokens": 50,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
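If you prefer the OpenAI-compatible chat interface, a hedged equivalent request against the same server's `/v1/chat/completions` endpoint looks like this:

```shell
curl http://<node0_ip>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_r1",
        "messages": [{"role": "user", "content": "The future of AI is"}],
        "max_tokens": 50,
        "temperature": 0
    }'
```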
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Here are two accuracy evaluation methods.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `DeepSeek-R1-W8A8` on `vllm-ascend:0.11.0rc2`, for reference only.
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat |
|
||||
|----- | ----- | ----- | ----- | -----|
|
||||
| aime2024dataset | - | accuracy | gen | 80.00 |
|
||||
| gpqadataset | - | accuracy | gen | 72.22 |
|
||||
|
||||
### Using Language Model Evaluation Harness
|
||||
|
||||
As an example, take the `gsm8k` dataset as a test dataset, and run accuracy evaluation of `DeepSeek-R1-W8A8` in online mode.
|
||||
|
||||
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation.
|
||||
|
||||
2. Run `lm_eval` to execute the accuracy evaluation.
|
||||
|
||||
```shell
|
||||
lm_eval \
|
||||
--model local-completions \
|
||||
--model_args model=path/DeepSeek-R1-W8A8,base_url=http://<node0_ip>:<port>/v1/completions,tokenized_requests=False,trust_remote_code=True \
|
||||
--tasks gsm8k \
|
||||
--output_path ./
|
||||
```
|
||||
|
||||
3. After execution, you can get the result.
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `DeepSeek-R1-W8A8` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take `serve` as an example. Run the command as follows.
|
||||
|
||||
```shell
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
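For comparison, here is a hedged sketch of the `latency` subcommand; it runs the model offline inside the benchmark process, so engine arguments such as the tensor parallel size must match your hardware (the values below are illustrative only):

```shell
export VLLM_USE_MODELSCOPE=true
# Measure single-batch latency; adjust --tensor-parallel-size and batch/length settings to your setup.
vllm bench latency --model path/DeepSeek-R1-W8A8 --tensor-parallel-size 16 --input-len 200 --output-len 128 --batch-size 4 --num-iters 5
```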
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
|
||||
docs/source/tutorials/models/DeepSeek-V3.1.md (new file, 722 lines)
|
||||
# DeepSeek-V3/3.1
|
||||
|
||||
## Introduction
|
||||
|
||||
DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode. Compared to the previous version, this upgrade brings improvements in multiple aspects:
|
||||
|
||||
- Hybrid thinking mode: One model supports both thinking mode and non-thinking mode by changing the chat template.
|
||||
|
||||
- Smarter tool calling: Through post-training optimization, the model's performance in tool usage and agent tasks has significantly improved.
|
||||
|
||||
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.
|
||||
|
||||
The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `DeepSeek-V3.1` (BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8` (Quantized version without MTP): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
- `DeepSeek-V3.1_w8a8mix_mtp` (Quantized version with mixed MTP): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json` (see the sketch after this list).
- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot` (Quantized version with mixed MTP): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- Quantization method: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use it to quantize the model yourself.
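A minimal, hedged way to make that `config.json` edit (the weight directory below is only a placeholder; substitute the path you downloaded the weights to):

```shell
# Switch torch_dtype from float16 to bfloat16 in the downloaded weight's config.json.
sed -i 's/"torch_dtype": "float16"/"torch_dtype": "bfloat16"/' /path/to/DeepSeek-V3.1-w8a8mix-mtp/config.json
```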
|
||||
|
||||
It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.
|
||||
|
||||
### Verify Multi-node Communication (Optional)

If you want to deploy a multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `DeepSeek-V3.1` directly.
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
If you want to deploy a multi-node environment, you need to set up the environment on each node.
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
- Quantized model `DeepSeek-V3.1-w8a8-mtp-QuaRot` can be deployed on 1 Atlas 800 A3 (64G × 16).
|
||||
|
||||
Run the following script to execute online inference.
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
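# (Optional check, not part of the original guide) You can confirm jemalloc is installed
# and locate the library with: ldconfig -p | grep jemalloc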
|
||||
|
||||
# AIV
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export VLLM_ASCEND_BALANCE_SCHEDULING=1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
|
||||
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port 8015 \
|
||||
--data-parallel-size 4 \
|
||||
--tensor-parallel-size 4 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3 \
|
||||
--enable-expert-parallel \
|
||||
--async-scheduling \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--speculative-config '{"num_speculative_tokens": 3, "method": "mtp"}' \
|
||||
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
|
||||
```
|
||||
|
||||
**Notice:**
|
||||
The parameters are explained as follows:
|
||||
|
||||
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT in the v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in prefill-decode (PD) disaggregation scenarios.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
- `--max-model-len` specifies the maximum context length, that is, the sum of input and output tokens for a single request. For performance testing with an input length of 3.5K and an output length of 1.5K, a value of `16384` is sufficient; for precision testing, however, set it to at least `35000`.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable prefix caching, remove this option.
- If you use the w4a8 weights, more memory is left for the KV cache, so you can try increasing the concurrency to achieve greater throughput.
|
||||
|
||||
### Multi-node Deployment
|
||||
|
||||
- `DeepSeek-V3.1-w8a8-mtp-QuaRot`: requires at least 2 Atlas 800 A2 (64G × 8) nodes.
|
||||
|
||||
Run the following scripts on two nodes respectively.
|
||||
|
||||
**Node 0**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_ASCEND_BALANCE_SCHEDULING=1
|
||||
export HCCL_INTRA_PCIE_ENABLE=1
|
||||
export HCCL_INTRA_ROCE_ENABLE=0
|
||||
|
||||
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port 8004 \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 4 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3 \
|
||||
--enable-expert-parallel \
|
||||
--async-scheduling \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--speculative-config '{"num_speculative_tokens": 3, "method": "mtp"}' \
|
||||
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
|
||||
```
|
||||
|
||||
**Node 1**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="xxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_ASCEND_BALANCE_SCHEDULING=1
|
||||
export HCCL_INTRA_PCIE_ENABLE=1
|
||||
export HCCL_INTRA_ROCE_ENABLE=0
|
||||
|
||||
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port 8004 \
|
||||
--headless \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-start-rank 2 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 4 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3 \
|
||||
--enable-expert-parallel \
|
||||
--async-scheduling \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--speculative-config '{"num_speculative_tokens": 3, "method": "mtp"}' \
|
||||
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
|
||||
```
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
We recommend using Mooncake for deployment: [Mooncake](../features/pd_disaggregation_mooncake_multi_node.md).
|
||||
|
||||
Taking Atlas 800 A3 (64G × 16) as an example, we recommend deploying 2P1D (4 nodes) rather than 1P1D (2 nodes), because there is not enough NPU memory to serve high concurrency in the 1P1D case.

- `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` requires 4 Atlas 800 A3 (64G × 16) nodes.
|
||||
|
||||
To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_online_dp.py` script and a `run_dp_template.sh` script on each node, and a `proxy.sh` script on the prefill master node to forward requests.

1. Prepare `launch_online_dp.py`, which launches the external DP vLLM servers:
[launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)
|
||||
|
||||
2. Prefill Node 0 `run_dp_template.sh` script
|
||||
|
||||
```shell
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="141.xx.xx.1"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export VLLM_RPC_TIMEOUT=3600000
|
||||
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
|
||||
export HCCL_EXEC_TIMEOUT=204
|
||||
export HCCL_CONNECT_TIMEOUT=120
|
||||
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=256
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
export ASCEND_BUFFER_POOL=4:8
|
||||
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
|
||||
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
|
||||
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3 \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--max-num-seqs 8 \
|
||||
--enforce-eager \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--quantization ascend \
|
||||
--no-enable-prefix-caching \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
|
||||
--additional-config '{"recompute_scheduler_enable":true}' \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_port": "30000",
|
||||
"engine_id": "0",
|
||||
"kv_connector_extra_config": {
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 32,
|
||||
"tp_size": 1
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
3. Prefill Node 1 `run_dp_template.sh` script
|
||||
|
||||
```shell
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="141.xx.xx.2"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export VLLM_RPC_TIMEOUT=3600000
|
||||
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
|
||||
export HCCL_EXEC_TIMEOUT=204
|
||||
export HCCL_CONNECT_TIMEOUT=120
|
||||
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=256
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
export ASCEND_BUFFER_POOL=4:8
|
||||
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
|
||||
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
|
||||
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3 \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--max-num-seqs 8 \
|
||||
--enforce-eager \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--quantization ascend \
|
||||
--no-enable-prefix-caching \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
|
||||
--additional-config '{"recompute_scheduler_enable":true}' \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_port": "30100",
|
||||
"engine_id": "1",
|
||||
"kv_connector_extra_config": {
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 32,
|
||||
"tp_size": 1
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
4. Decode Node 0 `run_dp_template.sh` script
|
||||
|
||||
```shell
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="141.xx.xx.3"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export VLLM_RPC_TIMEOUT=3600000
|
||||
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
|
||||
export HCCL_EXEC_TIMEOUT=204
|
||||
export HCCL_CONNECT_TIMEOUT=120
|
||||
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=1100
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
export ASCEND_BUFFER_POOL=4:8
|
||||
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
|
||||
|
||||
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3 \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 256 \
|
||||
--max-num-seqs 28 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--quantization ascend \
|
||||
--no-enable-prefix-caching \
|
||||
--async-scheduling \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
|
||||
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_port": "30200",
|
||||
"engine_id": "2",
|
||||
"kv_connector_extra_config": {
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 32,
|
||||
"tp_size": 1
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
5. Decode Node 1 `run_dp_template.sh` script
|
||||
|
||||
```shell
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="141.xx.xx.4"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export VLLM_RPC_TIMEOUT=3600000
|
||||
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
|
||||
export HCCL_EXEC_TIMEOUT=204
|
||||
export HCCL_CONNECT_TIMEOUT=120
|
||||
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=1100
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
export ASCEND_BUFFER_POOL=4:8
|
||||
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
|
||||
|
||||
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3 \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 256 \
|
||||
--max-num-seqs 28 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--quantization ascend \
|
||||
--no-enable-prefix-caching \
|
||||
--async-scheduling \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
|
||||
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_port": "30300",
|
||||
"engine_id": "3",
|
||||
"kv_connector_extra_config": {
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 32,
|
||||
"tp_size": 1
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
**Notice:**
|
||||
The parameters are explained as follows:
|
||||
|
||||
- `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`: enables the communication optimization function on the prefill nodes.
|
||||
- `VLLM_ASCEND_ENABLE_MLAPO=1`: enables the fusion operator, which can significantly improve performance but consumes more NPU memory. In the Prefill-Decode (PD) separation scenario, enable MLAPO only on decode nodes.
|
||||
- `--async-scheduling`: enables the asynchronous scheduling function. When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency.
|
||||
- `cudagraph_capture_sizes`: the recommended values are `n × (mtp + 1)`, where `n` ranges from `1` up to `max-num-seqs`; for example, with `mtp = 1` the sizes are multiples of 2, as in the decode command above. Otherwise, it is recommended to set them to the request batch sizes that occur most frequently on the decode (D) node.
|
||||
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously.
|
||||
- `multistream_overlap_shared_expert: true`: When the Tensor Parallelism (TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream is enabled to overlap the computation process of shared experts for improved efficiency.
|
||||
- `lmhead_tensor_parallel_size: 16`: When the Tensor Parallelism (TP) size of the decode node is 1, this parameter allows the TP size of the LMHead embedding layer to be greater than 1, which is used to reduce the computational load of each card on the LMHead embedding layer.
|
||||
|
||||
6. Run the server on each node.
|
||||
|
||||
```shell
|
||||
# p0
|
||||
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.1 --dp-rpc-port 12321 --vllm-start-port 7100
|
||||
# p1
|
||||
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.2 --dp-rpc-port 12321 --vllm-start-port 7100
|
||||
# d0
|
||||
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
|
||||
# d1
|
||||
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
|
||||
```
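Before starting the proxy, it can help to confirm that every local vLLM DP instance is up. A hedged sketch, using the per-instance ports implied by `--vllm-start-port 7100` and the local DP size of a decode node:

```shell
# Poll the health endpoint of each local instance (ports 7100-7115 on a decode node).
for port in $(seq 7100 7115); do
    curl -sf http://127.0.0.1:${port}/health > /dev/null && echo "port ${port} ready" || echo "port ${port} not ready"
done
```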
|
||||
|
||||
7. Run the proxy script `proxy.sh` on the prefill master node.

Run a proxy server on the same node as the prefill service instance. You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
|
||||
|
||||
```shell
|
||||
python load_balance_proxy_server_example.py \
|
||||
--port 1999 \
|
||||
--host 141.xx.xx.1 \
|
||||
--prefiller-hosts \
|
||||
141.xx.xx.1 \
|
||||
141.xx.xx.1 \
|
||||
141.xx.xx.2 \
|
||||
141.xx.xx.2 \
|
||||
--prefiller-ports \
|
||||
7100 7101 7100 7101 \
|
||||
--decoder-hosts \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.3 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
141.xx.xx.4 \
|
||||
--decoder-ports \
|
||||
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
|
||||
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
|
||||
```
|
||||
|
||||
```shell
|
||||
cd vllm-ascend/examples/disaggregated_prefill_v1/
|
||||
bash proxy.sh
|
||||
```
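Once the proxy is running, client requests go to the proxy port (1999 in the script above) instead of the individual instances. A hedged sanity check, assuming the proxy forwards the standard completions route:

```shell
curl http://141.xx.xx.1:1999/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
```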
|
||||
|
||||
## Functional Verification
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```shell
|
||||
curl http://<node0_ip>:<port>/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "deepseek_v3",
|
||||
"prompt": "The future of AI is",
|
||||
"max_completion_tokens": 50,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
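The hybrid thinking mode described in the introduction is selected through the chat template. The following is a hedged sketch only (vLLM's OpenAI-compatible server accepts `chat_template_kwargs`, and DeepSeek-V3.1's chat template exposes a `thinking` switch; verify the exact field against the model's chat template before relying on it):

```shell
# Request thinking mode explicitly; set "thinking": false (or omit chat_template_kwargs)
# for the non-thinking mode.
curl http://<node0_ip>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3",
        "messages": [{"role": "user", "content": "The future of AI is"}],
        "chat_template_kwargs": {"thinking": true},
        "max_tokens": 50
    }'
```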
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Here are two accuracy evaluation methods.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `DeepSeek-V3.1-w8a8-mtp-QuaRot` on `vllm-ascend:0.11.0rc1`, for reference only.
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat | note |
|
||||
|----- | ----- | ----- | ----- | -----| ----- |
|
||||
| ceval | - | accuracy | gen | 90.94 | 1 Atlas 800 A3 (64G × 16) |
|
||||
| gsm8k | - | accuracy | gen | 96.28 | 1 Atlas 800 A3 (64G × 16) |
|
||||
|
||||
### Using Language Model Evaluation Harness
|
||||
|
||||
Not tested yet.
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
The performance result is:
|
||||
|
||||
**Hardware**: A3-752T, 4 nodes
|
||||
|
||||
**Deployment**: 2P1D, Prefill node: DP2+TP8, Decode Node: DP32+TP1
|
||||
|
||||
**Input/Output**: 3.5k/1.5k
|
||||
|
||||
**Performance**: TTFT = 6.16s, TPOT = 48.82ms, Average performance of each card is 478 TPS (Token Per Second).
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take `serve` as an example. Run the command as follows.
|
||||
|
||||
```shell
|
||||
vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
|
||||
docs/source/tutorials/models/DeepSeek-V3.2.md (new file, 915 lines)
|
||||
# DeepSeek-V3.2
|
||||
|
||||
## Introduction
|
||||
|
||||
DeepSeek-V3.2 is a sparse-attention model. Its main architecture is similar to DeepSeek-V3.1, but it adds a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `DeepSeek-V3.2-Exp` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
- `DeepSeek-V3.2` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. The BF16 model weight is not available yet.
- `DeepSeek-V3.2-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)
|
||||
|
||||
It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.
|
||||
|
||||
### Verify Multi-node Communication (Optional)

If you want to deploy a multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `DeepSeek-V3.2` directly.
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} A3 series
|
||||
:sync: A3
|
||||
|
||||
Start the docker image on each of your nodes.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci8 \
|
||||
--device /dev/davinci9 \
|
||||
--device /dev/davinci10 \
|
||||
--device /dev/davinci11 \
|
||||
--device /dev/davinci12 \
|
||||
--device /dev/davinci13 \
|
||||
--device /dev/davinci14 \
|
||||
--device /dev/davinci15 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} A2 series
|
||||
:sync: A2
|
||||
|
||||
Start the docker image on each of your nodes.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
In addition, if you don't want to use the docker image above, you can also build everything from source:

- Install `vllm-ascend` from source; refer to [installation](../../installation.md).
|
||||
|
||||
If you want to deploy a multi-node environment, you need to set up the environment on each node.
|
||||
|
||||
## Deployment
|
||||
|
||||
:::{note}
|
||||
In this tutorial, we suppose you downloaded the model weight to `/root/.cache/`. Feel free to change it to your own path.
|
||||
:::
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
- Quantized model `DeepSeek-V3.2-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16).
|
||||
|
||||
Run the following script to execute online inference.
|
||||
|
||||
```shell
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export VLLM_ASCEND_ENABLE_MLAPO=1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
|
||||
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--data-parallel-size 2 \
|
||||
--tensor-parallel-size 8 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3_2 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 8192 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
|
||||
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
|
||||
|
||||
```
|
||||
|
||||
### Multi-node Deployment
|
||||
|
||||
- `DeepSeek-V3.2-w8a8`: requires at least 2 Atlas 800 A2 (64G × 8) nodes.
|
||||
|
||||
Run the following scripts on two nodes respectively.
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} A3 series
|
||||
:sync: A3
|
||||
|
||||
**Node0**
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="xxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export VLLM_ASCEND_ENABLE_MLAPO=1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
|
||||
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8077 \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 12890 \
|
||||
--tensor-parallel-size 16 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3_2 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 8192 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
|
||||
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
|
||||
```
|
||||
|
||||
**Node1**
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="xxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export VLLM_ASCEND_ENABLE_MLAPO=1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
|
||||
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8077 \
|
||||
--headless \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-start-rank 1 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 12890 \
|
||||
--tensor-parallel-size 16 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3_2 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 8192 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
|
||||
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} A2 series
|
||||
:sync: A2
|
||||
|
||||
**Node0**
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# this obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="xxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=100
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export VLLM_ASCEND_ENABLE_MLAPO=1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=0
|
||||
export HCCL_CONNECT_TIMEOUT=120
|
||||
export HCCL_INTRA_PCIE_ENABLE=1
|
||||
export HCCL_INTRA_ROCE_ENABLE=0
|
||||
|
||||
|
||||
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8077 \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 8 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3_2 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 8192 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]}' \
|
||||
--speculative-config '{"num_speculative_tokens": 2, "method": "deepseek_mtp"}'
|
||||
|
||||
```
|
||||
|
||||
**Node1**
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# These values can be obtained via ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxx"
|
||||
local_ip="xxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=100
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=200
|
||||
export VLLM_ASCEND_ENABLE_MLAPO=1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=0
|
||||
export HCCL_CONNECT_TIMEOUT=120
|
||||
export HCCL_INTRA_PCIE_ENABLE=1
|
||||
export HCCL_INTRA_ROCE_ENABLE=0
|
||||
|
||||
|
||||
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8077 \
|
||||
--headless \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-start-rank 1 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 8 \
|
||||
--quantization ascend \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3_2 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 8192 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]}' \
|
||||
--speculative-config '{"num_speculative_tokens": 2, "method": "deepseek_mtp"}'
|
||||
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
This section shows how to deploy `DeepSeek-V3.2` in a multi-node environment with 1P1D (one prefill instance and one decode instance) disaggregation for better performance.
|
||||
|
||||
Before you start, please
|
||||
|
||||
1. prepare the script `launch_online_dp.py` on each node.
|
||||
|
||||
```python
|
||||
import argparse
|
||||
import multiprocessing
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"--dp-size",
|
||||
type=int,
|
||||
required=True,
|
||||
help="Data parallel size."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--tp-size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="Tensor parallel size."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--dp-size-local",
|
||||
type=int,
|
||||
default=-1,
|
||||
help="Local data parallel size."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--dp-rank-start",
|
||||
type=int,
|
||||
default=0,
|
||||
help="Starting rank for data parallel."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--dp-address",
|
||||
type=str,
|
||||
required=True,
|
||||
help="IP address for data parallel master node."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--dp-rpc-port",
|
||||
type=str,
|
||||
        default="12345",  # keep as a string so it can be passed straight to subprocess
|
||||
help="Port for data parallel master node."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--vllm-start-port",
|
||||
type=int,
|
||||
default=9000,
|
||||
help="Starting port for the engine."
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
args = parse_args()
|
||||
dp_size = args.dp_size
|
||||
tp_size = args.tp_size
|
||||
dp_size_local = args.dp_size_local
|
||||
if dp_size_local == -1:
|
||||
dp_size_local = dp_size
|
||||
dp_rank_start = args.dp_rank_start
|
||||
dp_address = args.dp_address
|
||||
dp_rpc_port = args.dp_rpc_port
|
||||
vllm_start_port = args.vllm_start_port
|
||||
|
||||
def run_command(visible_devices, dp_rank, vllm_engine_port):
|
||||
command = [
|
||||
"bash",
|
||||
"./run_dp_template.sh",
|
||||
visible_devices,
|
||||
str(vllm_engine_port),
|
||||
str(dp_size),
|
||||
str(dp_rank),
|
||||
dp_address,
|
||||
dp_rpc_port,
|
||||
str(tp_size),
|
||||
]
|
||||
subprocess.run(command, check=True)
|
||||
|
||||
if __name__ == "__main__":
|
||||
template_path = "./run_dp_template.sh"
|
||||
if not os.path.exists(template_path):
|
||||
print(f"Template file {template_path} does not exist.")
|
||||
sys.exit(1)
|
||||
|
||||
processes = []
|
||||
num_cards = dp_size_local * tp_size
|
||||
for i in range(dp_size_local):
|
||||
dp_rank = dp_rank_start + i
|
||||
vllm_engine_port = vllm_start_port + i
|
||||
visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
|
||||
process = multiprocessing.Process(target=run_command,
|
||||
args=(visible_devices, dp_rank,
|
||||
vllm_engine_port))
|
||||
processes.append(process)
|
||||
process.start()
|
||||
|
||||
for process in processes:
|
||||
process.join()
|
||||
|
||||
```
|
||||
|
||||
2. prepare the script `run_dp_template.sh` on each node.
|
||||
|
||||
1. Prefill node 0
|
||||
|
||||
```shell
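# Positional parameters supplied by launch_online_dp.py (see run_command above):
#   $1  visible NPU devices (ASCEND_RT_VISIBLE_DEVICES)
#   $2  engine HTTP port
#   $3  data-parallel size
#   $4  data-parallel rank of this engine
#   $5  data-parallel master address
#   $6  data-parallel RPC port
#   $7  tensor-parallel size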
|
||||
nic_name="enp48s3u1u1" # change to your own nic name
|
||||
local_ip=141.61.39.105 # change to your own ip
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=256
|
||||
|
||||
export ASCEND_AGGREGATE_ENABLE=1
|
||||
export ASCEND_TRANSPORT_PRINT=1
|
||||
export ACL_OP_INIT_MODE=1
|
||||
export ASCEND_A3_ENABLE=1
|
||||
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
|
||||
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
|
||||
|
||||
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
|
||||
--profiler-config \
|
||||
'{"profiler": "torch",
|
||||
"torch_profiler_dir": "./vllm_profile",
|
||||
"torch_profiler_with_stack": false}' \
|
||||
--seed 1024 \
|
||||
--served-model-name dsv3 \
|
||||
--max-model-len 68000 \
|
||||
--max-num-batched-tokens 32550 \
|
||||
--trust-remote-code \
|
||||
--max-num-seqs 64 \
|
||||
--gpu-memory-utilization 0.82 \
|
||||
--quantization ascend \
|
||||
--enforce-eager \
|
||||
--no-enable-prefix-caching \
|
||||
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_port": "30000",
|
||||
"engine_id": "0",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 16
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 8,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
}'
|
||||
|
||||
```
|
||||
|
||||
2. Prefill node 1
|
||||
|
||||
```shell
|
||||
nic_name="enp48s3u1u1" # change to your own nic name
|
||||
local_ip=141.61.39.113 # change to your own ip
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=256
|
||||
|
||||
export ASCEND_AGGREGATE_ENABLE=1
|
||||
export ASCEND_TRANSPORT_PRINT=1
|
||||
export ACL_OP_INIT_MODE=1
|
||||
export ASCEND_A3_ENABLE=1
|
||||
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
|
||||
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
|
||||
|
||||
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
|
||||
--profiler-config \
|
||||
'{"profiler": "torch",
|
||||
"torch_profiler_dir": "./vllm_profile",
|
||||
"torch_profiler_with_stack": false}' \
|
||||
--seed 1024 \
|
||||
--served-model-name dsv3 \
|
||||
--max-model-len 68000 \
|
||||
--max-num-batched-tokens 32550 \
|
||||
--trust-remote-code \
|
||||
--max-num-seqs 64 \
|
||||
--gpu-memory-utilization 0.82 \
|
||||
--quantization ascend \
|
||||
--enforce-eager \
|
||||
--no-enable-prefix-caching \
|
||||
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_port": "30000",
|
||||
"engine_id": "0",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 16
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 8,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
3. Decode node 0
|
||||
|
||||
```shell
|
||||
nic_name="enp48s3u1u1" # change to your own nic name
|
||||
local_ip=141.61.39.117 # change to your own ip
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
#Mooncake
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=256
|
||||
|
||||
|
||||
export ASCEND_AGGREGATE_ENABLE=1
|
||||
export ASCEND_TRANSPORT_PRINT=1
|
||||
export ACL_OP_INIT_MODE=1
|
||||
export ASCEND_A3_ENABLE=1
|
||||
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
|
||||
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
|
||||
|
||||
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
|
||||
--profiler-config \
|
||||
'{"profiler": "torch",
|
||||
"torch_profiler_dir": "./vllm_profile",
|
||||
"torch_profiler_with_stack": false}' \
|
||||
--seed 1024 \
|
||||
--served-model-name dsv3 \
|
||||
--max-model-len 68000 \
|
||||
--max-num-batched-tokens 12 \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
|
||||
--trust-remote-code \
|
||||
--max-num-seqs 4 \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--no-enable-prefix-caching \
|
||||
--async-scheduling \
|
||||
--quantization ascend \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_port": "30100",
|
||||
"engine_id": "1",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 16
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 8,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
}' \
|
||||
--additional-config '{"recompute_scheduler_enable" : true}'
|
||||
```
|
||||
|
||||
4. Decode node 1
|
||||
|
||||
```shell
|
||||
nic_name="enp48s3u1u1" # change to your own nic name
|
||||
local_ip=141.61.39.181 # change to your own ip
|
||||
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
#Mooncake
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=256
|
||||
|
||||
export ASCEND_AGGREGATE_ENABLE=1
|
||||
export ASCEND_TRANSPORT_PRINT=1
|
||||
export ACL_OP_INIT_MODE=1
|
||||
export ASCEND_A3_ENABLE=1
|
||||
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
|
||||
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
|
||||
|
||||
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
|
||||
--profiler-config \
|
||||
'{"profiler": "torch",
|
||||
"torch_profiler_dir": "./vllm_profile",
|
||||
"torch_profiler_with_stack": false}' \
|
||||
--seed 1024 \
|
||||
--served-model-name dsv3 \
|
||||
--max-model-len 68000 \
|
||||
--max-num-batched-tokens 12 \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
|
||||
--trust-remote-code \
|
||||
--async-scheduling \
|
||||
--max-num-seqs 4 \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--no-enable-prefix-caching \
|
||||
--quantization ascend \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_port": "30100",
|
||||
"engine_id": "1",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 16
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 8,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
}' \
|
||||
--additional-config '{"recompute_scheduler_enable" : true}'
|
||||
```
|
||||
|
||||
Once the preparation is done, you can start the server with the following command on each node:
|
||||
|
||||
1. Prefill node 0
|
||||
|
||||
```shell
|
||||
# change ip to your own
|
||||
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
|
||||
```
|
||||
|
||||
2. Prefill node 1
|
||||
|
||||
```shell
|
||||
# change ip to your own
|
||||
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
|
||||
```
|
||||
|
||||
3. Decode node 0
|
||||
|
||||
```shell
|
||||
# change ip to your own
|
||||
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
|
||||
```
|
||||
|
||||
4. Decode node 1
|
||||
|
||||
```shell
|
||||
# change ip to your own
|
||||
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
|
||||
```
|
||||
|
||||
## Functional Verification
|
||||
|
||||
Once your server is started, you can query the model with input prompts. Note that the `model` field must match the name passed to `--served-model-name` when the server was launched (`deepseek_v3_2` for the multi-node deployment above, `dsv3` for the disaggregated deployment):
|
||||
|
||||
```shell
curl http://<node0_ip>:<port>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
```
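If you prefer the chat interface, a minimal sketch of an equivalent request against the `/v1/chat/completions` endpoint is shown below; it assumes the same host, port, and served model name as above, and that the model's chat template is available.

```shell
curl http://<node0_ip>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "messages": [
            {"role": "user", "content": "The future of AI is"}
        ],
        "max_tokens": 50,
        "temperature": 0
    }'
```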
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Here are two accuracy evaluation methods.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result.
|
||||
|
||||
### Using Language Model Evaluation Harness
|
||||
|
||||
As an example, take the `gsm8k` dataset as a test dataset, and run accuracy evaluation of `DeepSeek-V3.2-W8A8` in online mode.
|
||||
|
||||
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation.
|
||||
|
||||
2. Run `lm_eval` to execute the accuracy evaluation. Adjust `model` and `base_url` to match the model path and the address of your running server.
|
||||
|
||||
```shell
|
||||
lm_eval \
|
||||
--model local-completions \
|
||||
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
|
||||
--tasks gsm8k \
|
||||
--output_path ./
|
||||
```
|
||||
|
||||
3. After execution, you can get the result.
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
The performance result is:
|
||||
|
||||
**Hardware**: A3-752T, 4 nodes
|
||||
|
||||
**Deployment**: 1P1D, Prefill node: DP2+TP16, Decode Node: DP8+TP4
|
||||
|
||||
**Input/Output**: 64k/3k
|
||||
|
||||
**Performance**: 533 TPS, TPOT 32 ms
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```shell
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
## Function Call
|
||||
|
||||
The function call feature has been supported since v0.13.0rc1. Please use the latest version.
|
||||
|
||||
Refer to [DeepSeek-V3.2 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#tool-calling-example) for details.
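As a minimal sketch, the request below sends a tool definition through the standard OpenAI `tools` field; it assumes the server was started with function calling enabled as described in the guide above and that the served model name is `deepseek_v3_2`. The `get_weather` tool is a hypothetical example.

```shell
curl http://<node0_ip>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "messages": [
            {"role": "user", "content": "What is the weather like in Beijing today?"}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string", "description": "City name"}
                        },
                        "required": ["city"]
                    }
                }
            }
        ]
    }'
```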
|
||||
177
docs/source/tutorials/models/GLM4.x.md
Normal file
@@ -0,0 +1,177 @@
|
||||
# GLM-4.5/4.6/4.7
|
||||
|
||||
## Introduction
|
||||
|
||||
GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundation models specifically designed for agent applications.
|
||||
|
||||
The `GLM-4.5` model has been supported since `vllm-ascend:v0.10.0rc1`.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `GLM-4.5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5).
|
||||
- `GLM-4.6`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6).
|
||||
- `GLM-4.7`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7).
|
||||
- `GLM-4.5-w8a8-with-float-mtp`(Quantized version with mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8).
|
||||
- `GLM-4.6-w8a8` (Quantized version without MTP): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). An MTP version is not provided because vLLM did not yet support GLM-4.6 MTP when these weights were published; now that support has landed, you can use the quantization scheme below to add MTP weights to the quantized weights.
|
||||
- Quantization method: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can use this scheme to quantize the model.
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `GLM-4.x` directly.
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
- In low-latency scenarios, we recommend a single-machine deployment.
|
||||
- Quantized model `glm4.5_w8a8_with_float_mtp` can be deployed on 1 Atlas 800 A3 (64G × 16) or 1 Atlas 800 A2 (64G × 8).
|
||||
|
||||
Run the following script to execute online inference.
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_OP_EXPANSION_MODE=AIV
|
||||
|
||||
vllm serve /weight/glm4.5_w8a8_with_float_mtp \
|
||||
--data-parallel-size 1 \
|
||||
--tensor-parallel-size 16 \
|
||||
--seed 1024 \
|
||||
--served-model-name glm \
|
||||
--max-model-len 35000 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--max-num-seqs 16 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.5_w8a8_with_float_mtp", "method":"mtp"}' \
|
||||
--compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
|
||||
--async-scheduling
|
||||
```
|
||||
|
||||
**Notice:**
|
||||
The parameters are explained as follows:
|
||||
|
||||
- For single-node deployment, we recommend using `dp1tp16` and turning off expert parallelism in low-latency scenarios.
|
||||
- `--async-scheduling`: Asynchronous scheduling optimizes inference efficiency by allowing non-blocking task scheduling, which improves concurrency and throughput, especially for large models. Once the server is started, you can verify it with the request shown below.
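The sketch below assumes the default port `8000` and the served model name `glm` configured in the script above.

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "glm",
        "messages": [
            {"role": "user", "content": "Who are you?"}
        ],
        "max_tokens": 64
    }'
```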
|
||||
|
||||
### Multi-node Deployment
|
||||
|
||||
Multi-node deployment is not recommended on Atlas 800 A2 (64G × 8).
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
Not tested yet.
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Here are two accuracy evaluation methods.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. The result of `GLM-4.6` on `vllm-ascend:main` (after `vllm-ascend:0.13.0rc1`) is listed below for reference only.
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat | note |
|
||||
|----- | ----- | ----- | ----- | -----| ----- |
|
||||
| gsm8k | - | accuracy | gen | 96.13 | 1 Atlas 800 A3 (64G × 16) |
|
||||
| gsm8k | - | accuracy | gen | 96.06 | GPU |
|
||||
|
||||
### Using Language Model Evaluation Harness
|
||||
|
||||
Not tested yet.
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `GLM-4.x` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```shell
|
||||
vllm bench serve \
|
||||
--backend vllm \
|
||||
--dataset-name prefix_repetition \
|
||||
--prefix-repetition-prefix-len 22400 \
|
||||
--prefix-repetition-suffix-len 9600 \
|
||||
--prefix-repetition-output-len 1024 \
|
||||
--num-prompts 1 \
|
||||
--prefix-repetition-num-prefixes 1 \
|
||||
--ignore-eos \
|
||||
--model glm \
|
||||
--tokenizer /weight/glm4.5_w8a8_with_float_mtp \
|
||||
--seed 1000 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--endpoint /v1/completions \
|
||||
--max-concurrency 1 \
|
||||
--request-rate 1
|
||||
```
|
||||
|
||||
After several minutes, you can get the performance evaluation result.
|
||||
108
docs/source/tutorials/models/Kimi-K2-Thinking.md
Normal file
@@ -0,0 +1,108 @@
|
||||
# Kimi-K2-Thinking
|
||||
|
||||
## Run with Docker
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci8 \
|
||||
--device /dev/davinci9 \
|
||||
--device /dev/davinci10 \
|
||||
--device /dev/davinci11 \
|
||||
--device /dev/davinci12 \
|
||||
--device /dev/davinci13 \
|
||||
--device /dev/davinci14 \
|
||||
--device /dev/davinci15 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /mnt/sfs_turbo/.cache:/home/cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
## Verify the Quantized Model
|
||||
|
||||
Edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` to `["MoE"]` in the `config.json` of the original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking).
|
||||
|
||||
```json
|
||||
{
|
||||
"quantization_config": {
|
||||
"config_groups": {
|
||||
"group_0": {
|
||||
"targets": [
|
||||
"MoE"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
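A minimal sketch of applying this edit with Python's standard `json` module is shown below; it assumes `config.json` sits in the current directory, so adjust the path to your downloaded weights. Note that rewriting the file this way also normalizes its JSON formatting.

```bash
# Rewrite quantization_config.config_groups.group_0.targets to ["MoE"]
python3 - <<'EOF'
import json

path = "config.json"  # adjust to the config.json of your downloaded model
with open(path) as f:
    cfg = json.load(f)
cfg["quantization_config"]["config_groups"]["group_0"]["targets"] = ["MoE"]
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```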
|
||||
|
||||
Your model files look like:
|
||||
|
||||
```bash
|
||||
.
|
||||
|-- chat_template.jinja
|
||||
|-- config.json
|
||||
|-- configuration_deepseek.py
|
||||
|-- configuration.json
|
||||
|-- generation_config.json
|
||||
|-- model-00001-of-000062.safetensors
|
||||
|-- ...
|
||||
|-- model-00062-of-000062.safetensors
|
||||
|-- model.safetensors.index.json
|
||||
|-- modeling_deepseek.py
|
||||
|-- tiktoken.model
|
||||
|-- tokenization_kimi.py
|
||||
`-- tokenizer_config.json
|
||||
```
|
||||
|
||||
## Online Inference on Multi-NPU
|
||||
|
||||
Run the following script to start the vLLM server on Multi-NPU:
|
||||
|
||||
For an Atlas 800 A3 (64G × 16) node, `--tensor-parallel-size` should be at least 16.
|
||||
|
||||
```bash
|
||||
vllm serve Kimi-K2-Thinking \
|
||||
--served-model-name kimi-k2-thinking \
|
||||
--tensor-parallel-size 16 \
|
||||
--enable_expert_parallel \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
|
||||
"model": "kimi-k2-thinking",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Who are you?"}
|
||||
],
|
||||
"temperature": 1.0
|
||||
}'
|
||||
```
|
||||
227
docs/source/tutorials/models/PaddleOCR-VL.md
Normal file
@@ -0,0 +1,227 @@
|
||||
# PaddleOCR-VL
|
||||
|
||||
## Introduction
|
||||
|
||||
PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition.
|
||||
|
||||
This document provides a detailed workflow for the complete deployment and verification of the model, including supported features, environment preparation, single-node deployment, and functional verification. It is designed to help users quickly complete model deployment and verification.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/index.html) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
* `PaddleOCR-VL-0.9B`: [PaddleOCR-VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)
|
||||
|
||||
It is recommended to download the model weights to a local directory (e.g., `./PaddleOCR-VL`) for quick access during deployment.
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `PaddleOCR-VL` directly.
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc1
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
#### Single NPU (PaddleOCR-VL)
|
||||
|
||||
PaddleOCR-VL supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service:
|
||||
|
||||
1. Prepare model weights: Ensure the downloaded model weights are stored in the `PaddleOCR-VL` directory.
|
||||
2. Create and execute the deployment script (save as `deploy.sh`):
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
|
||||
|
||||
vllm serve ${MODEL_PATH} \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--served-model-name PaddleOCR-VL-0.9B \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--mm-processor-cache-gb 0 \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
|
||||
```
|
||||
|
||||
#### Multiple NPU (PaddleOCR-VL)
|
||||
|
||||
Single-node deployment is recommended.
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
Not supported yet
|
||||
|
||||
## Functional Verification
|
||||
|
||||
If your service starts successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
INFO: Started server process [87471]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can use the OpenAI API client to make queries.
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(
|
||||
api_key="EMPTY",
|
||||
base_url="http://localhost:8000/v1",
|
||||
timeout=3600
|
||||
)
|
||||
|
||||
# Task-specific base prompts
|
||||
TASKS = {
|
||||
"ocr": "OCR:",
|
||||
"table": "Table Recognition:",
|
||||
"formula": "Formula Recognition:",
|
||||
"chart": "Chart Recognition:",
|
||||
}
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": TASKS["ocr"]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="PaddleOCR-VL-0.9B",
|
||||
messages=messages,
|
||||
temperature=0.0,
|
||||
)
|
||||
print(f"Generated text: {response.choices[0].message.content}")
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
Generated text: CINNAMON SUGAR
|
||||
1 x 17,000
|
||||
17,000
|
||||
SUB TOTAL
|
||||
17,000
|
||||
GRAND TOTAL
|
||||
17,000
|
||||
CASH IDR
|
||||
20,000
|
||||
CHANGE DUE
|
||||
3,000
|
||||
```
|
||||
|
||||
## Offline Inference with vLLM and PP-DocLayoutV2
|
||||
|
||||
In the example above, we demonstrated how to use vLLM to run inference with the PaddleOCR-VL-0.9B model. In practice, the PP-DocLayoutV2 model is usually integrated as well to fully leverage the capabilities of PaddleOCR-VL, in line with the examples provided by the official PaddlePaddle documentation.
|
||||
|
||||
:::{note}
|
||||
Use separate virtual environments for vLLM and PP-DocLayoutV2 to prevent dependency conflicts.
|
||||
:::
|
||||
|
||||
### Pull the PaddlePaddle-compatible CANN image
|
||||
|
||||
Obtain the Ascend image provided by PaddlePaddle:
|
||||
|
||||
```bash
|
||||
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-aarch64-gcc84
|
||||
```
|
||||
|
||||
Start the container using the following command:
|
||||
|
||||
```bash
|
||||
docker run -it --name paddle-npu-dev -v $(pwd):/work \
|
||||
--privileged --network=host --shm-size=128G -w=/work \
|
||||
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-e ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
|
||||
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-$(uname -m)-gcc84 /bin/bash
|
||||
```
|
||||
|
||||
### Install [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
|
||||
|
||||
```bash
|
||||
python -m pip install paddlepaddle==3.2.0
|
||||
wget https://paddle-whl.bj.bcebos.com/stable/npu/paddle-custom-npu/paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
|
||||
pip install paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
|
||||
python -m pip install -U "paddleocr[doc-parser]"
|
||||
pip install safetensors
|
||||
```
|
||||
|
||||
:::{note}
|
||||
The OpenCV system libraries may be missing; install them with:
|
||||
|
||||
```bash
|
||||
apt-get update
|
||||
apt-get install -y libgl1 libglib2.0-0
|
||||
```
|
||||
|
||||
CANN-8.0.0 does not support some versions of NumPy and OpenCV. It is recommended to install the specified versions.
|
||||
|
||||
```bash
|
||||
python -m pip install numpy==1.26.4
|
||||
python -m pip install opencv-python==3.4.18.65
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
### Using vLLM as the backend, combined with PP-DocLayoutV2 for offline inference
|
||||
|
||||
```python
|
||||
from paddleocr import PaddleOCRVL
|
||||
|
||||
doclayout_model_path = "/path/to/your/PP-DocLayoutV2/"
|
||||
|
||||
pipeline = PaddleOCRVL(vl_rec_backend="vllm-server",
|
||||
vl_rec_server_url="http://localhost:8000/v1",
|
||||
layout_detection_model_name="PP-DocLayoutV2",
|
||||
layout_detection_model_dir=doclayout_model_path,
|
||||
device="npu")
|
||||
|
||||
output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
|
||||
|
||||
for i, res in enumerate(output):
|
||||
res.save_to_json(save_path=f"output_{i}.json")
|
||||
res.save_to_markdown(save_path=f"output_{i}.md")
|
||||
```
|
||||
580
docs/source/tutorials/models/Qwen-VL-Dense.md
Normal file
@@ -0,0 +1,580 @@
|
||||
# Qwen-VL-Dense(Qwen2.5VL-3B/7B, Qwen3-VL-2B/4B/8B/32B)
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen-VL (Vision-Language) series from Alibaba Cloud comprises a family of powerful Large Vision-Language Models (LVLMs) designed for comprehensive multimodal understanding. They accept images, text, and bounding boxes as input, and output text and detection boxes, enabling advanced functions like image detection, multi-modal dialogue, and multi-image reasoning.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, NPU deployment, accuracy and performance evaluation.
|
||||
|
||||
This tutorial uses the vLLM-Ascend `v0.11.0rc3-a3` version for demonstration, showcasing the `Qwen3-VL-8B-Instruct` model as an example for single NPU deployment and the `Qwen2.5-VL-32B-Instruct` model as an example for multi-NPU deployment.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
The following weights require 1 Atlas 800I A2 (64G × 8) node or 1 Atlas 800 A3 (64G × 16) node:
|
||||
|
||||
- `Qwen2.5-VL-3B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)
|
||||
- `Qwen2.5-VL-7B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)
|
||||
- `Qwen2.5-VL-32B-Instruct`:[Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)
|
||||
- `Qwen2.5-VL-72B-Instruct`:[Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)
|
||||
- `Qwen3-VL-2B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-2B-Instruct)
|
||||
- `Qwen3-VL-4B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-4B-Instruct)
|
||||
- `Qwen3-VL-8B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-8B-Instruct)
|
||||
- `Qwen3-VL-32B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-32B-Instruct)
|
||||
|
||||
A sample Qwen2.5-VL quantization script can be found in the modelslim code repository. [Qwen2.5-VL Quantization Script Example](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/multimodal_vlm/Qwen2.5-VL/README.md)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Installation
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} single-NPU
|
||||
:sync: single
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} multi-NPU
|
||||
:sync: multi
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-v /data:/data \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
Setup environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
:::{note}
|
||||
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
|
||||
:::
|
||||
|
||||
## Deployment
|
||||
|
||||
### Offline Inference
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} Qwen3-VL-8B-Instruct
|
||||
:sync: single
|
||||
|
||||
Run the following script to execute offline inference on single-NPU:
|
||||
|
||||
```bash
|
||||
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
|
||||
```
|
||||
|
||||
```python
|
||||
from transformers import AutoProcessor
|
||||
from vllm import LLM, SamplingParams
|
||||
from qwen_vl_utils import process_vision_info
|
||||
|
||||
MODEL_PATH = "Qwen/Qwen3-VL-8B-Instruct"
|
||||
|
||||
llm = LLM(
|
||||
model=MODEL_PATH,
|
||||
max_model_len=16384,
|
||||
limit_mm_per_prompt={"image": 10},
|
||||
)
|
||||
|
||||
sampling_params = SamplingParams(
|
||||
max_completion_tokens=512
|
||||
)
|
||||
|
||||
image_messages = [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
|
||||
"min_pixels": 224 * 224,
|
||||
"max_pixels": 1280 * 28 * 28,
|
||||
},
|
||||
{"type": "text", "text": "Please provide a detailed description of this image"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
messages = image_messages
|
||||
|
||||
processor = AutoProcessor.from_pretrained(MODEL_PATH)
|
||||
prompt = processor.apply_chat_template(
|
||||
messages,
|
||||
tokenize=False,
|
||||
add_generation_prompt=True,
|
||||
)
|
||||
|
||||
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
|
||||
|
||||
mm_data = {}
|
||||
if image_inputs is not None:
|
||||
mm_data["image"] = image_inputs
|
||||
|
||||
llm_inputs = {
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": mm_data,
|
||||
}
|
||||
|
||||
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
|
||||
generated_text = outputs[0].outputs[0].text
|
||||
|
||||
print(generated_text)
|
||||
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```shell
|
||||
**Visual Components:**
|
||||
|
||||
1. **Abstract Geometric Icon (Left Side):**
|
||||
* The logo features a stylized, abstract icon on the left.
|
||||
* It is composed of interconnected lines and angular shapes, forming a complex, hexagonal-like structure.
|
||||
* The icon is rendered in a solid, thin blue line, giving it a modern, technological, and clean appearance.
|
||||
|
||||
2. **Text (Right Side):**
|
||||
* To the right of the icon, the name "TONGYI Qwen" is written.
|
||||
* **"TONGYI"** is written in uppercase letters in a bold, modern sans-serif font. The color is a medium blue, matching the icon's color.
|
||||
* **"Qwen"** is written below "TONGYI" in a slightly larger, bold, sans-serif font. The color of "Qwen" is a dark gray or black, creating a strong contrast with the blue text above it.
|
||||
* The text is aligned and spaced neatly, with "Qwen" appearing slightly larger and bolder than "TONGYI," emphasizing the proper noun.
|
||||
|
||||
**Overall Design and Aesthetics:**
|
||||
|
||||
* The logo has a clean, contemporary, and professional feel, suitable for a technology and AI product.
|
||||
* The use of blue conveys trust, innovation, and intelligence, while the dark gray adds stability and clarity.
|
||||
* The overall layout is balanced and symmetrical, with the icon and text arranged horizontally for easy recognition and memorability.
|
||||
* The design effectively communicates the product's high-tech nature while remaining brand-identifiable and straightforward.
|
||||
|
||||
The logo is designed to be easily recognizable across various media and scales, from digital screens to printed materials.
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} Qwen2.5-VL-32B-Instruct
|
||||
:sync: multi
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU:
|
||||
|
||||
```bash
|
||||
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
|
||||
```
|
||||
|
||||
```python
|
||||
from transformers import AutoProcessor
|
||||
from vllm import LLM, SamplingParams
|
||||
from qwen_vl_utils import process_vision_info
|
||||
|
||||
MODEL_PATH = "Qwen/Qwen2.5-VL-32B-Instruct"
|
||||
|
||||
llm = LLM(
|
||||
model=MODEL_PATH,
|
||||
tensor_parallel_size=2,
|
||||
max_model_len=16384,
|
||||
limit_mm_per_prompt={"image": 10},
|
||||
)
|
||||
|
||||
sampling_params = SamplingParams(
|
||||
max_completion_tokens=512
|
||||
)
|
||||
|
||||
image_messages = [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
|
||||
"min_pixels": 224 * 224,
|
||||
"max_pixels": 1280 * 28 * 28,
|
||||
},
|
||||
{"type": "text", "text": "Please provide a detailed description of this image"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
messages = image_messages
|
||||
|
||||
processor = AutoProcessor.from_pretrained(MODEL_PATH)
|
||||
prompt = processor.apply_chat_template(
|
||||
messages,
|
||||
tokenize=False,
|
||||
add_generation_prompt=True,
|
||||
)
|
||||
|
||||
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
|
||||
|
||||
mm_data = {}
|
||||
if image_inputs is not None:
|
||||
mm_data["image"] = image_inputs
|
||||
|
||||
llm_inputs = {
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": mm_data,
|
||||
}
|
||||
|
||||
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
|
||||
generated_text = outputs[0].outputs[0].text
|
||||
|
||||
print(generated_text)
|
||||
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```shell
|
||||
The image displays a logo and text related to the Qwen model, which is an artificial intelligence (AI) language model developed by Alibaba Cloud. Here is a detailed description of the elements in the image:
|
||||
|
||||
### **1. Logo:**
|
||||
- The logo on the left side of the image consists of a stylized, abstract geometric design.
|
||||
- The logo is primarily composed of interconnected lines and shapes that resemble a combination of arrows, lines, and geometric forms.
|
||||
- The lines are arranged in a triangular pattern, giving it a dynamic and modern appearance.
|
||||
- The lines are rendered in a dark blue color, and they form a three-dimensional, arrow-like structure. This conveys a sense of movement, forward momentum, or direction, which is often symbolic of progress and integration.
|
||||
- The design appears to be complex yet minimalistic, with clean and sharp lines.
|
||||
- The triangular and square-like structure suggests precision, connectivity, and innovation, which are often associated with technology and advanced systems.
|
||||
- This abstract, arrow-like design implies a sense of flow, direction, and connectivity, which aligns with themes of progress and technological advancement.
|
||||
|
||||
### **2. Text:**
|
||||
- **"TONGYI" (on the top right side):
|
||||
- The text is in dark blue, which is a color often associated with technology, stability, and trustworthiness.
|
||||
- The name "Tongyi" is written in a bold, sans-serif font, giving it a modern and professional look.
|
||||
- **"Qwen" (below "Tongyi"):
|
||||
- The font for "Qwen" is in a bold, uppercase format.
|
||||
- The style
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
### Online Serving
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} Qwen3-VL-8B-Instruct
|
||||
:sync: single
|
||||
|
||||
Run the following command inside the container to start the vLLM server on a single NPU:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
vllm serve Qwen/Qwen3-VL-8B-Instruct \
|
||||
--dtype bfloat16 \
|
||||
--max_model_len 16384 \
|
||||
--max-num-batched-tokens 16384
|
||||
```
|
||||
|
||||
:::{note}
|
||||
Add the `--max_model_len` option to avoid a ValueError stating that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in the KV cache. The limit differs across NPU series based on HBM size, so set a value suitable for your NPU.
|
||||
:::
|
||||
|
||||
If your service starts successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
INFO: Started server process [2736]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen3-VL-8B-Instruct",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": [
|
||||
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
|
||||
{"type": "text", "text": "What is the text in the illustrate?"}
|
||||
]}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"chatcmpl-d3270d4a16cb4b98936f71ee3016451f","object":"chat.completion","created":1764924127,"model":"Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is: **TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
INFO 12-05 08:42:07 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
|
||||
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct
|
||||
INFO 12-05 08:42:11 [acl_graph.py:187] Replaying aclgraph
|
||||
INFO: 127.0.0.1:60988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
|
||||
INFO 12-05 08:42:13 [loggers.py:127] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
INFO 12-05 08:42:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} Qwen2.5-VL-32B-Instruct
|
||||
:sync: multi
|
||||
|
||||
Run the following command inside the container to start the vLLM server on multiple NPUs:
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# if os is Ubuntu
|
||||
apt update
|
||||
apt install libjemalloc2
|
||||
# if os is openEuler
|
||||
yum update
|
||||
yum install jemalloc
|
||||
# Add the LD_PRELOAD environment variable
|
||||
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
|
||||
# On Ubuntu, first install with `apt install libjemalloc2`
|
||||
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
|
||||
# On openEuler, first install with `yum install jemalloc`
|
||||
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
|
||||
fi
|
||||
# Enable the AIVector core to directly schedule ROCE communication
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
# Set vLLM to Engine V1
|
||||
export VLLM_USE_V1=1
|
||||
|
||||
vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--async-scheduling \
|
||||
--tensor-parallel-size 2 \
|
||||
--max-model-len 30000 \
|
||||
--max-num-batched-tokens 50000 \
|
||||
--max-num-seqs 30 \
|
||||
--no-enable-prefix-caching \
|
||||
--trust-remote-code \
|
||||
--dtype bfloat16
|
||||
|
||||
```
|
||||
|
||||
:::{note}
|
||||
Add the `--max-model-len` option to avoid a ValueError stating that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in the KV cache. The appropriate value differs between NPU series depending on the HBM size, so adjust it to a value suitable for your NPU series.
|
||||
:::
|
||||
|
||||
If your service starts successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
INFO: Started server process [14431]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen2.5-VL-32B-Instruct",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": [
|
||||
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
|
||||
{"type": "text", "text": "What is the text in the illustrate?"}
|
||||
]}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"chatcmpl-c07088bf992a4b77a89d79480122a483","object":"chat.completion","created":1764905884,"model":"Qwen/Qwen2.5-VL-32B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is:\n\n**TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":89,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
|
||||
INFO 12-05 08:50:57 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
|
||||
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-32B-Instruct
|
||||
2025-12-05 08:50:58,913 - modelscope - INFO - Target directory already exists, skipping creation.
|
||||
INFO 12-05 08:51:00 [acl_graph.py:187] Replaying aclgraph
|
||||
INFO: 127.0.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
|
||||
INFO 12-05 08:51:10 [loggers.py:127] Engine 000: Avg prompt throughput: 7.3 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
INFO 12-05 08:51:20 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
### Using Language Model Evaluation Harness
|
||||
|
||||
The accuracy of some models is already within our CI monitoring scope, including:
|
||||
|
||||
- `Qwen2.5-VL-7B-Instruct`
|
||||
- `Qwen3-VL-8B-Instruct`
|
||||
|
||||
You can refer to the [monitoring configuration](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test_nightly_a2.yaml).
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} Qwen3-VL-8B-Instruct
|
||||
:sync: single
|
||||
|
||||
As an example, take the `mmmu_val` dataset as a test dataset, and run accuracy evaluation of `Qwen3-VL-8B-Instruct` in offline mode.
|
||||
|
||||
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more details on `lm_eval` installation.
|
||||
|
||||
```shell
|
||||
pip install lm_eval
|
||||
```
|
||||
|
||||
2. Run `lm_eval` to execute the accuracy evaluation.
|
||||
|
||||
```shell
|
||||
lm_eval \
|
||||
--model vllm-vlm \
|
||||
--model_args pretrained=Qwen/Qwen3-VL-8B-Instruct,max_model_len=8192,gpu_memory_utilization=0.7 \
|
||||
--tasks mmmu_val \
|
||||
--batch_size 32 \
|
||||
--apply_chat_template \
|
||||
--trust_remote_code \
|
||||
--output_path ./results
|
||||
```
|
||||
|
||||
3. After execution, you can get the result. Here is the result of `Qwen3-VL-8B-Instruct` on `vllm-ascend:0.11.0rc3`, for reference only.
|
||||
|
||||
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|
||||
|---------|------:|------|-----:|------|---|-----:|---|-----:|
|
||||
|mmmu_val | 0|none | |acc |↑ |0.5389|± |0.0159|
|
||||
|
||||
::::
|
||||
::::{tab-item} Qwen2.5-VL-32B-Instruct
|
||||
:sync: multi
|
||||
|
||||
As an example, take the `mmmu_val` dataset as a test dataset, and run accuracy evaluation of `Qwen2.5-VL-32B-Instruct` in offline mode.
|
||||
|
||||
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more details on `lm_eval` installation.
|
||||
|
||||
```shell
|
||||
pip install lm_eval
|
||||
```
|
||||
|
||||
2. Run `lm_eval` to execute the accuracy evaluation.
|
||||
|
||||
```shell
|
||||
lm_eval \
|
||||
--model vllm-vlm \
|
||||
--model_args pretrained=Qwen/Qwen2.5-VL-32B-Instruct,max_model_len=8192,tensor_parallel_size=2 \
|
||||
--tasks mmmu_val \
|
||||
--apply_chat_template \
|
||||
--trust_remote_code \
|
||||
--output_path ./results
|
||||
```
|
||||
|
||||
3. After execution, you can get the result. Here is the result of `Qwen2.5-VL-32B-Instruct` on `vllm-ascend:0.11.0rc3`, for reference only.
|
||||
|
||||
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|
||||
|---------|------:|------|-----:|------|---|-----:|---|-----:|
|
||||
|mmmu_val | 0|none | |acc |↑ |0.5744|± |0.0158|
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
## Performance
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
The performance evaluation must be conducted in online mode. Take `serve` as an example and run the command as follows.
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} Qwen3-VL-8B-Instruct
|
||||
:sync: single
|
||||
|
||||
```shell
|
||||
vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} Qwen2.5-VL-32B-Instruct
|
||||
:sync: multi
|
||||
|
||||
```shell
|
||||
vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
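Because `--save-result` is passed, the benchmark also writes a JSON summary into `--result-dir`. Below is a minimal sketch for printing the key metrics from that file; the metric key names (`mean_ttft_ms`, `mean_tpot_ms`, `request_throughput`, `output_throughput`) are assumptions based on typical vLLM benchmark output and may differ between vLLM versions.

```python
# Minimal sketch: summarize the JSON result saved by `vllm bench serve --save-result`.
# Key names are assumptions and may vary between vLLM versions.
import glob
import json

for path in sorted(glob.glob("./*.json")):
    with open(path) as f:
        result = json.load(f)
    print(f"== {path} ==")
    for key in ("request_throughput", "output_throughput", "mean_ttft_ms", "mean_tpot_ms"):
        print(f"  {key}: {result.get(key, 'n/a')}")
```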
|
||||
180
docs/source/tutorials/models/Qwen2.5-7B.md
Normal file
@@ -0,0 +1,180 @@
|
||||
# Qwen2.5-7B
|
||||
|
||||
## Introduction
|
||||
|
||||
Qwen2.5-7B-Instruct is the flagship instruction-tuned variant of Alibaba Cloud’s Qwen 2.5 LLM series. It supports a maximum context window of 128K, enables generation of up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling.
|
||||
|
||||
This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation.
|
||||
|
||||
The `Qwen2.5-7B-Instruct` model was supported since `vllm-ascend:v0.9.0`.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen2.5-7B-Instruct` (BF16 version): requires 1 910B4 card (32G × 1). [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
|
||||
|
||||
It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image and install extra operator for supporting `Qwen2.5-7B-Instruct`.
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} A3 series
|
||||
:sync: A3
|
||||
|
||||
1. Start the docker image on your node.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} A2 series
|
||||
:sync: A2
|
||||
|
||||
Start the docker image on your node.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service:
|
||||
|
||||
1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory.
|
||||
2. Create and execute the deployment script (save as `deploy.sh`):
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
export ASCEND_RT_VISIBLE_DEVICES=0
|
||||
export MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
|
||||
|
||||
vllm serve ${MODEL_PATH} \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--served-model-name qwen-2.5-7b-instruct \
|
||||
--trust-remote-code \
|
||||
--max-model-len 32768
|
||||
```
|
||||
|
||||
### Multi-node Deployment
|
||||
|
||||
Single-node deployment is recommended.
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
Not supported yet.
|
||||
|
||||
## Functional Verification
|
||||
|
||||
After starting the service, verify functionality using a `curl` request:
|
||||
|
||||
```shell
|
||||
curl http://<IP>:<Port>/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "qwen-2.5-7b-instruct",
|
||||
"prompt": "Beijing is a",
|
||||
"max_completion_tokens": 5,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
|
||||
|
||||
A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment.
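Besides `curl`, you can also query the server from Python through its OpenAI-compatible API. The snippet below is a minimal sketch that assumes the `openai` Python package is installed and the server started above is listening on `localhost:8000`.

```python
# Minimal sketch: query the OpenAI-compatible endpoint exposed by vLLM.
# Assumes `pip install openai` and the server from the deployment script above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="qwen-2.5-7b-instruct",  # matches --served-model-name
    prompt="Beijing is a",
    max_tokens=5,
    temperature=0,
)
print(completion.choices[0].text)
```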
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below:
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat |
|
||||
|----- | ----- | ----- | ----- |--------------|
|
||||
| gsm8k | - | accuracy | gen | 75.00 |
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `Qwen2.5-7B-Instruct` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take `serve` as an example and run the command as follows.
|
||||
|
||||
```shell
|
||||
vllm bench serve \
|
||||
--model ./Qwen2.5-7B-Instruct/ \
|
||||
--dataset-name random \
|
||||
--random-input 200 \
|
||||
--num-prompt 200 \
|
||||
--request-rate 1 \
|
||||
--save-result \
|
||||
--result-dir ./perf_results/
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
|
||||
210
docs/source/tutorials/models/Qwen2.5-Omni.md
Normal file
@@ -0,0 +1,210 @@
|
||||
# Qwen2.5-Omni-7B
|
||||
|
||||
## Introduction
|
||||
|
||||
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
|
||||
|
||||
The `Qwen2.5-Omni` model was supported since `vllm-ascend:v0.11.0rc0`. This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-NPU and multi-NPU deployment, accuracy and performance evaluation.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)
|
||||
- `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
|
||||
|
||||
The following examples use the 7B version by default.
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `Qwen2.5-Omni` directly.
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /mnt/sfs_turbo/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
#### Single NPU (Qwen2.5-Omni-7B)
|
||||
|
||||
:::{note}
|
||||
The `LOCAL_MEDIA_PATH` environment variable allows API requests to read local images or videos from directories on the server file system. Note that this is a security risk and it should only be enabled in trusted environments.
:::
|
||||
|
||||
```bash
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"
|
||||
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
|
||||
|
||||
vllm serve "${MODEL_PATH}" \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--served-model-name Qwen-Omni \
|
||||
--allowed-local-media-path ${LOCAL_MEDIA_PATH} \
|
||||
--trust-remote-code \
|
||||
--compilation-config '{"full_cuda_graph": 1}' \
|
||||
--no-enable-prefix-caching
|
||||
```
|
||||
|
||||
:::{note}
|
||||
The vllm-ascend docker image should already contain the vllm `[audio]` build. If you encounter an *audio not supported* issue by any chance, please re-build vllm with the `[audio]` flag:
|
||||
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
`--allowed-local-media-path` is optional; only set it if you need to run inference on local media files.
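For reference, a request that reads a local image could look like the sketch below. It assumes the `openai` Python package is installed, the server was started with `--allowed-local-media-path` as above, and the referenced file actually exists under that directory; the `file://` URL form and the file name are illustrative assumptions.

```python
# Minimal sketch: send a chat request that references a local image file.
# The file path below is an assumed example; it must live under --allowed-local-media-path.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen-Omni",  # matches --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "file:///root/.cache/vllm/assets/vllm_public_assets/example.png"}},
        ],
    }],
    max_tokens=100,
)
print(response.choices[0].message.content)
```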
|
||||
|
||||
`--gpu-memory-utilization` should not be set manually unless you know what this parameter does.
|
||||
|
||||
#### Multiple NPU (Qwen2.5-Omni-7B)
|
||||
|
||||
```bash
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
|
||||
export LOCAL_MEDIA_PATH=/local_path/to_media/
|
||||
export DP_SIZE=8
|
||||
|
||||
vllm serve ${MODEL_PATH} \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--served-model-name Qwen-Omni \
|
||||
--allowed-local-media-path ${LOCAL_MEDIA_PATH} \
|
||||
--trust-remote-code \
|
||||
--compilation-config '{"full_cuda_graph": 1}' \
|
||||
--data-parallel-size ${DP_SIZE} \
|
||||
--no-enable-prefix-caching
|
||||
```
|
||||
|
||||
`--tensor-parallel-size` does not need to be set for this 7B model, but if you do need tensor parallelism, the TP size can be 1, 2, or 4.
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
Not supported yet.
|
||||
|
||||
## Functional Verification
|
||||
|
||||
If your service starts successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
INFO: Started server process [2736]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
|
||||
"model": "Qwen-Omni",
|
||||
"messages": [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What is the text in the illustrate?"
|
||||
},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"max_completion_tokens": 100,
|
||||
"temperature": 0.7
|
||||
}'
|
||||
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"chatcmpl-a70a719c12f7445c8204390a8d0d8c97","object":"chat.completion","created":1764056861,"model":"Qwen-Omni","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen\".","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":88,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
|
||||
```
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Qwen2.5-Omni on vllm-ascend has been tested with AISBench.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `Qwen2.5-Omni-7B` with `vllm-ascend:0.11.0rc0`, for reference only.
|
||||
|
||||
| dataset | platform | metric | mode | vllm-api-stream-chat |
|
||||
|----- | ----- | ----- | ----- | -----|
|
||||
| textVQA | A2 | accuracy | gen_base64 | 83.47 |
|
||||
| textVQA | A3 | accuracy | gen_base64 | 84.04 |
|
||||
|
||||
## Performance Evaluation
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `Qwen2.5-Omni-7B` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take `serve` as an example and run the command as follows.
|
||||
|
||||
```shell
|
||||
vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
|
||||
625
docs/source/tutorials/models/Qwen3-235B-A22B.md
Normal file
@@ -0,0 +1,625 @@
|
||||
# Qwen3-235B-A22B
|
||||
|
||||
## Introduction
|
||||
|
||||
Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
|
||||
|
||||
The `Qwen3-235B-A22B` model is first supported in `vllm-ascend:v0.8.4rc2`.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-235B-A22B` (BF16 version): requires 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node, or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
|
||||
- `Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node, or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Verify Multi-node Communication(Optional)
|
||||
|
||||
If you want to deploy multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
|
||||
|
||||
### Installation
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} Use docker image
|
||||
|
||||
For example, using images `quay.io/ascend/vllm-ascend:v0.11.0rc2` (for Atlas 800 A2) and `quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` (for Atlas 800 A3).
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} Build from source
|
||||
|
||||
You can build all from source.
|
||||
|
||||
- Install `vllm-ascend`, refer to [set up using python](../../installation.md#set-up-using-python).
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
If you want to deploy a multi-node environment, you need to set up the environment on each node.
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node.
|
||||
The quantized version needs to be started with the `--quantization ascend` parameter.
|
||||
|
||||
Run the following script to execute online 128k inference.
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=512
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
|
||||
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--tensor-parallel-size 8 \
|
||||
--data-parallel-size 1 \
|
||||
--seed 1024 \
|
||||
--quantization ascend \
|
||||
--served-model-name qwen3 \
|
||||
--max-num-seqs 32 \
|
||||
--max-model-len 133000 \
|
||||
--max-num-batched-tokens 8096 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
|
||||
--async-scheduling
|
||||
```
|
||||
|
||||
**Notice:**
|
||||
|
||||
- [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) originally only supports a 40960-token context (max_position_embeddings). If you want to use it or its related quantized weights to run long sequences (such as a 128k context), you need to use the YaRN rope-scaling technique.
|
||||
- For vLLM versions `v0.12.0` or newer, use the parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`.
|
||||
- For vLLM versions below `v0.12.0`, use the parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`.
|
||||
If you are using weights like [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) which originally supports long contexts, there is no need to add this parameter.
|
||||
|
||||
The parameters are explained as follows:
|
||||
|
||||
- `--data-parallel-size` 1 and `--tensor-parallel-size` 8 are common settings for data parallelism (DP) and tensor parallelism (TP) sizes.
|
||||
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
|
||||
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
|
||||
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
|
||||
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
|
||||
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
|
||||
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
|
||||
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`. A small worked example of this formula is shown after this list.
|
||||
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
|
||||
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
|
||||
- `--quantization` "ascend" indicates that quantization is used. To disable quantization, remove this option.
|
||||
- `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are "cudagraph_mode" and "cudagraph_capture_sizes", which have the following meanings:
|
||||
"cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
|
||||
- "cudagraph_capture_sizes": represents different levels of graph modes. The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. In the graph mode, the input for graphs at different levels is fixed, and inputs between levels are automatically padded to the next level. Currently, the default setting is recommended. Only in some scenarios is it necessary to set this separately to achieve optimal performance.
|
||||
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.
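A small worked sketch of the kv_cache formula from the `--gpu-memory-utilization` item above is shown below. All numbers (64 GB HBM per card, a 45 GB peak during the profile run) are illustrative assumptions, not measured values.

```python
# Worked sketch of: kv_cache = gpu_memory_utilization * HBM size - peak profile-run usage.
# The HBM size and peak usage below are assumptions for illustration only.
hbm_gb = 64.0                 # HBM per card (assumed)
gpu_memory_utilization = 0.9  # value passed via --gpu-memory-utilization
peak_profile_gb = 45.0        # peak memory recorded during the profile run (assumed)

kv_cache_gb = gpu_memory_utilization * hbm_gb - peak_profile_gb
print(f"Available KV cache per card: {kv_cache_gb:.1f} GB")  # 12.6 GB under these assumptions
```

Under the same assumptions, raising `--gpu-memory-utilization` to 0.95 would add about 3.2 GB of KV cache per card, at the cost of a smaller safety margin against OOM during real inference.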
|
||||
|
||||
### Multi-node Deployment with MP (Recommended)
|
||||
|
||||
Assume you have Atlas 800 A3 (64G × 16) nodes (or 2 Atlas 800 A2 nodes) and want to deploy the `Qwen3-235B-A22B` model across multiple nodes.
|
||||
|
||||
Node 0
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
# These values are obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
|
||||
vllm serve vllm-ascend/Qwen3-235B-A22B \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--data-parallel-size 2 \
|
||||
--api-server-count 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-address $local_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--seed 1024 \
|
||||
--served-model-name qwen3 \
|
||||
--tensor-parallel-size 8 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 32768 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--async-scheduling \
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
Node 1
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
# These values are obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
|
||||
vllm serve vllm-ascend/Qwen3-235B-A22B \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--headless \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-start-rank 1 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--seed 1024 \
|
||||
--tensor-parallel-size 8 \
|
||||
--served-model-name qwen3 \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 32768 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--async-scheduling \
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
If the service starts successfully, the following information will be displayed on node 0:
|
||||
|
||||
```shell
|
||||
INFO: Started server process [44610]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
INFO: Started server process [44611]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
### Multi-node Deployment with Ray
|
||||
|
||||
- refer to [Ray Distributed (Qwen/Qwen3-235B-A22B)](../features/ray.md).
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
- refer to [Prefill-Decode Disaggregation Mooncake Verification (Qwen)](../features/pd_disaggregation_mooncake_multi_node.md)
|
||||
|
||||
## Functional Verification
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```shell
|
||||
curl http://<node0_ip>:<port>/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "qwen3",
|
||||
"prompt": "The future of AI is",
|
||||
"max_completion_tokens": 50,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Accuracy can be evaluated as follows.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `Qwen3-235B-A22B-w8a8` in `vllm-ascend:0.11.0rc0`, for reference only.
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat |
|
||||
|----- | ----- | ----- | ----- | -----|
|
||||
| cevaldataset | - | accuracy | gen | 91.16 |
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take `serve` as an example and run the command as follows.
|
||||
|
||||
```shell
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
|
||||
|
||||
## Reproducing Performance Results
|
||||
|
||||
In this section, we provide simple scripts to reproduce our latest performance results. It is also recommended to read the instructions above to understand the basic concepts and options in vLLM and vLLM-Ascend.
|
||||
|
||||
### Environment
|
||||
|
||||
- vLLM v0.13.0
|
||||
- vLLM-Ascend v0.13.0rc1
|
||||
- CANN 8.3.RC2
|
||||
- torch_npu 2.8.0
|
||||
- HDK/driver 25.3.RC1
|
||||
- triton_ascend 3.2.0
|
||||
|
||||
### Single Node A3 (64G*16)
|
||||
|
||||
Example server scripts:
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=512
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
|
||||
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--tensor-parallel-size 4 \
|
||||
--data-parallel-size 4 \
|
||||
--seed 1024 \
|
||||
--quantization ascend \
|
||||
--served-model-name qwen3 \
|
||||
--max-num-seqs 128 \
|
||||
--max-model-len 40960 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--no-enable-prefix-caching \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
|
||||
--async-scheduling
|
||||
```
|
||||
|
||||
Benchmark scripts:
|
||||
|
||||
```shell
|
||||
vllm bench serve --model qwen3 \
|
||||
--tokenizer vllm-ascend/Qwen3-235B-A22B-w8a8 \
|
||||
--ignore-eos \
|
||||
--dataset-name random \
|
||||
--random-input-len 3584 \
|
||||
--random-output-len 1536 \
|
||||
--num-prompts 800 \
|
||||
--max-concurrency 160 \
|
||||
--request-rate 24 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000
|
||||
```
|
||||
|
||||
Reference test results:
|
||||
|
||||
| num_requests | concurrency | mean TTFT(ms) | mean TPOT(ms) | output token throughput (tok/s) |
|
||||
|----- | ----- | ----- | ----- | -----|
|
||||
| 720 | 144 | 4717.45 | 48.69 | 2761.72 |
|
||||
|
||||
Note:
|
||||
|
||||
1. Setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` enables MoE fused operators that reduce the time spent in MoE during both prefill and decode. This is an experimental feature that currently only supports W8A8 quantization on Atlas A3 servers. If you encounter any problems when using this feature, you can disable it by setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` and report the issue to the vLLM-Ascend community.
|
||||
2. Here we disable the prefix cache because the dataset is random. You can enable the prefix cache if requests share a long common prefix.
|
||||
|
||||
### Three Node A3 -- PD disaggregation
|
||||
|
||||
On three Atlas 800 A3 (64G × 16) servers, we recommend using one node as one prefill instance and two nodes as one decode instance. Example server scripts:
|
||||
Prefill Node 1
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
export HCCL_IF_IP=prefill_node_1_ip
|
||||
|
||||
# Set ifname according to your network setting
|
||||
ifname=""
|
||||
|
||||
export GLOO_SOCKET_IFNAME=${ifname}
|
||||
export TP_SOCKET_IFNAME=${ifname}
|
||||
export HCCL_SOCKET_IFNAME=${ifname}
|
||||
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=512
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
|
||||
|
||||
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--tensor-parallel-size 8 \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-start-rank 0 \
|
||||
--data-parallel-address prefill_node_1_ip \
|
||||
--data-parallel-rpc-port prefill_node_dp_port \
|
||||
--seed 1024 \
|
||||
--quantization ascend \
|
||||
--served-model-name qwen3 \
|
||||
--max-num-seqs 24 \
|
||||
--max-model-len 40960 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--enable-expert-parallel \
|
||||
--enforce-eager \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--no-enable-prefix-caching \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_port": "30000",
|
||||
"engine_id": "0",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 8,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
Decode Node 1
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
export HCCL_IF_IP=decode_node_1_ip
|
||||
|
||||
ifname=""
|
||||
|
||||
export GLOO_SOCKET_IFNAME=${ifname}
|
||||
export TP_SOCKET_IFNAME=${ifname}
|
||||
export HCCL_SOCKET_IFNAME=${ifname}
|
||||
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
|
||||
|
||||
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--tensor-parallel-size 4 \
|
||||
--data-parallel-size 8 \
|
||||
--data-parallel-size-local 4 \
|
||||
--data-parallel-start-rank 0 \
|
||||
--data-parallel-address decode_node_1_ip \
|
||||
--data-parallel-rpc-port decode_node_dp_port \
|
||||
--seed 1024 \
|
||||
--quantization ascend \
|
||||
--served-model-name qwen3 \
|
||||
--max-num-seqs 128 \
|
||||
--max-model-len 40960 \
|
||||
--max-num-batched-tokens 256 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
|
||||
--async-scheduling \
|
||||
--no-enable-prefix-caching \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_port": "30100",
|
||||
"engine_id": "1",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 8,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
Decode Node 2
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
export HCCL_IF_IP=decode_node_2_ip
|
||||
|
||||
ifname=""
|
||||
|
||||
export GLOO_SOCKET_IFNAME=${ifname}
|
||||
export TP_SOCKET_IFNAME=${ifname}
|
||||
export HCCL_SOCKET_IFNAME=${ifname}
|
||||
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
|
||||
|
||||
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--headless \
|
||||
--tensor-parallel-size 4 \
|
||||
--data-parallel-size 8 \
|
||||
--data-parallel-size-local 4 \
|
||||
--data-parallel-start-rank 4 \
|
||||
--data-parallel-address decode_node_1_ip \
|
||||
--data-parallel-rpc-port decode_node_dp_port \
|
||||
--seed 1024 \
|
||||
--quantization ascend \
|
||||
--served-model-name qwen3 \
|
||||
--max-num-seqs 128 \
|
||||
--max-model-len 40960 \
|
||||
--max-num-batched-tokens 256 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
|
||||
--async-scheduling \
|
||||
--no-enable-prefix-caching \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_port": "30100",
|
||||
"engine_id": "1",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 8,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
PD proxy:
|
||||
|
||||
```shell
|
||||
python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-port 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000
|
||||
```
|
||||
|
||||
Benchmark scripts:
|
||||
|
||||
```shell
|
||||
vllm bench serve --model qwen3 \
|
||||
--tokenizer vllm-ascend/Qwen3-235B-A22B-w8a8 \
|
||||
--ignore-eos \
|
||||
--dataset-name random \
|
||||
--random-input-len 3584 \
|
||||
--random-output-len 1536 \
|
||||
--num-prompts 2880 \
|
||||
--max-concurrency 576 \
|
||||
--request-rate 8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 12347
|
||||
```
|
||||
|
||||
Reference test results:
|
||||
|
||||
| num_requests | concurrency | mean TTFT(ms) | mean TPOT(ms) | output token throughput (tok/s) |
|
||||
|----- | ----- | ----- | ----- | -----|
|
||||
| 2880 | 576 | 3735.98 | 52.07 | 8593.44 |
|
||||
|
||||
Note:
|
||||
|
||||
1. We recommend setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=2` in this scenario (typically EP32 for Qwen3-235B). This enables a different MoE fusion operator.
|
||||
113
docs/source/tutorials/models/Qwen3-30B-A3B.md
Normal file
@@ -0,0 +1,113 @@
|
||||
# Qwen3-30B-A3B
|
||||
|
||||
## Run vllm-ascend on Multi-NPU with Qwen3 MoE
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
# For Atlas A2 machines:
|
||||
# export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
# For Atlas A3 machines:
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
### Online Inference on Multi-NPU
|
||||
|
||||
Run the following script to start the vLLM server on Multi-NPU:
|
||||
|
||||
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32 GB of memory, tensor-parallel-size should be at least 4.
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable-expert-parallel
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
|
||||
"model": "Qwen/Qwen3-30B-A3B",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Give me a short introduction to large language models."}
|
||||
],
|
||||
"temperature": 0.6,
|
||||
"top_p": 0.95,
|
||||
"top_k": 20,
|
||||
"max_completion_tokens": 4096
|
||||
}'
|
||||
```
|
||||
|
||||
### Offline Inference on Multi-NPU
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU:
|
||||
|
||||
```python
|
||||
import gc
|
||||
import torch
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.distributed.parallel_state import (destroy_distributed_environment,
|
||||
destroy_model_parallel)
|
||||
|
||||
def clean_up():
|
||||
destroy_model_parallel()
|
||||
destroy_distributed_environment()
|
||||
gc.collect()
|
||||
torch.npu.empty_cache()
|
||||
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
|
||||
llm = LLM(model="Qwen/Qwen3-30B-A3B",
|
||||
tensor_parallel_size=4,
|
||||
distributed_executor_backend="mp",
|
||||
max_model_len=4096,
|
||||
enable_expert_parallel=True)
|
||||
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
del llm
|
||||
clean_up()
|
||||
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Prompt: 'Hello, my name is', Generated text: " Lucy. I'm from the UK and I'm 11 years old."
|
||||
Prompt: 'The future of AI is', Generated text: ' a topic that has captured the imagination of scientists, philosophers, and the general public'
|
||||
```
|
||||
143
docs/source/tutorials/models/Qwen3-32B-W4A4.md
Normal file
@@ -0,0 +1,143 @@
|
||||
# Qwen3-32B-W4A4
|
||||
|
||||
## Introduction
|
||||
|
||||
W4A4 flat quantization provides better model compression and inference efficiency on Ascend devices.
|
||||
W4A4 has been supported in vllm-ascend since `v0.11.0rc1`; in modelslim, it is supported since `tag_MindStudio_8.2.RC1.B120_002`.
|
||||
|
||||
The following steps will show how to quantize Qwen3 32B to W4A4.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Run Docker Container
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
### Install modelslim and Convert Model
|
||||
|
||||
:::{note}
|
||||
You can choose to convert the model yourself or use the quantized model we uploaded,
|
||||
see <https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W4A4>
|
||||
:::
|
||||
|
||||
```bash
|
||||
git clone -b tag_MindStudio_8.2.RC1.B120_002 https://gitcode.com/Ascend/msit
|
||||
cd msit/msmodelslim
|
||||
|
||||
# Install by run this script
|
||||
bash install.sh
|
||||
pip install accelerate
|
||||
# transformers 4.51.0 is required for Qwen3 series model
|
||||
# see https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen/README.md#%E7%8E%AF%E5%A2%83%E9%85%8D%E7%BD%AE
|
||||
pip install transformers==4.51.0
|
||||
|
||||
cd example/Qwen
|
||||
# Original weight path, Replace with your local model path
|
||||
MODEL_PATH=/home/models/Qwen3-32B
|
||||
# Path to save converted weight, Replace with your local path
|
||||
SAVE_PATH=/home/models/Qwen3-32B-w4a4
|
||||
# Set two idle NPU cards
|
||||
export ASCEND_RT_VISIBLE_DEVICES=0,1
|
||||
|
||||
python3 w4a4.py --model_path $MODEL_PATH \
|
||||
--save_directory $SAVE_PATH \
|
||||
--calib_file ./calib_data/qwen3_cot_w4a4.json \
|
||||
--trust_remote_code True \
|
||||
--batch_size 1
|
||||
```
|
||||
|
||||
### Verify the Quantized Model
|
||||
|
||||
The converted model files look like:
|
||||
|
||||
```bash
|
||||
.
|
||||
|-- config.json
|
||||
|-- configuration.json
|
||||
|-- generation_config.json
|
||||
|-- quant_model_description.json
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00001-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00002-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00003-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00004-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00005-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00006-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00007-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00008-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00009-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00010-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic-00011-of-00011.safetensors
|
||||
|-- quant_model_weight_w4a4_flatquant_dynamic.safetensors.index.json
|
||||
|-- tokenizer.json
|
||||
|-- tokenizer_config.json
|
||||
`-- vocab.json
|
||||
```
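
As a quick sanity check (a minimal sketch, assuming the `SAVE_PATH` used above), you can confirm that all eleven weight shards are present and peek at the per-weight quantization description:

```bash
cd /home/models/Qwen3-32B-w4a4
# All 11 safetensors shards listed above should be present.
ls quant_model_weight_w4a4_flatquant_dynamic-*.safetensors | wc -l
# Peek at the per-weight quantization description produced by msmodelslim.
head -c 400 quant_model_description.json
```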
|
||||
|
||||
## Deployment
|
||||
|
||||
### Online Serving on Single NPU
|
||||
|
||||
```bash
|
||||
vllm serve /home/models/Qwen3-32B-w4a4 --served-model-name "qwen3-32b-w4a4" --max-model-len 4096 --quantization ascend
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "qwen3-32b-w4a4",
|
||||
"prompt": "what is large language model?",
|
||||
"max_completion_tokens": "128",
|
||||
"top_p": "0.95",
|
||||
"top_k": "40",
|
||||
"temperature": "0.0"
|
||||
}'
|
||||
```
|
||||
|
||||
### Offline Inference on Single NPU
|
||||
|
||||
:::{note}
|
||||
To enable quantization on Ascend, the quantization method must be set to `ascend`.
|
||||
:::
|
||||
|
||||
```python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
|
||||
|
||||
llm = LLM(model="/home/models/Qwen3-32B-w4a4",
|
||||
max_model_len=4096,
|
||||
quantization="ascend")
|
||||
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
141
docs/source/tutorials/models/Qwen3-8B-W4A8.md
Normal file
@@ -0,0 +1,141 @@
|
||||
# Qwen3-8B-W4A8
|
||||
|
||||
## Run Docker Container
|
||||
|
||||
:::{note}
|
||||
The W4A8 quantization feature is supported since `v0.9.1rc2`.
|
||||
:::
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
## Install modelslim and Convert Model
|
||||
|
||||
:::{note}
|
||||
You can choose to convert the model yourself or use the quantized model we uploaded,
|
||||
see <https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8>
|
||||
:::
|
||||
|
||||
```bash
|
||||
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
|
||||
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitcode.com/Ascend/msit
|
||||
|
||||
cd msit/msmodelslim
|
||||
|
||||
# Install by run this script
|
||||
bash install.sh
|
||||
pip install accelerate
|
||||
|
||||
cd example/Qwen
|
||||
# Original weight path, Replace with your local model path
|
||||
MODEL_PATH=/home/models/Qwen3-8B
|
||||
# Path to save converted weight, Replace with your local path
|
||||
SAVE_PATH=/home/models/Qwen3-8B-w4a8
|
||||
# Set an idle NPU card
|
||||
export ASCEND_RT_VISIBLE_DEVICES=0
|
||||
|
||||
python quant_qwen.py \
|
||||
--model_path $MODEL_PATH \
|
||||
--save_directory $SAVE_PATH \
|
||||
--device_type npu \
|
||||
--model_type qwen3 \
|
||||
--calib_file None \
|
||||
--anti_method m6 \
|
||||
--anti_calib_file ./calib_data/mix_dataset.json \
|
||||
--w_bit 4 \
|
||||
--a_bit 8 \
|
||||
--is_lowbit True \
|
||||
--open_outlier False \
|
||||
--group_size 256 \
|
||||
--is_dynamic True \
|
||||
--trust_remote_code True \
|
||||
--w_method HQQ
|
||||
```
|
||||
|
||||
## Verify the Quantized Model
|
||||
|
||||
The converted model files look like:
|
||||
|
||||
```bash
|
||||
.
|
||||
|-- config.json
|
||||
|-- configuration.json
|
||||
|-- generation_config.json
|
||||
|-- merges.txt
|
||||
|-- quant_model_description.json
|
||||
|-- quant_model_weight_w4a8_dynamic-00001-of-00003.safetensors
|
||||
|-- quant_model_weight_w4a8_dynamic-00002-of-00003.safetensors
|
||||
|-- quant_model_weight_w4a8_dynamic-00003-of-00003.safetensors
|
||||
|-- quant_model_weight_w4a8_dynamic.safetensors.index.json
|
||||
|-- README.md
|
||||
|-- tokenizer.json
|
||||
`-- tokenizer_config.json
|
||||
```
|
||||
|
||||
Run the following script to start the vLLM server with the quantized model:
|
||||
|
||||
```bash
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
export MODEL_PATH=vllm-ascend/Qwen3-8B-W4A8
|
||||
vllm serve ${MODEL_PATH} --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "qwen3-8b-w4a8",
|
||||
"prompt": "what is large language model?",
|
||||
"max_completion_tokens": "128",
|
||||
"top_p": "0.95",
|
||||
"top_k": "40",
|
||||
"temperature": "0.0"
|
||||
}'
|
||||
```
|
||||
|
||||
Run the following script to execute offline inference on single-NPU with the quantized model:
|
||||
|
||||
:::{note}
|
||||
To enable quantization on Ascend, the quantization method must be set to `ascend`.
|
||||
:::
|
||||
|
||||
```python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
|
||||
|
||||
llm = LLM(model="/home/models/Qwen3-8B-w4a8",
|
||||
max_model_len=4096,
|
||||
quantization="ascend")
|
||||
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
105
docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md
Normal file
@@ -0,0 +1,105 @@
|
||||
# Qwen3-Coder-30B-A3B
|
||||
|
||||
## Introduction
|
||||
|
||||
The newly released Qwen3-Coder-30B-A3B employs a sparse MoE architecture for efficient training and inference, delivering significant optimizations in agentic coding, extended context support of up to 1M tokens, and versatile function calling.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, accuracy and performance evaluation.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
`Qwen3-Coder-30B-A3B-Instruct` (BF16 version): requires 1 Atlas 800 A3 node (with 16x 64G NPUs) or 1 Atlas 800 A2 node (with 8x 64G/32G NPUs). [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Coder-30B-A3B-Instruct)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Installation
|
||||
|
||||
`Qwen3-Coder` is first supported in `vllm-ascend:v0.10.0rc1`; please run this model with that version or a later one.
|
||||
|
||||
You can use our official docker image to run `Qwen3-Coder-30B-A3B-Instruct` directly.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc1
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
In addition, if you don't want to use the docker image as above, you can also build all from source:
|
||||
|
||||
- Install `vllm-ascend` from source, refer to [installation](../../installation.md).
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
Run the following script to execute online inference.
|
||||
|
||||
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32 GB of memory, tensor-parallel-size should be at least 4.
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
|
||||
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --served-model-name qwen3-coder --tensor-parallel-size 4 --enable_expert_parallel
|
||||
```
|
||||
|
||||
## Functional Verification
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```shell
|
||||
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
|
||||
"model": "qwen3-coder",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Give me a short introduction to large language models."}
|
||||
],
|
||||
"temperature": 0.6,
|
||||
"top_p": 0.95,
|
||||
"top_k": 20,
|
||||
"max_completion_tokens": 4096
|
||||
}'
|
||||
```
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `Qwen3-Coder-30B-A3B-Instruct` in `vllm-ascend:0.11.0rc0`, for reference only.
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat |
|
||||
|----- | ----- | ----- | ----- | -----|
|
||||
| openai_humaneval | f4a973 | humaneval_pass@1 | gen | 94.51 |
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
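
Alternatively, you can run a quick serving benchmark with `vllm bench serve`. The sketch below assumes the server started in the deployment section above (served model name `qwen3-coder`, default port 8000); the prompt count and request rate are illustrative and can be adjusted to your scenario.

```bash
# Sketch: random-prompt serving benchmark against the qwen3-coder server started above.
vllm bench serve --model Qwen/Qwen3-Coder-30B-A3B-Instruct --served-model-name qwen3-coder \
    --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 \
    --save-result --result-dir ./
```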
|
||||
392
docs/source/tutorials/models/Qwen3-Dense.md
Normal file
@@ -0,0 +1,392 @@
|
||||
# Qwen3-Dense(Qwen3-0.6B/8B/32B)
|
||||
|
||||
## Introduction
|
||||
|
||||
Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
|
||||
|
||||
Welcome to the tutorial on optimizing Qwen Dense models in the vLLM-Ascend environment. This guide will help you configure the most effective settings for your use case, with practical examples that highlight key optimization points. We will also explore how adjusting service parameters can maximize throughput performance across various scenarios.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, accuracy and performance evaluation.
|
||||
|
||||
The Qwen3 Dense models are first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429).
|
||||
|
||||
## **Note**
|
||||
|
||||
This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-0.6B` (BF16 version): requires 1 Atlas 800 A3 (64G × 2) card or 1 Atlas 800I A2 (64G × 1) card. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-0.6B)
- `Qwen3-1.7B` (BF16 version): requires 1 Atlas 800 A3 (64G × 2) card or 1 Atlas 800I A2 (64G × 1) card. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-1.7B)
- `Qwen3-4B` (BF16 version): requires 1 Atlas 800 A3 (64G × 2) card or 1 Atlas 800I A2 (64G × 1) card. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-4B)
- `Qwen3-8B` (BF16 version): requires 1 Atlas 800 A3 (64G × 2) card or 1 Atlas 800I A2 (64G × 1) card. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-8B)
- `Qwen3-14B` (BF16 version): requires 1 Atlas 800 A3 (64G × 2) card or 2 Atlas 800I A2 (64G × 1) cards. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-14B)
- `Qwen3-32B` (BF16 version): requires 2 Atlas 800 A3 (64G × 4) cards or 4 Atlas 800I A2 (64G × 4) cards. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-32B)
- `Qwen3-32B-W8A8` (Quantized version): requires 2 Atlas 800 A3 (64G × 4) cards or 4 Atlas 800I A2 (64G × 4) cards. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W8A8)
|
||||
|
||||
These are the recommended numbers of cards, which can be adjusted according to the actual situation.
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Verify Multi-node Communication(Optional)
|
||||
|
||||
If you want to deploy multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run Qwen3 Dense models.
Currently, we provide all-in-one images. [Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)
|
||||
|
||||
#### Docker Pull (by tag)
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
|
||||
docker pull quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
|
||||
```
|
||||
|
||||
#### Docker run
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# For Atlas A2 machines:
|
||||
# export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
# For Atlas A3 machines:
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
|
||||
docker run --rm \
|
||||
--name vllm-ascend-env \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) so that developers can apply code changes immediately without requiring a new installation.
|
||||
|
||||
The [Run docker container](./Qwen3-Dense.md#run-docker-container) section below provides detailed explanations through specific examples.
|
||||
|
||||
In addition, if you don't want to use the docker image as above, you can also build all from source:
|
||||
|
||||
- Install `vllm-ascend` from source, refer to [installation](../../installation.md).
|
||||
|
||||
If you want to deploy multi-node environment, you need to set up environment on each node.
|
||||
|
||||
## Deployment
|
||||
|
||||
In this section, we will demonstrate best practices for adjusting hyperparameters in vLLM-Ascend to maximize inference throughput performance. By tailoring service-level configurations to fit different use cases, you can ensure that your system performs optimally across various scenarios. We will guide you through how to fine-tune hyperparameters based on observed phenomena, such as max_model_len, max_num_batched_tokens, and cudagraph_capture_sizes, to achieve the best performance.
|
||||
|
||||
The specific example scenario is as follows:
|
||||
|
||||
- The machine environment is an Atlas 800 A3 (64G*16)
|
||||
- The LLM is Qwen3-32B-W8A8
|
||||
- The data scenario is a fixed-length input of 3.5K and an output of 1.5K.
|
||||
- The parallel configuration requirement is DP=1 and TP=4.
|
||||
- If the machine environment is an **Atlas 800I A2(64G*8)**, the deployment approach stays identical.
|
||||
|
||||
### Run docker container
|
||||
|
||||
#### **Note**
|
||||
|
||||
- `/model/Qwen3-32B-W8A8` is the model path; replace it with your actual path.
- `v0.11.0rc2-a3` is the image tag; replace it with your actual tag.
- Replace `-p 8113:8113` with your actual port mapping.
- Replace `--device /dev/davinci0` with your actual cards.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--privileged=true \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /model/Qwen3-32B-W8A8:/model/Qwen3-32B-W8A8 \
|
||||
-p 8113:8113 \
|
||||
-it $IMAGE bash
|
||||
```
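
Once inside the container, you can optionally confirm that the mapped NPUs are visible before starting the server:

```bash
# Lists the NPU devices and their health/memory status.
npu-smi info
```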
|
||||
|
||||
### Online Inference on Multi-NPU
|
||||
|
||||
Run the following script to start the vLLM server on Multi-NPU.
|
||||
|
||||
This script is configured to achieve optimal performance under the specific example scenario above, with batch size = 72 on two A3 cards.
|
||||
|
||||
```bash
|
||||
# set the NPU device number
|
||||
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
|
||||
|
||||
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
|
||||
# [Optional] jemalloc
|
||||
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
|
||||
# if os is Ubuntu
|
||||
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
|
||||
# if os is openEuler
|
||||
# export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
|
||||
|
||||
|
||||
# Enable the AIVector core to directly schedule ROCE communication
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
|
||||
vllm serve /model/Qwen3-32B-W8A8 \
|
||||
--served-model-name qwen3 \
|
||||
--trust-remote-code \
|
||||
--async-scheduling \
|
||||
--quantization ascend \
|
||||
--distributed-executor-backend mp \
|
||||
--tensor-parallel-size 4 \
|
||||
--max-model-len 5500 \
|
||||
--max-num-batched-tokens 40960 \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
|
||||
--additional-config '{"pa_shape_list":[48,64,72,80], "weight_prefetch_config":{"enabled":true}}' \
|
||||
--port 8113 \
|
||||
--block-size 128 \
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
#### **Node**
|
||||
|
||||
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
|
||||
|
||||
- If the model is not a quantized model, remove the `--quantization ascend` parameter.
|
||||
|
||||
- **[Optional]** `--additional-config '{"pa_shape_list":[48,64,72,80]}'`: `pa_shape_list` specifies the batch sizes where you want to switch to the PA operator. This is a temporary tuning knob. Currently, the attention operator dispatch defaults to the FIA operator. In some batch-size (concurrency) settings, FIA may have suboptimal performance. By setting `pa_shape_list`, when the runtime batch size matches one of the listed values, vLLM-Ascend will replace FIA with the PA operator to prevent performance degradation. In the future, FIA will be optimized for these scenarios and this parameter will be removed.
|
||||
|
||||
- If ultimate performance is desired, the `cudagraph_capture_sizes` parameter can be set; see [key-optimization-points](./Qwen3-Dense.md#key-optimization-points) and [optimization-highlights](./Qwen3-Dense.md#optimization-highlights). Here is an example for a batch size of 72: `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}'`.
|
||||
|
||||
Once your server is started, you can query the model with input prompts
|
||||
|
||||
```bash
|
||||
curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/json" -d '{
|
||||
"model": "qwen3",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Give me a short introduction to large language models."}
|
||||
],
|
||||
"temperature": 0.6,
|
||||
"top_p": 0.95,
|
||||
"top_k": 20,
|
||||
"max_completion_tokens": 4096
|
||||
}'
|
||||
```
|
||||
|
||||
### Offline Inference on Multi-NPU
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU.
|
||||
|
||||
#### **Note**
|
||||
|
||||
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
|
||||
|
||||
- If the model is not a quantized model, remove the `quantization="ascend"` parameter.
|
||||
|
||||
```python
|
||||
import gc
|
||||
import torch
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.distributed.parallel_state import (destroy_distributed_environment,
|
||||
destroy_model_parallel)
|
||||
|
||||
def clean_up():
|
||||
destroy_model_parallel()
|
||||
destroy_distributed_environment()
|
||||
gc.collect()
|
||||
torch.npu.empty_cache()
|
||||
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
|
||||
llm = LLM(model="/model/Qwen3-32B-W8A8",
|
||||
tensor_parallel_size=4,
|
||||
trust_remote_code=True,
|
||||
distributed_executor_backend="mp",
|
||||
max_model_len=5500,
|
||||
max_num_batched_tokens=5500,
|
||||
quantization="ascend",
|
||||
compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"})
|
||||
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
del llm
|
||||
clean_up()
|
||||
```
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Here is one accuracy evaluation method.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `Qwen3-32B-W8A8` in `vllm-ascend:0.11.0rc2`, for reference only.
|
||||
|
||||
| dataset | version | metric | mode | task name | vllm-api-general-chat |
|
||||
|---------|---------|-----------|------|--------------------------------------|-----------------------|
|
||||
| gsm8k | - | accuracy | gen | gsm8k_gen_0_shot_noncot_chat_prompt | 96.44 |
|
||||
| math500 | - | accuracy | gen | math500_gen_0_shot_cot_chat_prompt | 97.60 |
|
||||
| aime | - | accuracy | gen | aime2024_gen_0_shot_chat_prompt | 76.67 |
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `Qwen3-32B-W8A8` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
#### **Note**
|
||||
|
||||
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
|
||||
|
||||
```shell
|
||||
vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
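
Since the command above passes `--save-result --result-dir ./`, the metrics are also written to a JSON file in the current directory. A minimal sketch for inspecting the newest result file:

```bash
# Pretty-print the most recent benchmark result saved by `vllm bench serve`.
python3 -m json.tool "$(ls -t ./*.json | head -n 1)"
```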
|
||||
|
||||
## Key Optimization Points
|
||||
|
||||
In this section, we will cover the key optimization points that can significantly improve the performance of Qwen Dense models. These techniques are designed to enhance throughput and efficiency across various scenarios.
|
||||
|
||||
### 1. Rope Optimization
|
||||
|
||||
Rope optimization enhances the model's efficiency by modifying the position encoding process. Specifically, it ensures that the cos_sin_cache and the associated index selection operation are only performed during the first layer of the forward pass. For subsequent layers, the position encoding is directly reused, eliminating redundant calculations and significantly speeding up inference in decode phase.
|
||||
|
||||
This optimization is enabled by default and does not require any additional environment variables to be set.
|
||||
|
||||
### 2. AddRMSNormQuant Fusion
|
||||
|
||||
AddRMSNormQuant fusion merges the residual add, RMS normalization, and quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput.
|
||||
|
||||
This optimization is enabled by default and does not require any additional environment variables to be set.
|
||||
|
||||
### 3. FlashComm_v1
|
||||
|
||||
FlashComm_v1 significantly improves performance in large-batch scenarios by decomposing the traditional allreduce collective communication into reduce-scatter and all-gather. This breakdown helps reduce the computation of the RMSNorm token dimensions, leading to more efficient processing. In quantization scenarios, FlashComm_v1 also reduces the communication overhead by decreasing the bit-level data transfer, which further minimizes the end-to-end latency during the prefill phase.
|
||||
|
||||
It is important to note that the decomposition of the allreduce communication into reduce-scatter and all-gather operations only provides benefits in high-concurrency scenarios, where there is no significant communication degradation. In other cases, this decomposition may result in noticeable performance degradation. To mitigate this, the current implementation uses a threshold-based approach, where FlashComm_v1 is only enabled if the actual token count for each inference schedule exceeds the threshold. This ensures that the feature is only activated in scenarios where it improves performance, avoiding potential degradation in lower-concurrency situations.
|
||||
|
||||
This optimization is enabled by setting the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`.
|
||||
|
||||
### 4. Matmul and ReduceScatter Fusion
|
||||
|
||||
Once FlashComm_v1 is enabled, an additional optimization can be applied. This optimization fuses matrix multiplication and ReduceScatter operations, along with tiling optimization. The Matmul computation is treated as one pipeline, while the ReduceScatter and dequant operations are handled in a separate pipeline. This approach significantly reduces communication steps, improves computational efficiency, and allows for better resource utilization, resulting in enhanced throughput, especially in large-scale distributed environments.
|
||||
|
||||
This optimization is automatically enabled once FlashComm_v1 is activated. However, due to an issue with performance degradation in small-concurrency scenarios after this fusion, a threshold-based approach is currently used to mitigate this problem. The optimization is only applied when the token count exceeds the threshold, ensuring that it is not enabled in cases where it could negatively impact performance.
|
||||
|
||||
### 5. Weight Prefetching
|
||||
|
||||
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution.
|
||||
|
||||
In dense model scenarios, the MLP's gate_up_proj and down_proj linear layers often exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as RMSNorm and SiLU, before the MLP. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the MLP computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
|
||||
|
||||
The environment variables `VLLM_ASCEND_ENABLE_PREFETCH_MLP` (used to enable MLP weight prefetch) and `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` / `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` (used to set the weight prefetch size for the MLP gate_up_proj and down_proj) have been deprecated. Please use the following configuration instead: `"weight_prefetch_config": {"enabled": true, "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}}}`. See User Guide -> Feature Guide -> Weight Prefetch Guide for details.
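
A minimal sketch of passing this configuration through `--additional-config`, mirroring the deployment command above (the `prefetch_ratio` values are illustrative and can be tuned):

```bash
# Sketch only: enable weight prefetching with explicit prefetch ratios for the MLP projections.
vllm serve /model/Qwen3-32B-W8A8 --tensor-parallel-size 4 --quantization ascend \
    --additional-config '{"weight_prefetch_config": {"enabled": true, "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}}}}'
```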
|
||||
|
||||
### 6. Zerolike Elimination
|
||||
|
||||
This elimination removes unnecessary operations related to zero-like tensors in Attention forward, improving the efficiency of matrix operations and reducing memory usage.
|
||||
|
||||
This optimization is enabled by default and does not require any additional environment variables to be set.
|
||||
|
||||
### 7. FullGraph Optimization
|
||||
|
||||
ACLGraph offers several key optimizations to improve model execution efficiency. By replaying the entire model execution graph at once, we significantly reduce dispatch latency compared to multiple smaller replays. This approach also stabilizes multi-device performance, as capturing the model as a single static graph mitigates dispatch fluctuations across devices. Additionally, consolidating graph captures frees up streams, allowing for the capture of more graphs and optimizing resource usage, ultimately leading to improved system efficiency and reduced overhead.
|
||||
|
||||
The configuration `compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}` is used when starting the service. This setup is necessary to enable ACLGraph's full decode-only mode.
|
||||
|
||||
### 8. Asynchronous Scheduling
|
||||
|
||||
Asynchronous scheduling is a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.
|
||||
|
||||
This optimization is enabled by setting `--async-scheduling`.
|
||||
|
||||
## Optimization Highlights
|
||||
|
||||
Building on the specific example scenarios outlined earlier, this section highlights the key tuning points that played a crucial role in achieving optimal performance. By focusing on the most impactful adjustments to hyperparameters and optimizations, we’ll emphasize the strategies that can be leveraged to maximize throughput, minimize latency, and ensure efficient resource utilization in various environments. These insights will help guide you in fine-tuning your own configurations for the best possible results.
|
||||
|
||||
### 1. Prefetch Buffer Size
|
||||
|
||||
Setting the right prefetch buffer size is essential for optimizing weight loading and the size of this prefetch buffer is directly related to the time that can be hidden by vector computations. To achieve a near-perfect overlap between the prefetch and computation streams, you can flexibly adjust the buffer size by profiling and observing the degree of overlap at different buffer sizes.
|
||||
|
||||
For example, in the real-world scenario mentioned above, I set the prefetch buffer size for the gate_up_proj and down_proj in the MLP to 18MB. The reason for this is that, at this value, the vector computations of RMSNorm and SiLU can effectively hide the prefetch stream, thereby accelerating the Matmul computations of the two linear layers.
|
||||
|
||||
### 2. Max-num-batched-tokens
|
||||
|
||||
The max-num-batched-tokens parameter determines the maximum number of tokens that can be processed in a single batch. Adjusting this value helps to balance throughput and memory usage. Setting this value too small can negatively impact end-to-end performance, as fewer tokens are processed per batch, potentially leading to inefficiencies. Conversely, setting it too large increases the risk of Out of Memory (OOM) errors due to excessive memory consumption.
|
||||
|
||||
In the above real-world scenario, we not only conducted extensive testing to determine the most cost-effective value, but also took into account the accumulation of decode tokens when enabling chunked prefill. If the value is set too small, a single request may be chunked multiple times, and during the early stages of inference, a batch may contain only a small number of decode tokens. This can result in the end-to-end throughput falling short of expectations.
|
||||
|
||||
### 3. Cudagraph_capture_sizes
|
||||
|
||||
The cudagraph_capture_sizes parameter controls the granularity of graph captures during the inference process. Adjusting this value determines how much of the computation graph is captured at once, which can significantly impact both performance and memory usage.
|
||||
|
||||
If this list is not manually specified, it will be filled with a series of evenly distributed values, which typically ensures good performance. However, if you want to fine-tune it further, manually specifying the values will yield better results. This is because if the batch size falls between two sizes, the framework will automatically pad the token count to the larger size. This often leads to actual performance deviating from the expected or even degrading.
|
||||
|
||||
Therefore, like the above real-world scenario, when adjusting the benchmark request concurrency, we always ensure that the concurrency is actually included in the cudagraph_capture_sizes list. This way, during the decode phase, padding operations are essentially avoided, ensuring the reliability of the experimental data.
|
||||
|
||||
It’s important to note that if you enable FlashComm_v1, the values in this list must be integer multiples of the TP size. Any values that do not meet this condition will be automatically filtered out. Therefore, I recommend incrementally adding concurrency based on the TP size after enabling FlashComm_v1.
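
For example, here is a sketch of a capture-size list aligned to the TP=4 scenario above; any value not divisible by 4 (such as 30) would be filtered out when FlashComm_v1 is enabled. The specific sizes are illustrative.

```bash
# Capture sizes aligned to TP=4; keep the target benchmark concurrency (here 72) in the list.
COMPILATION_CONFIG='{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [4, 8, 24, 48, 64, 72, 80]}'
vllm serve /model/Qwen3-32B-W8A8 --tensor-parallel-size 4 --quantization ascend \
    --compilation-config "$COMPILATION_CONFIG" --port 8113
```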
|
||||
182
docs/source/tutorials/models/Qwen3-Next.md
Normal file
@@ -0,0 +1,182 @@
|
||||
# Qwen3-Next
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen3-Next model is a sparse MoE (Mixture of Experts) model with high sparsity. Compared to the MoE architecture of Qwen3, it has introduced key improvements in aspects such as the hybrid attention mechanism and multi-token prediction mechanism, enhancing the training and inference efficiency of the model under long contexts and large total parameter scales.
|
||||
|
||||
This document will present the core verification steps of the model, including supported features, environment preparation, as well as accuracy and performance evaluation. Qwen3-Next currently relies on Triton Ascend, which is still in the experimental phase; its stability and accuracy behavior may change in subsequent versions, and performance will be continuously optimized.
|
||||
|
||||
The `Qwen3-Next` model is first supported in `vllm-ascend:v0.10.2rc1`.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Weight Preparation
|
||||
|
||||
Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)
|
||||
|
||||
## Deployment
|
||||
|
||||
If the machine environment is an Atlas 800I A3 (64G*16), the deployment approach stays identical.
|
||||
|
||||
### Run docker container
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
# For Atlas A2 machines:
|
||||
# export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
# For Atlas A3 machines:
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
|
||||
docker run --rm \
|
||||
--shm-size=1g \
|
||||
--name vllm-ascend-qwen3 \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
The Qwen3 Next is using [Triton Ascend](https://gitee.com/ascend/triton-ascend) which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvement.
|
||||
|
||||
### Inference
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} Online Inference
|
||||
|
||||
Run the following script to start the vLLM server on multi-NPU:
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
|
||||
"model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Who are you?"}
|
||||
],
|
||||
"temperature": 0.6,
|
||||
"top_p": 0.95,
|
||||
"top_k": 20,
|
||||
"max_completion_tokens": 32
|
||||
}'
|
||||
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Offline Inference
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU:
|
||||
|
||||
```python
|
||||
import gc
|
||||
import torch
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.distributed.parallel_state import (destroy_distributed_environment,
|
||||
destroy_model_parallel)
|
||||
|
||||
def clean_up():
|
||||
destroy_model_parallel()
|
||||
destroy_distributed_environment()
|
||||
gc.collect()
|
||||
torch.npu.empty_cache()
|
||||
|
||||
if __name__ == '__main__':
|
||||
prompts = [
|
||||
"Who are you?",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
|
||||
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
|
||||
tensor_parallel_size=4,
|
||||
enforce_eager=True,
|
||||
distributed_executor_backend="mp",
|
||||
gpu_memory_utilization=0.7,
|
||||
max_model_len=4096)
|
||||
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
del llm
|
||||
clean_up()
|
||||
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am'
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `Qwen3-Next-80B-A3B-Instruct` in `vllm-ascend:0.13.0rc1`, for reference only.
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat |
|
||||
|----- | ----- | ----- | ----- | -----|
|
||||
| gsm8k | - | accuracy | gen | 95.53 |
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `Qwen3-Next` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```shell
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
|
||||
|
||||
The performance result is:
|
||||
|
||||
**Hardware**: A3-752T, 2 nodes
|
||||
|
||||
**Deployment**: TP4 + Full Decode Only
|
||||
|
||||
**Input/Output**: 2k/2k
|
||||
|
||||
**Concurrency**: 32
|
||||
|
||||
**Performance**: 580tps, TPOT 54ms
|
||||
319
docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
Normal file
@@ -0,0 +1,319 @@
|
||||
# Qwen3-Omni-30B-A3B-Thinking
|
||||
|
||||
## Introduction
|
||||
|
||||
Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. Several architectural upgrades improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B contains the thinker component, is equipped with chain-of-thought reasoning, and supports audio, video, and text input with text output.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, accuracy and performance evaluation.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64G × 2). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
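
If you prefer to fetch the weights ahead of time, a minimal sketch using the ModelScope CLI is shown below (the CLI ships with the `modelscope` package installed later in this guide; the target directory matches the path used in the accuracy-evaluation section and can be any shared path):

```bash
pip install modelscope
# Pre-download the weights into the shared cache directory mounted into the container.
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking \
    --local_dir /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking
```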
|
||||
|
||||
### Installation
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} Use docker image
|
||||
|
||||
You can use our official docker image to run Qwen3-Omni-30B-A3B-Thinking directly
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--shm-size=1g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} Build from source
|
||||
|
||||
You can build all from source.
|
||||
|
||||
- Install `vllm-ascend`, refer to [set up using python](../../installation.md#set-up-using-python).
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
Please install the system dependencies:
|
||||
|
||||
```bash
|
||||
pip install qwen_omni_utils modelscope
|
||||
# Used for audio processing.
|
||||
apt-get update && apt-get install ffmpeg -y
|
||||
# Check the installation.
|
||||
ffmpeg -version
|
||||
```
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
#### Offline Inference on Multi-NPU
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU:
|
||||
|
||||
```python
|
||||
import gc
|
||||
import torch
|
||||
import os
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.distributed.parallel_state import (
|
||||
destroy_distributed_environment,
|
||||
destroy_model_parallel
|
||||
)
|
||||
from modelscope import Qwen3OmniMoeProcessor
|
||||
from qwen_omni_utils import process_mm_info
|
||||
|
||||
os.environ["HCCL_BUFFSIZE"] = "1024"
|
||||
|
||||
def clean_up():
|
||||
"""Clean up distributed resources and NPU memory"""
|
||||
destroy_model_parallel()
|
||||
destroy_distributed_environment()
|
||||
gc.collect() # Garbage collection to free up memory
|
||||
torch.npu.empty_cache()
|
||||
|
||||
|
||||
def main():
|
||||
MODEL_PATH = "Qwen3/Qwen3-Omni-30B-A3B-Thinking"
|
||||
llm = LLM(
|
||||
model=MODEL_PATH,
|
||||
tensor_parallel_size=2,
|
||||
enable_expert_parallel=True,
|
||||
distributed_executor_backend="mp",
|
||||
limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3},
|
||||
max_model_len=32768,
|
||||
)
|
||||
|
||||
sampling_params = SamplingParams(
|
||||
temperature=0.6,
|
||||
top_p=0.95,
|
||||
top_k=20,
|
||||
max_tokens=16384,
|
||||
)
|
||||
|
||||
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
|
||||
{"type": "text", "text": "What can you see and hear? Answer in one sentence."}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
text = processor.apply_chat_template(
|
||||
messages,
|
||||
tokenize=False,
|
||||
add_generation_prompt=True
|
||||
)
|
||||
# 'use_audio_in_video = True' requires equal number of audio and video items, including audio from the video.
|
||||
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
|
||||
|
||||
inputs = {
|
||||
"prompt": text,
|
||||
"multi_modal_data": {},
|
||||
"mm_processor_kwargs": {"use_audio_in_video": True}
|
||||
}
|
||||
if images is not None:
|
||||
inputs['multi_modal_data']['image'] = images
|
||||
if videos is not None:
|
||||
inputs['multi_modal_data']['video'] = videos
|
||||
if audios is not None:
|
||||
inputs['multi_modal_data']['audio'] = audios
|
||||
|
||||
outputs = llm.generate([inputs], sampling_params=sampling_params)
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
del llm
|
||||
clean_up()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
```
|
||||
|
||||
#### Online Inference on Multi-NPU
|
||||
|
||||
Run the following script to start the vLLM server on Multi-NPU:
|
||||
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 1, and for 32 GB of memory, tensor-parallel-size should be at least 2.
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_expert_parallel
|
||||
```
|
||||
|
||||
## Functional Verification
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-X POST \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
|
||||
"messages": [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "audio_url",
|
||||
"audio_url": {
|
||||
"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "video_url",
|
||||
"video_url": {
|
||||
"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
|
||||
}
|
||||
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": "Analyze this audio, image, and video together."
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
## Accuracy Evaluation
|
||||
|
||||
Here are accuracy evaluation methods.
|
||||
|
||||
### Using EvalScope
|
||||
|
||||
As an example, take the `gsm8k`, `omni_bench`, and `bbh` datasets as test datasets, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in online mode.
|
||||
|
||||
1. Refer to [Using EvalScope](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope` installation.
|
||||
2. Run `evalscope` to execute the accuracy evaluation.
|
||||
|
||||
```bash
|
||||
evalscope eval \
|
||||
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
|
||||
--api-url http://localhost:8000/v1 \
|
||||
--api-key EMPTY \
|
||||
--eval-type server \
|
||||
--datasets omni_bench, gsm8k, bbh \
|
||||
--dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
|
||||
--eval-batch-size 1 \
|
||||
--generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
|
||||
--limit 100
|
||||
```
|
||||
|
||||
3. After execution, you can get the result. Here is the result of `Qwen3-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1, for reference only.
|
||||
|
||||
```bash
|
||||
+-----------------------------+------------+----------+----------+-------+---------+---------+
|
||||
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
|
||||
+=============================+============+==========+==========+=======+=========+=========+
|
||||
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default | 100 | 0.44 | default |
|
||||
+-----------------------------+------------+----------+----------+-------+---------+---------+
|
||||
| Qwen3-Omni-30B-A3B-Thinking | gsm8k | mean_acc | main | 100 | 0.98 | default |
|
||||
+-----------------------------+------------+----------+----------+-------+---------+---------+
|
||||
| Qwen3-Omni-30B-A3B-Thinking | bbh | mean_acc | OVERALL | 270 | 0.9148 | |
|
||||
+-----------------------------+------------+----------+----------+-------+---------+---------+
|
||||
```
|
||||
|
||||
## Performance
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```bash
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking
|
||||
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-requests --load-format dummy
|
||||
|
||||
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
|
||||
pip install -r vllm-ascend/benchmarks/requirements-bench.txt
|
||||
|
||||
vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After execution, you can get the result. Here is the result of `Qwen3-Omni-30B-A3B-Thinking` on `vllm-ascend:0.13.0rc1`, for reference only.
|
||||
|
||||
```bash
|
||||
============ Serving Benchmark Result ============
|
||||
Successful requests: 200
|
||||
Failed requests: 0
|
||||
Request rate configured (RPS): 1.00
|
||||
Benchmark duration (s): 211.90
|
||||
Total input tokens: 40000
|
||||
Total generated tokens: 25600
|
||||
Request throughput (req/s): 0.94
|
||||
Output token throughput (tok/s): 120.81
|
||||
Peak output token throughput (tok/s): 216.00
|
||||
Peak concurrent requests: 24.00
|
||||
Total token throughput (tok/s): 309.58
|
||||
---------------Time to First Token----------------
|
||||
Mean TTFT (ms): 215.50
|
||||
Median TTFT (ms): 211.51
|
||||
P99 TTFT (ms): 317.18
|
||||
-----Time per Output Token (excl. 1st token)------
|
||||
Mean TPOT (ms): 98.96
|
||||
Median TPOT (ms): 99.19
|
||||
P99 TPOT (ms): 101.52
|
||||
---------------Inter-token Latency----------------
|
||||
Mean ITL (ms): 99.02
|
||||
Median ITL (ms): 96.10
|
||||
P99 ITL (ms): 176.02
|
||||
==================================================
|
||||
```
|
||||
276
docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md
Normal file
@@ -0,0 +1,276 @@
|
||||
# Qwen3-VL-235B-A22B-Instruct
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen-VL (Vision-Language) series from Alibaba Cloud comprises a family of powerful Large Vision-Language Models (LVLMs) designed for comprehensive multimodal understanding. They accept images, text, and bounding boxes as input, and output text and detection boxes, enabling advanced functions like image detection, multi-modal dialogue, and multi-image reasoning.
|
||||
|
||||
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, NPU deployment, accuracy and performance evaluation.
|
||||
|
||||
This tutorial uses the vLLM-Ascend `v0.11.0rc2` version for demonstration, showcasing the `Qwen3-VL-235B-A22B-Instruct` model as an example for multi-NPU deployment.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-VL-235B-A22B-Instruct` (BF16 version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-235B-A22B-Instruct/)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Verify Multi-node Communication (Optional)
|
||||
|
||||
If you want to deploy a multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
|
||||
|
||||
### Installation
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} Use docker image
|
||||
|
||||
For example, using images `quay.io/ascend/vllm-ascend:v0.11.0rc2` (for Atlas 800 A2) and `quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` (for Atlas 800 A3).
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
|
||||
# Update the vllm-ascend image according to your environment.
|
||||
# Note you should download the weight to /root/.cache in advance.
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--privileged=true \
|
||||
--shm-size=500g \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} Build from source
|
||||
|
||||
You can build all from source.
|
||||
|
||||
- Install `vllm-ascend`, refer to [set up using python](../../installation.md#set-up-using-python).
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
If you want to deploy a multi-node environment, you need to set up the environment on each node.
|
||||
|
||||
## Deployment
|
||||
|
||||
### Multi-node Deployment with MP (Recommended)
|
||||
|
||||
Assume you have Atlas 800 A3 (64G × 16) nodes (or 2 Atlas 800 A2 nodes) and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
|
||||
|
||||
Node 0
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
# These values can be obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--data-parallel-size 2 \
|
||||
--api-server-count 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-address $local_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--seed 1024 \
|
||||
--served-model-name qwen3 \
|
||||
--tensor-parallel-size 8 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 262144 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--async-scheduling \
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
Node 1
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
# To reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
# These values can be obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip of the current node
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
|
||||
node0_ip="xxxx"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=1
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--headless \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-start-rank 1 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--seed 1024 \
|
||||
--tensor-parallel-size 8 \
|
||||
--served-model-name qwen3 \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 262144 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--async-scheduling \
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
The parameters are explained as follows:
|
||||
|
||||
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
|
||||
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
|
||||
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
|
||||
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
|
||||
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
|
||||
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
|
||||
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`. A worked example of this calculation is shown after this list.
|
||||
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
|
||||
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
|
||||
- `--quantization ascend` indicates that quantization is used. To disable quantization, remove this option.
|
||||
- `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are "cudagraph_mode" and "cudagraph_capture_sizes", which have the following meanings:
|
||||
"cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
|
||||
- "cudagraph_capture_sizes": represents different levels of graph modes. The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. In the graph mode, the input for graphs at different levels is fixed, and inputs between levels are automatically padded to the next level. Currently, the default setting is recommended. Only in some scenarios is it necessary to set this separately to achieve optimal performance.
|
||||
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.
|
||||
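As a rough illustration of the `--gpu-memory-utilization` formula and the `--max-num-seqs` concurrency rule above, the following sketch plugs in hypothetical numbers (64 GB of HBM per card and an assumed 45 GB peak during the profile run); the real peak is reported in the vLLM startup logs and depends on the model and configuration.

```python
# Hypothetical numbers for illustration only; read the actual peak from the vLLM startup logs.
hbm_per_card_gb = 64            # HBM capacity of one NPU card (assumption)
gpu_memory_utilization = 0.9    # value passed via --gpu-memory-utilization
profile_run_peak_gb = 45        # peak memory observed during the profile run (assumption)

kv_cache_budget_gb = gpu_memory_utilization * hbm_per_card_gb - profile_run_peak_gb
print(f"Available KV cache per card: {kv_cache_budget_gb:.1f} GB")  # 12.6 GB

# Concurrency rule of thumb: --max-num-seqs * --data-parallel-size should cover the total concurrency.
max_num_seqs = 16
data_parallel_size = 2
print(f"Recommended upper bound on total concurrency: {max_num_seqs * data_parallel_size}")  # 32
```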
|
||||
If the service starts successfully, the following information will be displayed on node 0:
|
||||
|
||||
```shell
|
||||
INFO: Started server process [44610]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
INFO: Started server process [44611]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
### Multi-node Deployment with Ray
|
||||
|
||||
- Refer to [Ray Distributed (Qwen/Qwen3-235B-A22B)](../features/ray.md).
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
- Refer to [Prefill-Decode Disaggregation Mooncake Verification](../features/pd_disaggregation_mooncake_multi_node.md).
|
||||
|
||||
## Functional Verification
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```shell
|
||||
curl http://<node0_ip>:<port>/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "qwen3",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": [
|
||||
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
|
||||
{"type": "text", "text": "What is the text in the illustrate?"}
|
||||
]}
|
||||
]
|
||||
}'
|
||||
```
|
||||
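Before sending chat requests, you can confirm the name under which the model is exposed (here `qwen3`, set via `--served-model-name`). The following is a minimal sketch against the OpenAI-compatible `/v1/models` endpoint, assuming it is run on node 0; otherwise replace `localhost` with the node 0 IP and the port used above.

```python
import requests

# List the models served by the OpenAI-compatible endpoint; the id should be "qwen3".
resp = requests.get("http://localhost:8000/v1/models")
print([m["id"] for m in resp.json()["data"]])
```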
|
||||
## Accuracy Evaluation
|
||||
|
||||
Here is the accuracy evaluation method.
|
||||
|
||||
### Using AISBench
|
||||
|
||||
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
|
||||
|
||||
2. After execution, you can get the result. Here is the result of `Qwen3-VL-235B-A22B-Instruct` on `vllm-ascend:0.11.0rc2`, for reference only.
|
||||
|
||||
| dataset | version | metric | mode | vllm-api-general-chat |
|
||||
| -------- | ------- | -------- | ---- | --------------------- |
|
||||
| aime2024 | - | accuracy | gen | 93 |
|
||||
|
||||
## Performance
|
||||
|
||||
### Using AISBench
|
||||
|
||||
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
|
||||
|
||||
### Using vLLM Benchmark
|
||||
|
||||
Run performance evaluation of `Qwen3-VL-235B-A22B-Instruct` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
- `throughput`: Benchmark offline inference throughput.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```shell
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result.
|
||||
207
docs/source/tutorials/models/Qwen3-VL-30B-A3B-Instruct.md
Normal file
@@ -0,0 +1,207 @@
|
||||
# Qwen3-VL-30B-A3B-Instruct
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen-VL (Vision-Language) series from Alibaba Cloud comprises a family of powerful Large Vision-Language Models (LVLMs) designed for comprehensive multimodal understanding. They accept images, text, and bounding boxes as input, and output text and detection boxes, enabling advanced functions like image detection, multi-modal dialogue, and multi-image reasoning.
|
||||
|
||||
This document will show the main verification steps of the `Qwen3-VL-30B-A3B-Instruct`.
|
||||
|
||||
## Supported Features
|
||||
|
||||
- Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
- Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Prepare Model Weights
|
||||
|
||||
Running this model requires 1 Atlas 800I A2 (64G × 8) node or 1 Atlas 800 A3 (64G × 16) node.
|
||||
|
||||
Download the model weight from the [ModelScope Website](https://modelscope.cn/models/Qwen/Qwen3-VL-30B-A3B-Instruct) or with the following command:
|
||||
|
||||
```bash
|
||||
pip install modelscope
|
||||
modelscope download --model Qwen/Qwen3-VL-30B-A3B-Instruct
|
||||
```
|
||||
|
||||
It is recommended to download the model weights to the shared directory of multiple nodes, such as `/root/.cache/`.
|
||||
|
||||
### Installation
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--shm-size=1g \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-v /data:/data \
|
||||
-v <path/to/your/media>:/media \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
:::{note}
|
||||
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
|
||||
:::
|
||||
|
||||
## Deployment
|
||||
|
||||
### Online Serving
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: install
|
||||
|
||||
::::{tab-item} Image Inputs
|
||||
:sync: multi
|
||||
|
||||
Run the following command inside the container to start the vLLM server on multi-NPU:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
|
||||
--tensor-parallel-size 2 \
|
||||
--enable-expert-parallel \
|
||||
--limit-mm-per-prompt.video 0 \
|
||||
--max-model-len 128000
|
||||
```
|
||||
|
||||
:::{note}
|
||||
vllm-ascend supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate NPUs for better throughput.
|
||||
|
||||
It's highly recommended to specify `--limit-mm-per-prompt.video 0` if your inference server will only process image inputs since enabling video inputs consumes more memory reserved for long video embeddings.
|
||||
|
||||
You can set `--max-model-len` to preserve memory. By default the model's context length is 262K, but `--max-model-len 128000` is good for most scenarios.
|
||||
:::
|
||||
|
||||
If your service starts successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
INFO: Started server process [746077]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": [
|
||||
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
|
||||
{"type": "text", "text": "What is the text in the illustrate?"}
|
||||
]}
|
||||
],
|
||||
"max_completion_tokens": 100
|
||||
}'
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"chatcmpl-974cb7a7a746a13e","object":"chat.completion","created":1766569357,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-30B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen\".","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":122,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
INFO 12-24 09:42:37 [acl_graph.py:187] Replaying aclgraph
|
||||
INFO: 127.0.0.1:54946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
|
||||
INFO 12-24 09:42:41 [loggers.py:257] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
|
||||
```
|
||||
|
||||
::::
|
||||
::::{tab-item} Video Inputs
|
||||
:sync: multi
|
||||
|
||||
Run the following command inside the container to start the vLLM server on multi-NPU:
|
||||
|
||||
```shell
|
||||
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
|
||||
--tensor-parallel-size 2 \
|
||||
--enable-expert-parallel \
|
||||
--max-model-len 128000 \
|
||||
--allowed-local-media-path /media
|
||||
```
|
||||
|
||||
:::{note}
|
||||
vllm-ascend supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate NPUs for better throughput.
|
||||
|
||||
You can set `--max-model-len` to preserve memory. By default the model's context length is 262K, but `--max-model-len 128000` is good for most scenarios.
|
||||
|
||||
Set `--allowed-local-media-path /media` to use a local video located at `/media`, since downloading the video during serving can be extremely slow due to network issues.
|
||||
:::
|
||||
|
||||
If your service starts successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
INFO: Started server process [746077]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": [
|
||||
{"type": "video_url", "video_url": {"url": "file:///media/test.mp4"}},
|
||||
{"type": "text", "text": "What is in this video?"}
|
||||
]}
|
||||
],
|
||||
"max_completion_tokens": 100
|
||||
}'
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"chatcmpl-a03c6d6e40267738","object":"chat.completion","created":1766569752,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-30B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The video shows a standard test pattern, which is a series of vertical bars in various colors (red, green, blue, yellow, magenta, cyan, and white) arranged in a circular pattern on a black background. This is a common visual used in television broadcasting to calibrate and test equipment. The pattern remains static throughout the video.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":196,"total_tokens":266,"completion_tokens":70,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
INFO: 127.0.0.1:49314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
|
||||
INFO 12-24 09:49:22 [loggers.py:257] Engine 000: Avg prompt throughput: 19.6 tokens/s, Avg generation throughput: 7.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
### Offline Inference
|
||||
|
||||
The usage of offline inference with `Qwen3-VL-30B-A3B-Instruct` is the same as that of `Qwen3-VL-8B-Instruct`; find more details at [link](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen-VL-Dense.html#offline-inference). A minimal sketch is shown below.
|
||||
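For reference, here is a minimal offline sketch under the same settings as the online example above (tensor parallel size 2 with expert parallelism); it only illustrates the shape of the API, and the image URL is the same demo asset used earlier.

```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    # Mirror the online serving configuration from this tutorial.
    llm = LLM(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct",
        tensor_parallel_size=2,
        enable_expert_parallel=True,
        max_model_len=128000,
    )

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ]

    outputs = llm.chat(messages, SamplingParams(max_tokens=100))
    print(outputs[0].outputs[0].text)
```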
127
docs/source/tutorials/models/Qwen3-VL-Embedding.md
Normal file
@@ -0,0 +1,127 @@
|
||||
# Qwen3-VL-Embedding
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-VL-Embedding-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-8B)
|
||||
- `Qwen3-VL-Embedding-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-2B)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `Qwen3-VL-Embedding` series models.
|
||||
|
||||
- Start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
If you don't want to use the docker image as above, you can also build all from source:
|
||||
|
||||
- Install `vllm-ascend` from source, refer to [installation](../../installation.md).
|
||||
|
||||
## Deployment
|
||||
|
||||
Using the Qwen3-VL-Embedding-8B model as an example:
|
||||
|
||||
### Online Inference
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://127.0.0.1:8000/v1/embeddings -H "Content-Type: application/json" -d '{
|
||||
"input": [
|
||||
"The capital of China is Beijing.",
|
||||
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
### Offline Inference
|
||||
|
||||
```python
|
||||
import torch
|
||||
from vllm import LLM
|
||||
|
||||
def get_detailed_instruct(task_description: str, query: str) -> str:
|
||||
return f'Instruct: {task_description}\nQuery: {query}'
|
||||
|
||||
|
||||
if __name__=="__main__":
|
||||
# Each query must come with a one-sentence instruction that describes the task
|
||||
task = 'Given a web search query, retrieve relevant passages that answer the query'
|
||||
|
||||
queries = [
|
||||
get_detailed_instruct(task, 'What is the capital of China?'),
|
||||
get_detailed_instruct(task, 'Explain gravity')
|
||||
]
|
||||
# No need to add instruction for retrieval documents
|
||||
documents = [
|
||||
"The capital of China is Beijing.",
|
||||
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
|
||||
]
|
||||
input_texts = queries + documents
|
||||
|
||||
model = LLM(model="Qwen/Qwen3-VL-Embedding-8B",
|
||||
runner="pooling",
|
||||
distributed_executor_backend="mp")
|
||||
|
||||
outputs = model.embed(input_texts)
|
||||
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
|
||||
scores = (embeddings[:2] @ embeddings[2:].T)
|
||||
print(scores.tolist())
|
||||
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 192.47it/s]
|
||||
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=2425173) (Worker pid=2425180) INFO 01-09 00:44:40 [acl_graph.py:194] Replaying aclgraph
|
||||
(EngineCore_DP0 pid=2425173) (Worker pid=2425180) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
|
||||
Processed prompts: 100%|████████████████████████████████████| 4/4 [00:00<00:00, 21.34it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
|
||||
[[0.9279120564460754, 0.32747742533683777], [0.4124627113342285, 0.7425257563591003]]
|
||||
```
|
||||
|
||||
For more examples, refer to the vLLM official examples:
|
||||
|
||||
- [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py)
|
||||
- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py)
|
||||
|
||||
## Performance
|
||||
|
||||
Run a performance evaluation of `Qwen3-VL-Embedding-8B` as an example.
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```bash
|
||||
vllm bench serve --model Qwen/Qwen3-VL-Embedding-8B --backend openai-embeddings --dataset-name random --endpoint /v1/embeddings --random-input 200 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result. With this tutorial, the performance result is:
|
||||
|
||||
```bash
|
||||
============ Serving Benchmark Result ============
|
||||
Successful requests: 1000
|
||||
Failed requests: 0
|
||||
Benchmark duration (s): 19.53
|
||||
Total input tokens: 200000
|
||||
Request throughput (req/s): 51.20
|
||||
Total token throughput (tok/s): 10240.42
|
||||
----------------End-to-end Latency----------------
|
||||
Mean E2EL (ms): 10360.53
|
||||
Median E2EL (ms): 10354.37
|
||||
P99 E2EL (ms): 19423.21
|
||||
==================================================
|
||||
```
|
||||
243
docs/source/tutorials/models/Qwen3-VL-Reranker.md
Normal file
@@ -0,0 +1,243 @@
|
||||
# Qwen3-VL-Reranker
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-VL-Reranker-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-8B)
|
||||
- `Qwen3-VL-Reranker-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-2B)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `Qwen3-VL-Reranker` series models.
|
||||
|
||||
- Start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
If you don't want to use the docker image as above, you can also build all from source:
|
||||
|
||||
- Install `vllm-ascend` from source, refer to [installation](../../installation.md).
|
||||
|
||||
## Deployment
|
||||
|
||||
Using the Qwen3-VL-Reranker-8B model as an example:
|
||||
|
||||
### Chat Template
|
||||
|
||||
The Qwen3-VL-Reranker model requires a specific chat template for proper formatting. Create a file named `qwen3_vl_reranker.jinja` with the following content:
|
||||
|
||||
```jinja
|
||||
<|im_start|>system
|
||||
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
|
||||
<|im_start|>user
|
||||
<Instruct>: {{
|
||||
messages
|
||||
| selectattr("role", "eq", "system")
|
||||
| map(attribute="content")
|
||||
| first
|
||||
| default("Given a search query, retrieve relevant candidates that answer the query.")
|
||||
}}<Query>:{{
|
||||
messages
|
||||
| selectattr("role", "eq", "query")
|
||||
| map(attribute="content")
|
||||
| first
|
||||
}}
|
||||
<Document>:{{
|
||||
messages
|
||||
| selectattr("role", "eq", "document")
|
||||
| map(attribute="content")
|
||||
| first
|
||||
}}<|im_end|>
|
||||
<|im_start|>assistant
|
||||
|
||||
```
|
||||
|
||||
Save this file to a location of your choice (e.g., `./qwen3_vl_reranker.jinja`).
|
||||
|
||||
### Online Inference
|
||||
|
||||
Start the server with the following command:
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-VL-Reranker-8B \
|
||||
--runner pooling \
|
||||
--max-model-len 4096 \
|
||||
--hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
|
||||
--chat-template ./qwen3_vl_reranker.jinja
|
||||
```
|
||||
|
||||
Once your server is started, you can send requests with the following examples.
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
url = "http://127.0.0.1:8000/v1/rerank"
|
||||
|
||||
# Please use the query_template and document_template to format the query and
|
||||
# document for better reranker results.
|
||||
|
||||
prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
|
||||
suffix = "<|im_end|>\n<|im_start|>assistant\n"
|
||||
|
||||
query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
|
||||
document_template = "<Document>: {doc}{suffix}"
|
||||
|
||||
instruction = (
|
||||
"Given a search query, retrieve relevant candidates that answer the query."
|
||||
)
|
||||
|
||||
query = "What is the capital of China?"
|
||||
|
||||
documents = [
|
||||
"The capital of China is Beijing.",
|
||||
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
|
||||
]
|
||||
|
||||
documents = [
|
||||
document_template.format(doc=doc, suffix=suffix) for doc in documents
|
||||
]
|
||||
|
||||
response = requests.post(url,
|
||||
json={
|
||||
"query": query_template.format(prefix=prefix, instruction=instruction, query=query),
|
||||
"documents": documents,
|
||||
}).json()
|
||||
|
||||
print(response)
|
||||
```
|
||||
|
||||
If you run this script successfully, you will see a list of scores printed to the console, similar to this:
|
||||
|
||||
```bash
|
||||
{'id': 'rerank-ac3495afa8e12404', 'model': 'Qwen/Qwen3-VL-Reranker-8B', 'usage': {'prompt_tokens': 315, 'total_tokens': 315}, 'results': [{'index': 0, 'document': {'text': '<Document>: The capital of China is Beijing.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.6368980407714844}, {'index': 1, 'document': {'text': '<Document>: Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.20816077291965485}]}
|
||||
```
|
||||
|
||||
### Offline Inference
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
model_name = "Qwen/Qwen3-VL-Reranker-8B"
|
||||
|
||||
# What is the difference between the official original version and one
|
||||
# that has been converted into a sequence classification model?
|
||||
# Qwen3-Reranker is a language model that performs reranking by using the
# logits of the "no" and "yes" tokens.
# It needs to compute the logits of 151669 tokens, making this method extremely
# inefficient, not to mention incompatible with the vllm score API.
|
||||
# A method for converting the original model into a sequence classification
|
||||
# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
|
||||
# Models converted offline using this method can not only be more efficient
|
||||
# and support the vllm score API, but also make the init parameters more
|
||||
# concise, for example.
|
||||
# model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling")
|
||||
|
||||
# If you want to load the official original version, the init parameters are
|
||||
# as follows.
|
||||
|
||||
model = LLM(
|
||||
model=model_name,
|
||||
runner="pooling",
|
||||
hf_overrides={
|
||||
# Manually route to sequence classification architecture
|
||||
# This tells vLLM to use Qwen3VLForSequenceClassification instead of
|
||||
# the default Qwen3VLForConditionalGeneration
|
||||
"architectures": ["Qwen3VLForSequenceClassification"],
|
||||
# Specify which token logits to extract from the language model head
|
||||
# The original reranker uses "no" and "yes" token logits for scoring
|
||||
"classifier_from_token": ["no", "yes"],
|
||||
# Enable special handling for original Qwen3-Reranker models
|
||||
# This flag triggers conversion logic that transforms the two token
|
||||
# vectors into a single classification vector
|
||||
"is_original_qwen3_reranker": True,
|
||||
},
|
||||
)
|
||||
|
||||
# Why do we need hf_overrides for the official original version:
|
||||
# vllm converts it to Qwen3VLForSequenceClassification when loaded for
|
||||
# better performance.
|
||||
# - First, we need to use `"architectures": ["Qwen3VLForSequenceClassification"],`
|
||||
# to manually route to Qwen3VLForSequenceClassification.
|
||||
# - Then, we will extract the vector corresponding to classifier_from_token
|
||||
# from lm_head using `"classifier_from_token": ["no", "yes"]`.
|
||||
# - Third, we will convert these two vectors into one vector. The use of
|
||||
# conversion logic is controlled by `"is_original_qwen3_reranker": True`.
|
||||
|
||||
# Please use the query_template and document_template to format the query and
|
||||
# document for better reranker results.
|
||||
|
||||
prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
|
||||
suffix = "<|im_end|>\n<|im_start|>assistant\n"
|
||||
|
||||
query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
|
||||
document_template = "<Document>: {doc}{suffix}"
|
||||
|
||||
if __name__ == "__main__":
|
||||
instruction = (
|
||||
"Given a search query, retrieve relevant candidates that answer the query."
|
||||
)
|
||||
|
||||
query = "What is the capital of China?"
|
||||
|
||||
documents = [
|
||||
"The capital of China is Beijing.",
|
||||
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
|
||||
]
|
||||
|
||||
documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]
|
||||
|
||||
outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)
|
||||
|
||||
print([output.outputs.score for output in outputs])
|
||||
```
|
||||
|
||||
If you run this script successfully, you will see a list of scores printed to the console, similar to this:
|
||||
|
||||
```bash
|
||||
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2409.83it/s]
|
||||
Processed prompts: 0%| | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=682882) INFO 01-20 04:38:46 [acl_graph.py:188] Replaying aclgraph
|
||||
Processed prompts: 100%|████████████████████████████████████| 2/2 [00:00<00:00, 9.44it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
|
||||
[0.7235596776008606, 0.0002742875076364726]
|
||||
```
|
||||
|
||||
For more examples, refer to the vLLM official examples:
|
||||
|
||||
- [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
|
||||
- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_online.py)
|
||||
|
||||
## Performance
|
||||
|
||||
Run a performance evaluation of `Qwen3-VL-Reranker-8B` as an example.
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```bash
|
||||
vllm bench serve --model Qwen/Qwen3-VL-Reranker-8B --backend vllm-rerank --dataset-name random-rerank --endpoint /v1/rerank --random-input 200 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result. With this tutorial, the performance result is:
|
||||
|
||||
```bash
|
||||
============ Serving Benchmark Result ============
|
||||
Successful requests: 1000
|
||||
Failed requests: 0
|
||||
Benchmark duration (s): 13.70
|
||||
Total input tokens: 265122
|
||||
Request throughput (req/s): 72.99
|
||||
Total token throughput (tok/s): 19351.23
|
||||
----------------End-to-end Latency----------------
|
||||
Mean E2EL (ms): 7474.64
|
||||
Median E2EL (ms): 7528.72
|
||||
P99 E2EL (ms): 13523.32
|
||||
==================================================
|
||||
```
|
||||
122
docs/source/tutorials/models/Qwen3_embedding.md
Normal file
@@ -0,0 +1,122 @@
|
||||
# Qwen3-Embedding
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-Embedding-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-8B)
|
||||
- `Qwen3-Embedding-4B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-4B)
|
||||
- `Qwen3-Embedding-0.6B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `Qwen3-Embedding` series models.
|
||||
|
||||
- Start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
If you don't want to use the docker image as above, you can also build all from source:
|
||||
|
||||
- Install `vllm-ascend` from source, refer to [installation](../../installation.md).
|
||||
|
||||
## Deployment
|
||||
|
||||
Using the Qwen3-Embedding-8B model as an example:
|
||||
|
||||
### Online Inference
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-Embedding-8B --runner pooling --host 127.0.0.1 --port 8888
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts.
|
||||
|
||||
```bash
|
||||
curl http://127.0.0.1:8888/v1/embeddings -H "Content-Type: application/json" -d '{
|
||||
"input": [
|
||||
"The capital of China is Beijing.",
|
||||
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
|
||||
]
|
||||
}'
|
||||
```
|
||||
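You can issue the same request from Python through the OpenAI-compatible API. This is a minimal sketch, assuming the `openai` package is installed and the server runs on the host and port used above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8888/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=[
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other.",
    ],
)
# Each item in response.data carries one embedding vector.
print([len(item.embedding) for item in response.data])
```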
|
||||
### Offline Inference
|
||||
|
||||
```python
|
||||
import torch
|
||||
import vllm
|
||||
from vllm import LLM
|
||||
|
||||
def get_detailed_instruct(task_description: str, query: str) -> str:
|
||||
return f'Instruct: {task_description}\nQuery:{query}'
|
||||
|
||||
|
||||
if __name__=="__main__":
|
||||
# Each query must come with a one-sentence instruction that describes the task
|
||||
task = 'Given a web search query, retrieve relevant passages that answer the query'
|
||||
|
||||
queries = [
|
||||
get_detailed_instruct(task, 'What is the capital of China?'),
|
||||
get_detailed_instruct(task, 'Explain gravity')
|
||||
]
|
||||
# No need to add instruction for retrieval documents
|
||||
documents = [
|
||||
"The capital of China is Beijing.",
|
||||
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
|
||||
]
|
||||
input_texts = queries + documents
|
||||
|
||||
model = LLM(model="Qwen/Qwen3-Embedding-8B",
|
||||
distributed_executor_backend="mp")
|
||||
|
||||
outputs = model.embed(input_texts)
|
||||
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
|
||||
scores = (embeddings[:2] @ embeddings[2:].T)
|
||||
print(scores.tolist())
|
||||
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
|
||||
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
|
||||
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
|
||||
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
|
||||
```
|
||||
|
||||
## Performance
|
||||
|
||||
Run a performance evaluation of `Qwen3-Embedding-8B` as an example.
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
Take the `serve` as an example. Run the code as follows.
|
||||
|
||||
```bash
|
||||
vllm bench serve --model Qwen3-Embedding-8B --backend openai-embeddings --dataset-name random --host 127.0.0.1 --port 8888 --endpoint /v1/embeddings --tokenizer /root/.cache/Qwen3-Embedding-8B --random-input 200 --save-result --result-dir ./
|
||||
```
|
||||
|
||||
After a few minutes, you can get the performance evaluation result. With this tutorial, the performance result is:
|
||||
|
||||
```bash
|
||||
============ Serving Benchmark Result ============
|
||||
Successful requests: 1000
|
||||
Failed requests: 0
|
||||
Benchmark duration (s): 6.78
|
||||
Total input tokens: 108032
|
||||
Request throughput (req/s): 31.11
|
||||
Total Token throughput (tok/s): 15929.35
|
||||
----------------End-to-end Latency----------------
|
||||
Mean E2EL (ms): 4422.79
|
||||
Median E2EL (ms): 4412.58
|
||||
P99 E2EL (ms): 6294.52
|
||||
==================================================
|
||||
```
|
||||
192
docs/source/tutorials/models/Qwen3_reranker.md
Normal file
@@ -0,0 +1,192 @@
|
||||
# Qwen3-Reranker
|
||||
|
||||
## Introduction
|
||||
|
||||
The Qwen3 Reranker model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
|
||||
|
||||
## Supported Features
|
||||
|
||||
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `Qwen3-Reranker-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Reranker-8B)
|
||||
- `Qwen3-Reranker-4B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Reranker-4B)
|
||||
- `Qwen3-Reranker-0.6B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Reranker-0.6B)
|
||||
|
||||
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
|
||||
|
||||
### Installation
|
||||
|
||||
You can use our official docker image to run `Qwen3-Reranker` series models.
|
||||
|
||||
- Start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
|
||||
|
||||
If you don't want to use the docker image as above, you can also build all from source:
|
||||
|
||||
- Install `vllm-ascend` from source, refer to [installation](../../installation.md).
|
||||
|
||||
## Deployment
|
||||
|
||||
Using the Qwen3-Reranker-8B model as an example:
|
||||
|
||||
### Online Inference
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 8888 --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
|
||||
```
|
||||
|
||||
Once your server is started, you can send requests with the following examples.
|
||||
|
||||
### requests demo + formatting query & document
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
url = "http://127.0.0.1:8888/v1/rerank"
|
||||
|
||||
# Please use the query_template and document_template to format the query and
|
||||
# document for better reranker results.
|
||||
|
||||
prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
|
||||
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
|
||||
|
||||
query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
|
||||
document_template = "<Document>: {doc}{suffix}"
|
||||
|
||||
instruction = (
|
||||
"Given a web search query, retrieve relevant passages that answer the query"
|
||||
)
|
||||
|
||||
query = "What is the capital of China?"
|
||||
|
||||
documents = [
|
||||
"The capital of China is Beijing.",
|
||||
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
|
||||
]
|
||||
|
||||
documents = [
|
||||
document_template.format(doc=doc, suffix=suffix) for doc in documents
|
||||
]
|
||||
|
||||
response = requests.post(url,
|
||||
json={
|
||||
"query": query_template.format(prefix=prefix, instruction=instruction, query=query),
|
||||
"documents": documents,
|
||||
}).json()
|
||||
|
||||
print(response)
|
||||
```
|
||||
|
||||
If you run this script successfully, you will see a list of scores printed to the console, similar to this:
|
||||
|
||||
```bash
|
||||
{'id': 'rerank-e856a17c954047a3a40f73d5ec43dbc6', 'model': 'Qwen/Qwen3-Reranker-8B', 'usage': {'total_tokens': 193}, 'results': [{'index': 0, 'document': {'text': '<Document>: The capital of China is Beijing.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n', 'multi_modal': None}, 'relevance_score': 0.9944348335266113}, {'index': 1, 'document': {'text': '<Document>: Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n', 'multi_modal': None}, 'relevance_score': 6.700084327349032e-07}]}
|
||||
```
|
||||
|
||||
### Offline Inference

```python
from vllm import LLM

model_name = "Qwen/Qwen3-Reranker-8B"

# What is the difference between the official original version and one that
# has been converted into a sequence classification model?
# Qwen3-Reranker is a language model that performs reranking by using the
# logits of the "no" and "yes" tokens.
# It needs to compute the logits over all 151669 vocabulary tokens, which makes
# this method extremely inefficient, not to mention incompatible with the vllm
# score API.
# A method for converting the original model into a sequence classification
# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline using this method are not only more efficient and
# compatible with the vllm score API, but also make the init parameters more
# concise, for example:
# model = LLM(model="Qwen/Qwen3-Reranker-8B", task="score")

# If you want to load the official original version, the init parameters are
# as follows.

model = LLM(
    model=model_name,
    task="score",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)

# Why we need hf_overrides for the official original version:
# vllm converts it to Qwen3ForSequenceClassification when loaded for
# better performance.
# - First, we use `"architectures": ["Qwen3ForSequenceClassification"]`
#   to manually route to Qwen3ForSequenceClassification.
# - Then, we extract the vectors corresponding to classifier_from_token
#   from lm_head using `"classifier_from_token": ["no", "yes"]`.
# - Third, we convert these two vectors into one vector. This conversion
#   logic is enabled by `"is_original_qwen3_reranker": True`.

# Please use the query_template and document_template to format the query and
# document for better reranker results.

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

if __name__ == "__main__":
    instruction = (
        "Given a web search query, retrieve relevant passages that answer the query"
    )

    query = "What is the capital of China?"

    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    ]

    documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]

    outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)

    print([output.outputs.score for output in outputs])
```

If you run this script successfully, you will see a list of scores printed to the console, similar to this:

```bash
[0.9943699240684509, 6.876250040477316e-07]
```

## Performance

Run a performance benchmark with `Qwen3-Reranker-8B` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.

Take the `serve` benchmark as an example. Run the command as follows.

```bash
vllm bench serve --model Qwen3-Reranker-8B --backend vllm-rerank --dataset-name random-rerank --host 127.0.0.1 --port 8888 --endpoint /v1/rerank --tokenizer /root/.cache/Qwen3-Reranker-8B --random-input 200 --save-result --result-dir ./
```

After several minutes, you will get the performance evaluation result. With this tutorial, the result is:

```bash
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  6.78
Total input tokens:                      108032
Request throughput (req/s):              31.11
Total Token throughput (tok/s):          15929.35
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4422.79
Median E2EL (ms):                        4412.58
P99 E2EL (ms):                           6294.52
==================================================
```

31
docs/source/tutorials/models/index.md
Normal file
@@ -0,0 +1,31 @@

# Model Tutorials

This section provides tutorials for different models of vLLM Ascend.

:::{toctree}
:caption: Model Tutorials
:maxdepth: 1
Qwen2.5-Omni.md
Qwen2.5-7B.md
Qwen3-Dense.md
Qwen-VL-Dense.md
Qwen3-30B-A3B.md
Qwen3-235B-A22B.md
Qwen3-VL-30B-A3B-Instruct.md
Qwen3-VL-235B-A22B-Instruct.md
Qwen3-Coder-30B-A3B.md
Qwen3_embedding.md
Qwen3-VL-Embedding.md
Qwen3_reranker.md
Qwen3-VL-Reranker.md
Qwen3-8B-W4A8.md
Qwen3-32B-W4A4.md
Qwen3-Next.md
Qwen3-Omni-30B-A3B-Thinking.md
DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
GLM4.x.md
Kimi-K2-Thinking.md
PaddleOCR-VL.md
:::