v0.10.1rc1
18
docs/source/tutorials/index.md
Normal file
@@ -0,0 +1,18 @@
|
||||
# Tutorials
|
||||
|
||||
:::{toctree}
|
||||
:caption: Deployment
|
||||
:maxdepth: 1
|
||||
single_npu
|
||||
single_npu_multimodal
|
||||
single_npu_audio
|
||||
single_npu_qwen3_embedding
|
||||
single_npu_qwen3_quantization
|
||||
multi_npu
|
||||
multi_npu_moge
|
||||
multi_npu_qwen3_moe
|
||||
multi_npu_quantization
|
||||
single_node_300i
|
||||
multi_node
|
||||
multi_node_kimi
|
||||
:::
|
||||
207
docs/source/tutorials/multi_node.md
Normal file
@@ -0,0 +1,207 @@
|
||||
# Multi-Node-DP (DeepSeek)
|
||||
|
||||
## Getting Started
|
||||
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
|
||||
|
||||
Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
|
||||
|
||||
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
|
||||
|
||||
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
|
||||
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
|
||||
|
||||
With this division, attention layers are replicated across Data Parallel (DP) ranks and process different batches independently, while expert layers are partitioned (sharded) across a group of DP x TP devices using Expert or Tensor Parallelism, which improves hardware utilization.
|
||||
|
||||
In these cases the data parallel ranks are not completely independent: forward passes must be aligned, and expert layers across all ranks must synchronize during every forward pass, even if there are fewer requests to process than DP ranks.
|
||||
|
||||
For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don’t currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
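
The sizing rule above can be checked with a little arithmetic. The following is a minimal sketch rather than anything vLLM provides: the `ParallelLayout` class and the `parallel_layout` helper are hypothetical names, and the only things assumed are what the text states, namely that each DP rank owns TP worker processes and that expert layers form a group of size DP x TP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Illustrative summary of a DP + TP deployment (hypothetical helper)."""
    dp_size: int            # number of data-parallel ranks (core engines)
    tp_size: int            # per-NPU worker processes owned by each DP rank
    total_workers: int      # NPUs needed overall: DP x TP
    expert_group_size: int  # size of the EP/TP group formed by expert layers


def parallel_layout(dp_size: int, tp_size: int) -> ParallelLayout:
    """Derive the worker count and expert group size from DP and TP sizes."""
    if dp_size < 1 or tp_size < 1:
        raise ValueError("dp_size and tp_size must be positive")
    total = dp_size * tp_size
    return ParallelLayout(dp_size, tp_size, total, total)


# DP=4 and TP=4 need 16 NPUs, for example two nodes with 8 NPUs each.
layout = parallel_layout(dp_size=4, tp_size=4)
print(layout.total_workers)
```

A quick check like this helps confirm that the chosen DP and TP sizes match the number of NPUs that are actually available before launching the server.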
|
||||
|
||||
## Verify Multi-Node Communication Environment
|
||||
|
||||
### Physical Layer Requirements:
|
||||
|
||||
- The physical machines must be located on the same LAN, with network connectivity between them.
|
||||
- All NPUs are connected with optical modules, and the connection status must be normal.
|
||||
|
||||
### Verification Process:
|
||||
|
||||
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
|
||||
|
||||
```bash
|
||||
# Check the remote switch ports
|
||||
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
|
||||
# Get the link status of the Ethernet ports (UP or DOWN)
|
||||
for i in {0..7}; do hccn_tool -i $i -link -g ; done
|
||||
# Check the network health status
|
||||
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
|
||||
# View the network detected IP configuration
|
||||
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
|
||||
# View gateway configuration
|
||||
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
|
||||
# View NPU network configuration
|
||||
cat /etc/hccn.conf
|
||||
```
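
When there are many NPUs, scanning the link status by eye is error-prone. The sketch below shows one way to flag NPUs whose captured output does not report `UP`. It is an illustration only: the function name is hypothetical, the sample strings are made up, and it assumes nothing about the `hccn_tool` output beyond the words `UP` and `DOWN` mentioned above.

```python
def npus_with_link_down(link_outputs: dict[int, str]) -> list[int]:
    """Return the NPU ids whose captured link output does not report UP."""
    failing = []
    for npu_id, text in sorted(link_outputs.items()):
        # Split into words so that "UP" is matched as a whole token.
        tokens = text.upper().replace(":", " ").split()
        if "UP" not in tokens or "DOWN" in tokens:
            failing.append(npu_id)
    return failing


# Hypothetical captured output; the exact wording printed by hccn_tool may differ.
sample = {0: "link status: UP", 1: "link status: DOWN", 2: "link status: UP"}
print(npus_with_link_down(sample))
```

An empty list means every captured output reported `UP`.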
|
||||
|
||||
### NPU Interconnect Verification:
|
||||
#### 1. Get NPU IP Addresses
|
||||
|
||||
```bash
|
||||
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
|
||||
```
|
||||
|
||||
#### 2. Cross-Node PING Test
|
||||
|
||||
```bash
|
||||
# Execute on the target node (replace with actual IP)
|
||||
hccn_tool -i 0 -ping -g address 10.20.0.20
|
||||
```
|
||||
|
||||
## Run with docker
|
||||
Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3.1-w8a8` quantized model across both nodes.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: if you are running Docker with a bridge network, expose the ports needed for multi-node communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /mnt/sfs_turbo/.cache:/root/.cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Run the following scripts on the two nodes respectively.
|
||||
|
||||
:::{note}
|
||||
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
|
||||
:::
|
||||
|
||||
**node0**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
# These values can be obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=100
|
||||
export HCCL_BUFFSIZE=1024
|
||||
|
||||
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to quantize the model manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
|
||||
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8004 \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-address $local_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 4 \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek_v3.1 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 32768 \
|
||||
--quantization ascend \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
|
||||
```
|
||||
|
||||
**node1**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
nic_name="xxx"
|
||||
local_ip="xxx"
|
||||
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=100
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=1024
|
||||
|
||||
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8004 \
|
||||
--headless \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-start-rank 2 \
|
||||
--data-parallel-address { node0 ip } \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--tensor-parallel-size 4 \
|
||||
--seed 1024 \
|
||||
--quantization ascend \
|
||||
--served-model-name deepseek_v3.1 \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 32768 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
|
||||
```
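
The two scripts differ mainly in their data-parallel flags. As a minimal sketch of that pattern, the hypothetical helper below derives the flags for each node, assuming that every node hosts the same number of local DP ranks and that ranks are assigned contiguously, node by node (which matches the values used above).

```python
def dp_flags_for_node(node_index: int, dp_size: int, dp_size_local: int) -> list[str]:
    """Build the data-parallel flags for one node (hypothetical helper)."""
    if dp_size % dp_size_local != 0:
        raise ValueError("dp_size must be a multiple of dp_size_local")
    num_nodes = dp_size // dp_size_local
    if not 0 <= node_index < num_nodes:
        raise ValueError(f"node_index must be in [0, {num_nodes})")
    flags = [
        "--data-parallel-size", str(dp_size),
        "--data-parallel-size-local", str(dp_size_local),
    ]
    if node_index > 0:
        # Secondary nodes run headless and start at an offset DP rank.
        flags = ["--headless"] + flags + [
            "--data-parallel-start-rank", str(node_index * dp_size_local),
        ]
    return flags


print(" ".join(dp_flags_for_node(node_index=1, dp_size=4, dp_size_local=2)))
```

For node1 in the example above this yields a start rank of 2, the same value the script passes to `--data-parallel-start-rank`.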
|
||||
|
||||
The Deployment view looks like:
|
||||

|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```shell
|
||||
curl http://{ node0 ip }:8004/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "deepseek_v3.1",
|
||||
"prompt": "The future of AI is",
|
||||
"max_tokens": 50,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
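
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the fields of the `curl` call; the helper name `build_completion_request` and the host name `node0-ip` are placeholders introduced for illustration.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 50,
                             temperature: float = 0.0) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "node0-ip" is a placeholder host name; replace it with the address of node0.
request = build_completion_request("http://node0-ip:8004", "deepseek_v3.1",
                                   "The future of AI is")
print(request.full_url)

# To send the request once the server is up:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["choices"][0]["text"])
```

Building the request separately from sending it makes it easy to inspect the URL and payload before the server is reachable.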
|
||||
|
||||
## Run benchmarks
|
||||
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
|
||||
|
||||
```shell
|
||||
export VLLM_USE_MODELSCOPE=true
|
||||
vllm bench serve --model vllm-ascend/DeepSeek-V3.1-W8A8 --served-model-name deepseek_v3.1 \
|
||||
--dataset-name random --random-input-len 128 --random-output-len 128 \
|
||||
--num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
|
||||
```
|
||||
153
docs/source/tutorials/multi_node_kimi.md
Normal file
@@ -0,0 +1,153 @@
|
||||
# Multi-Node-DP (Kimi-K2)
|
||||
|
||||
## Verify Multi-Node Communication Environment
|
||||
|
||||
Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
|
||||
|
||||
## Run with docker
|
||||
Assume you have two Atlas 800 A3 (64G*16) nodes (or 4 A2 nodes) and want to deploy the `Kimi-K2-Instruct-W8A8` quantized model across multiple nodes.
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note: if you are running Docker with a bridge network, expose the ports needed for multi-node communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci8 \
|
||||
--device /dev/davinci9 \
|
||||
--device /dev/davinci10 \
|
||||
--device /dev/davinci11 \
|
||||
--device /dev/davinci12 \
|
||||
--device /dev/davinci13 \
|
||||
--device /dev/davinci14 \
|
||||
--device /dev/davinci15 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /mnt/sfs_turbo/.cache:/home/cache \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Run the following scripts on the two nodes respectively.
|
||||
|
||||
:::{note}
|
||||
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
|
||||
:::
|
||||
|
||||
**node0**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
# These values can be obtained through ifconfig
|
||||
# nic_name is the network interface name corresponding to local_ip
|
||||
nic_name="xxxx"
|
||||
local_ip="xxxx"
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=100
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=1024
|
||||
|
||||
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to quantize the model manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
|
||||
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8004 \
|
||||
--data-parallel-size 4 \
|
||||
--api-server-count 2 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-address $local_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--seed 1024 \
|
||||
--served-model-name kimi \
|
||||
--quantization ascend \
|
||||
--tensor-parallel-size 8 \
|
||||
--enable-expert-parallel \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 32768 \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
|
||||
```
|
||||
|
||||
**node1**
|
||||
|
||||
```shell
|
||||
#!/bin/sh
|
||||
|
||||
nic_name="xxxx"
|
||||
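local_ip="xxxx"
# node0_ip is the IP address of node0; it is passed to --data-parallel-address below
node0_ip="xxxx"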
|
||||
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=100
|
||||
export VLLM_USE_V1=1
|
||||
export HCCL_BUFFSIZE=1024
|
||||
|
||||
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8004 \
|
||||
--headless \
|
||||
--data-parallel-size 4 \
|
||||
--data-parallel-size-local 2 \
|
||||
--data-parallel-start-rank 2 \
|
||||
--data-parallel-address $node0_ip \
|
||||
--data-parallel-rpc-port 13389 \
|
||||
--seed 1024 \
|
||||
--tensor-parallel-size 8 \
|
||||
--served-model-name kimi \
|
||||
--max-num-seqs 16 \
|
||||
--max-model-len 32768 \
|
||||
--quantization ascend \
|
||||
--max-num-batched-tokens 4096 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--no-enable-prefix-caching \
|
||||
--gpu-memory-utilization 0.92 \
|
||||
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
|
||||
```
|
||||
|
||||
The Deployment view looks like:
|
||||

|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```shell
|
||||
curl http://{ node0 ip }:8004/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "kimi",
|
||||
"prompt": "The future of AI is",
|
||||
"max_tokens": 50,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
|
||||
107
docs/source/tutorials/multi_npu.md
Normal file
@@ -0,0 +1,107 @@
|
||||
# Multi-NPU (QwQ 32B)
|
||||
|
||||
## Run vllm-ascend on Multi-NPU
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
### Online Inference on Multi-NPU
|
||||
|
||||
Run the following script to start the vLLM server on Multi-NPU:
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/QwQ-32B",
        "prompt": "QwQ-32B是什么?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.6
    }'
```
|
||||
|
||||
### Offline Inference on Multi-NPU
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU:
|
||||
|
||||
```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/QwQ-32B",
          tensor_parallel_size=4,
          distributed_executor_backend="mp",
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
|
||||
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
|
||||
```
|
||||
242
docs/source/tutorials/multi_npu_moge.md
Normal file
@@ -0,0 +1,242 @@
|
||||
# Multi-NPU (Pangu Pro MoE)
|
||||
|
||||
## Run vllm-ascend on Multi-NPU
|
||||
|
||||
Run container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
Download the model:
|
||||
|
||||
```bash
|
||||
git lfs install
|
||||
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
|
||||
```
|
||||
|
||||
### Online Inference on Multi-NPU
|
||||
|
||||
Run the following script to start the vLLM server on Multi-NPU:
|
||||
|
||||
```bash
|
||||
vllm serve /path/to/pangu-pro-moe-model \
|
||||
--tensor-parallel-size 4 \
|
||||
--enable-expert-parallel \
|
||||
--trust-remote-code \
|
||||
--enforce-eager
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} v1/completions
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
export question="你是谁?"
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
|
||||
"max_tokens": 64,
|
||||
"top_p": 0.95,
|
||||
"top_k": 50,
|
||||
"temperature": 0.6
|
||||
}'
|
||||
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} v1/chat/completions
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"messages": [
|
||||
{"role": "system", "content": ""},
|
||||
{"role": "user", "content": "你是谁?"}
|
||||
],
|
||||
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6,
|
||||
"add_special_tokens" : true
|
||||
}'
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
If you run this successfully, you can see the info shown below:
|
||||
|
||||
```json
|
||||
{"id":"cmpl-2cd4223228ab4be9a91f65b882e65b32","object":"text_completion","created":1751255067,"model":"/root/.cache/pangu-pro-moe-model","choices":[{"index":0,"text":" [unused16] 好的,用户问我是谁,我需要根据之前的设定来回答。用户提到我是华为开发的“盘古Reasoner”,属于盘古大模型系列,作为智能助手帮助解答问题和提供 信息支持。现在用户再次询问,可能是在确认我的身份或者测试我的回答是否一致。\n\n首先,我要确保","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":15,"total_tokens":79,"completion_tokens":64,"prompt_tokens_details":null},"kv_transfer_params":null}
|
||||
```
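
To pull the generated text out of such a response programmatically, something like the following sketch can be used. The helper name is hypothetical and the sample body is a trimmed, made-up response that keeps only the fields read here; it follows the same shape as the response shown above.

```python
import json


def first_completion_text(response_body: str) -> str:
    """Return the text of the first choice in a completions response."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


# Trimmed, hypothetical response body that keeps only the fields used here.
sample = (
    '{"object": "text_completion", '
    '"choices": [{"index": 0, "text": " hello", "finish_reason": "length"}], '
    '"usage": {"prompt_tokens": 15, "completion_tokens": 64, "total_tokens": 79}}'
)
print(repr(first_completion_text(sample)))
```

A `finish_reason` of `length`, as in the response above, indicates that generation stopped because `max_tokens` was reached.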
|
||||
|
||||
### Offline Inference on Multi-NPU
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU:
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} Graph Mode
|
||||
|
||||
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


if __name__ == "__main__":

    tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
    tests = [
        "Hello, my name is",
        "The future of AI is",
    ]
    prompts = []
    for text in tests:
        messages = [
            {"role": "system", "content": ""},    # Optionally customize system content
            {"role": "user", "content": text}
        ]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        prompts.append(prompt)

    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

    llm = LLM(model="/path/to/pangu-pro-moe-model",
              tensor_parallel_size=4,
              enable_expert_parallel=True,
              distributed_executor_backend="mp",
              max_model_len=1024,
              trust_remote_code=True,
              additional_config={
                  'torchair_graph_config': {
                      'enabled': True,
                  },
                  'ascend_scheduler_config': {
                      'enabled': True,
                      'enable_chunked_prefill': False,
                      'chunked_prefill_enabled': False
                  },
              })

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Eager Mode
|
||||
|
||||
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


if __name__ == "__main__":

    tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
    tests = [
        "Hello, my name is",
        "The future of AI is",
    ]
    prompts = []
    for text in tests:
        messages = [
            {"role": "system", "content": ""},    # Optionally customize system content
            {"role": "user", "content": text}
        ]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        prompts.append(prompt)

    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

    llm = LLM(model="/path/to/pangu-pro-moe-model",
              tensor_parallel_size=4,
              distributed_executor_backend="mp",
              max_model_len=1024,
              trust_remote_code=True,
              enforce_eager=True)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
|
||||
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
|
||||
```
|
||||
137
docs/source/tutorials/multi_npu_quantization.md
Normal file
@@ -0,0 +1,137 @@
|
||||
# Multi-NPU (QwQ 32B W8A8)
|
||||
|
||||
## Run docker container
|
||||
:::{note}
|
||||
The w8a8 quantization feature is supported in v0.8.4rc2 or later.
|
||||
:::
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
## Install modelslim and convert model
|
||||
:::{note}
|
||||
You can choose to convert the model yourself or use the quantized model we uploaded,
|
||||
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
|
||||
:::
|
||||
|
||||
```bash
# (Optional) This tag is recommended and has been verified
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001

cd msit/msmodelslim
# Install by running this script
bash install.sh
pip install accelerate

cd example/Qwen
# Original weight path; replace with your local model path
MODEL_PATH=/home/models/QwQ-32B
# Path to save the converted weights; replace with your local path
SAVE_PATH=/home/models/QwQ-32B-w8a8

# An NPU device is not required for this conversion; you can also set --device_type cpu
python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True
```
|
||||
|
||||
## Verify the quantized model
|
||||
The converted model files look like:
|
||||
|
||||
```bash
|
||||
.
|
||||
|-- config.json
|
||||
|-- configuration.json
|
||||
|-- generation_config.json
|
||||
|-- quant_model_description.json
|
||||
|-- quant_model_weight_w8a8.safetensors
|
||||
|-- README.md
|
||||
|-- tokenizer.json
|
||||
`-- tokenizer_config.json
|
||||
```
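
The listing can also be checked automatically before starting the server. The sketch below is one possible check, not part of modelslim or vLLM: the helper name is hypothetical, and treating every listed file as required is an assumption (larger checkpoints might, for example, split the weights across several files).

```python
from pathlib import Path

# File names taken from the listing above; README.md is left out on purpose.
EXPECTED_FILES = [
    "config.json",
    "configuration.json",
    "generation_config.json",
    "quant_model_description.json",
    "quant_model_weight_w8a8.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_model_files(model_dir: str) -> list[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example with the save path used in the conversion step.
print(missing_model_files("/home/models/QwQ-32B-w8a8"))
```

An empty list means all the expected files were found in the directory.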
|
||||
|
||||
Run the following script to start the vLLM server with quantized model:
|
||||
|
||||
:::{note}
|
||||
The value "ascend" for the "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released. For now, you can cherry-pick this commit.
|
||||
:::
|
||||
|
||||
```bash
|
||||
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwq-32b-w8a8",
        "prompt": "what is large language model?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.0
    }'
```
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU with quantized model:
|
||||
|
||||
:::{note}
|
||||
To enable quantization on Ascend, the quantization method must be "ascend".
|
||||
:::
|
||||
|
||||
```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/home/models/QwQ-32B-w8a8",
          tensor_parallel_size=4,
          distributed_executor_backend="mp",
          max_model_len=4096,
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```
|
||||
109
docs/source/tutorials/multi_npu_qwen3_moe.md
Normal file
@@ -0,0 +1,109 @@
|
||||
# Multi-NPU (Qwen3-30B-A3B)
|
||||
|
||||
## Run vllm-ascend on Multi-NPU with Qwen3 MoE
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
### Online Inference on Multi-NPU
|
||||
|
||||
Run the following script to start the vLLM server on Multi-NPU:
|
||||
|
||||
On an Atlas A2 with 64 GB of memory per NPU card, `--tensor-parallel-size` should be at least 2; with 32 GB per card, it should be at least 4.
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable-expert-parallel
|
||||
```
|
||||
|
||||
Once the server has started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
|
||||
"model": "Qwen/Qwen3-30B-A3B",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Give me a short introduction to large language models."}
|
||||
],
|
||||
"temperature": 0.6,
|
||||
"top_p": 0.95,
|
||||
"top_k": 20,
|
||||
"max_tokens": 4096
|
||||
}'
|
||||
```
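
The memory guidance given for `tensor-parallel-size` earlier in this section can be written as a small lookup. This is only a sketch of the stated rule (at least 2 for 64 GB cards, at least 4 for 32 GB cards); the function names are hypothetical and are not part of vLLM or vllm-ascend.

```python
def min_tensor_parallel_size(card_memory_gb):
    """Return the smallest tensor-parallel size suggested for Qwen3-30B-A3B.

    Hypothetical helper that encodes only the two cases stated in the text.
    """
    if card_memory_gb >= 64:
        return 2
    if card_memory_gb >= 32:
        return 4
    raise ValueError(
        f"no guidance for cards with {card_memory_gb} GB of memory")


def is_tensor_parallel_size_ok(tp_size, card_memory_gb):
    """Check a chosen tensor-parallel size against the guidance."""
    return tp_size >= min_tensor_parallel_size(card_memory_gb)
```

For example, `is_tensor_parallel_size_ok(4, 32)` holds, while a size of 2 on 32 GB cards falls below the suggested minimum.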
|
||||
|
||||
### Offline Inference on Multi-NPU
|
||||
|
||||
Run the following script to execute offline inference on multi-NPU:
|
||||
|
||||
```python
import gc
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/Qwen3-30B-A3B",
          tensor_parallel_size=4,
          distributed_executor_backend="mp",
          max_model_len=4096,
          enable_expert_parallel=True)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Prompt: 'Hello, my name is', Generated text: " Lucy. I'm from the UK and I'm 11 years old."
|
||||
Prompt: 'The future of AI is', Generated text: ' a topic that has captured the imagination of scientists, philosophers, and the general public'
|
||||
```
|
||||
406
docs/source/tutorials/single_node_300i.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# Single Node (Atlas 300I series)
|
||||
|
||||
```{note}
|
||||
1. Support for the Atlas 300I series is currently experimental. Future versions may change behavior around model coverage and performance.
2. Currently, the Atlas 300I series supports only eager mode, and the data type must be float16.
|
||||
```
|
||||
|
||||
## Run vLLM on Atlas 300I series
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-310p
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
### Online Inference on NPU
|
||||
|
||||
Run the following commands to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Qwen2.5-VL-3B-Instruct: 1 card, Pangu-Pro-MoE-72B: 8 cards):
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: inference
|
||||
|
||||
::::{tab-item} Qwen3-0.6B
|
||||
:selected:
|
||||
:sync: qwen0.6
|
||||
|
||||
Run the following command to start the vLLM server:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
vllm serve Qwen/Qwen3-0.6B \
|
||||
--tensor-parallel-size 1 \
|
||||
--enforce-eager \
|
||||
--dtype float16 \
|
||||
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
|
||||
```
|
||||
|
||||
Once the server has started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"prompt": "The future of AI is",
|
||||
"max_tokens": 64,
|
||||
"top_p": 0.95,
|
||||
"top_k": 50,
|
||||
"temperature": 0.6
|
||||
}'
|
||||
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Qwen2.5-7B-Instruct
|
||||
:sync: qwen7b
|
||||
|
||||
Run the following command to start the vLLM server:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
vllm serve Qwen/Qwen2.5-7B-Instruct \
|
||||
--tensor-parallel-size 2 \
|
||||
--enforce-eager \
|
||||
--dtype float16 \
|
||||
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
|
||||
```
|
||||
|
||||
Once the server has started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"prompt": "The future of AI is",
|
||||
"max_tokens": 64,
|
||||
"top_p": 0.95,
|
||||
"top_k": 50,
|
||||
"temperature": 0.6
|
||||
}'
|
||||
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Qwen2.5-VL-3B-Instruct
|
||||
:sync: qwen-vl-2.5-3b
|
||||
|
||||
Run the following command to start the vLLM server:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
|
||||
--tensor-parallel-size 1 \
|
||||
--enforce-eager \
|
||||
--dtype float16 \
|
||||
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
|
||||
```
|
||||
|
||||
Once the server has started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"prompt": "The future of AI is",
|
||||
"max_tokens": 64,
|
||||
"top_p": 0.95,
|
||||
"top_k": 50,
|
||||
"temperature": 0.6
|
||||
}'
|
||||
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Pangu-Pro-MoE-72B
|
||||
:sync: pangu
|
||||
|
||||
Download the model:
|
||||
|
||||
```bash
|
||||
git lfs install
|
||||
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
|
||||
```
|
||||
|
||||
Run the following command to start the vLLM server:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
|
||||
vllm serve /home/pangu-pro-moe-mode/ \
|
||||
--tensor-parallel-size 4 \
|
||||
--enable-expert-parallel \
|
||||
--dtype "float16" \
|
||||
--trust-remote-code \
|
||||
--enforce-eager
|
||||
|
||||
```
|
||||
|
||||
Once the server has started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
export question="你是谁?"  # "Who are you?"
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"prompt": "[unused9]系统:[unused10][unused9]用户:'"${question}"'[unused10][unused9]助手:",
|
||||
"max_tokens": 64,
|
||||
"top_p": 0.95,
|
||||
"top_k": 50,
|
||||
"temperature": 0.6
|
||||
}'
|
||||
```
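
The prompt in the request above follows a fixed layout of system, user, and assistant segments delimited by `[unused9]` and `[unused10]`. The sketch below rebuilds that string in Python so the question can be swapped safely. The function names are hypothetical, and placing a non-empty system message inside the first segment is an assumption; the tokenizer's official chat template remains the safer route when it is available.

```python
import json


def build_pangu_prompt(question, system=""):
    """Rebuild the hand-written Pangu prompt layout used in the curl request.

    Assumption: a non-empty system message belongs in the first segment.
    """
    return (f"[unused9]系统:{system}[unused10]"
            f"[unused9]用户:{question}[unused10]"
            "[unused9]助手:")


def build_payload(question):
    """Serialize a completions payload with the same sampling settings."""
    payload = {
        "prompt": build_pangu_prompt(question),
        "max_tokens": 64,
        "top_p": 0.95,
        "top_k": 50,
        "temperature": 0.6,
    }
    # ensure_ascii=False keeps the Chinese role markers readable in the body.
    return json.dumps(payload, ensure_ascii=False)
```

Building the payload with `json.dumps` avoids the shell quoting needed to splice `${question}` into a JSON string by hand.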
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
If the request succeeds, the server returns the generated text in its response.
|
||||
|
||||
### Offline Inference
|
||||
|
||||
Run the following script (`example.py`) to execute offline inference on NPU:
|
||||
|
||||
:::::{tab-set}
|
||||
:sync-group: inference
|
||||
|
||||
::::{tab-item} Qwen3-0.6B
|
||||
:selected:
|
||||
:sync: qwen0.6
|
||||
|
||||
```{code-block} python
|
||||
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    tensor_parallel_size=1,
    enforce_eager=True,  # For 300I series, only eager mode is supported.
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on the 300I series
    compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]},  # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Qwen2.5-7B-Instruct
|
||||
:sync: qwen7b
|
||||
|
||||
```{code-block} python
|
||||
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,
    enforce_eager=True,  # For 300I series, only eager mode is supported.
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on the 300I series
    compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]},  # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Qwen2.5-VL-3B-Instruct
|
||||
:sync: qwen-vl-2.5-3b
|
||||
|
||||
```{code-block} python
|
||||
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    tensor_parallel_size=1,
    enforce_eager=True,  # For 300I series, only eager mode is supported.
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on the 300I series
    compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]},  # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Pangu-Pro-MoE-72B
|
||||
:sync: pangu
|
||||
|
||||
Download the model:
|
||||
|
||||
```bash
|
||||
git lfs install
|
||||
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
|
||||
```
|
||||
|
||||
```{code-block} python
|
||||
:substitutions:
import gc
from transformers import AutoTokenizer
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


if __name__ == "__main__":

    tokenizer = AutoTokenizer.from_pretrained("/home/pangu-pro-moe-mode/", trust_remote_code=True)
    tests = [
        "Hello, my name is",
        "The future of AI is",
    ]
    prompts = []
    for text in tests:
        messages = [
            {"role": "system", "content": ""},  # Optionally customize system content
            {"role": "user", "content": text}
        ]
        # Using the official chat template is recommended.
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        prompts.append(prompt)
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

    llm = LLM(model="/home/pangu-pro-moe-mode/",
              tensor_parallel_size=8,
              distributed_executor_backend="mp",
              enable_expert_parallel=True,
              dtype="float16",
              max_model_len=1024,
              trust_remote_code=True,
              enforce_eager=True)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
Run the script:
|
||||
|
||||
```bash
|
||||
python example.py
|
||||
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Prompt: 'Hello, my name is', Generated text: " Lina. I'm a 22-year-old student from China. I'm interested in studying in the US. I'm looking for a job in the US. I want to know if there are any opportunities in the US for me to work. I'm also interested in the culture and lifestyle in the US. I want to know if there are any opportunities for me to work in the US. I'm also interested in the culture and lifestyle in the US. I'm interested in the culture"
|
||||
Prompt: 'The future of AI is', Generated text: " not just about the technology itself, but about how we use it to solve real-world problems. As AI continues to evolve, it's important to consider the ethical implications of its use. AI has the potential to bring about significant changes in society, but it also has the power to create new challenges. Therefore, it's crucial to develop a comprehensive approach to AI that takes into account both the benefits and the risks associated with its use. This includes addressing issues such as bias, privacy, and accountability."
|
||||
```
|
||||
202
docs/source/tutorials/single_npu.md
Normal file
@@ -0,0 +1,202 @@
|
||||
# Single NPU (Qwen3 8B)
|
||||
|
||||
## Run vllm-ascend on Single NPU
|
||||
|
||||
### Offline Inference on Single NPU
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
:::{note}
|
||||
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
|
||||
:::
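
When the inference script is launched outside a prepared shell, the same two variables can be set from Python before `vllm` is imported. The sketch below is one hedged way to do that. `apply_ascend_defaults` is a hypothetical helper, and it only fills in values that are not already set, so an explicit shell `export` still wins.

```python
import os


def apply_ascend_defaults(env=os.environ):
    """Fill in the tutorial's environment variables if they are unset.

    Hypothetical helper; `env` may be any mutable mapping, which also
    makes the function easy to try on a plain dict.
    """
    env.setdefault("VLLM_USE_MODELSCOPE", "True")
    env.setdefault("PYTORCH_NPU_ALLOC_CONF", "max_split_size_mb:256")
    return env


# Call this before importing vllm so the settings are visible at import time.
apply_ascend_defaults()
```

Using `setdefault` rather than plain assignment keeps any value the user has already chosen, for example a different `max_split_size_mb`.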
|
||||
|
||||
Run the following script to execute offline inference on a single NPU:
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} Graph Mode
|
||||
|
||||
```{code-block} python
|
||||
:substitutions:
import os
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
    model="Qwen/Qwen3-8B",
    max_model_len=26240
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Eager Mode
|
||||
|
||||
```{code-block} python
|
||||
:substitutions:
import os
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
    model="Qwen/Qwen3-8B",
    max_model_len=26240,
    enforce_eager=True
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
|
||||
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
|
||||
```
|
||||
|
||||
### Online Serving on Single NPU
|
||||
|
||||
Run docker container to start the vLLM server on a single NPU:
|
||||
|
||||
:::::{tab-set}
|
||||
::::{tab-item} Graph Mode
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-e VLLM_USE_MODELSCOPE=True \
|
||||
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
|
||||
-it $IMAGE \
|
||||
vllm serve Qwen/Qwen3-8B --max_model_len 26240
|
||||
```
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Eager Mode
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-e VLLM_USE_MODELSCOPE=True \
|
||||
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
|
||||
-it $IMAGE \
|
||||
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
|
||||
```
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
:::{note}
|
||||
Add the `--max_model_len` option to avoid the ValueError raised when the model's maximum sequence length (for example, 32768 for Qwen2.5-7B) exceeds the maximum number of tokens that can be stored in the KV cache (26240). This limit differs across NPU series depending on HBM size, so adjust the value to suit your NPU series.
|
||||
:::
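
The rule in the note above amounts to capping the context length at whatever the KV cache can hold. A minimal sketch, assuming both numbers are already known (for instance from the error message); `pick_max_model_len` is a hypothetical helper, not a vLLM function.

```python
def pick_max_model_len(model_max_seq_len, kv_cache_token_capacity):
    """Return a context length that fits both the model and the KV cache."""
    if model_max_seq_len <= 0 or kv_cache_token_capacity <= 0:
        raise ValueError("both limits must be positive")
    # The smaller of the two limits is the largest value that satisfies both.
    return min(model_max_seq_len, kv_cache_token_capacity)


# Numbers taken from the note: 32768 model limit, 26240 tokens of KV cache.
print(pick_max_model_len(32768, 26240))
```

The result is the value to pass to `--max_model_len`; on an NPU with more HBM the KV cache capacity grows and the model's own limit becomes the binding one.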
|
||||
|
||||
If your service starts successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
INFO: Started server process [6873]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen3-8B",
|
||||
"prompt": "The future of AI is",
|
||||
"max_tokens": 7,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen3-8B","choices":[{"index":0,"text":" here. It’s not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
|
||||
```
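
To use such a response in a program, the interesting fields are the generated text, the finish reason, and the token usage. The sketch below parses the response shown above with the standard library; `summarize_completion` is a hypothetical helper, and the JSON literal is copied from that response.

```python
import json

# Response body copied from the example above.
RAW_RESPONSE = (
    '{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion",'
    '"created":1739523925,"model":"Qwen/Qwen3-8B","choices":[{"index":0,'
    '"text":" here. It’s not just a","logprobs":null,"finish_reason":"length",'
    '"stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,'
    '"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}'
)


def summarize_completion(raw_json):
    """Extract the generated text and token accounting from a response."""
    data = json.loads(raw_json)
    choice = data["choices"][0]
    usage = data["usage"]
    return {
        "text": choice["text"],
        # "length" means generation stopped because max_tokens was reached.
        "hit_token_limit": choice["finish_reason"] == "length",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


summary = summarize_completion(RAW_RESPONSE)
print(summary["text"], summary["completion_tokens"])
```

Here the finish reason is `length` and `completion_tokens` equals the requested `max_tokens` of 7, so the text was cut off by the token limit rather than by a stop condition.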
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
|
||||
INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
|
||||
```
|
||||
122
docs/source/tutorials/single_npu_audio.md
Normal file
@@ -0,0 +1,122 @@
|
||||
# Single NPU (Qwen2-Audio 7B)
|
||||
|
||||
## Run vllm-ascend on Single NPU
|
||||
|
||||
### Offline Inference on Single NPU
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
:::{note}
|
||||
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
|
||||
:::
|
||||
|
||||
Install packages required for audio processing:
|
||||
|
||||
```bash
|
||||
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
|
||||
pip install librosa soundfile
|
||||
```
|
||||
|
||||
Run the following script to execute offline inference on a single NPU:
|
||||
|
||||
```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.utils import FlexibleArgumentParser

# If network issues prevent AudioAsset from fetching remote audio files, retry or check your network.
audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")]
question_per_audio_count = {
    1: "What is recited in the audio?",
    2: "What sport and what nursery rhyme are referenced?"
}


def prepare_inputs(audio_count: int):
    audio_in_prompt = "".join([
        f"Audio {idx+1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
        for idx in range(audio_count)
    ])
    question = question_per_audio_count[audio_count]
    prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
              "<|im_start|>user\n"
              f"{audio_in_prompt}{question}<|im_end|>\n"
              "<|im_start|>assistant\n")

    mm_data = {
        "audio":
        [asset.audio_and_sample_rate for asset in audio_assets[:audio_count]]
    }

    # Merge text prompt and audio data into inputs
    inputs = {"prompt": prompt, "multi_modal_data": mm_data}
    return inputs


def main(audio_count: int):
    # NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
    # lower-end GPUs.
    # Unless specified, these settings have been tested to work on a single L4.
    # `limit_mm_per_prompt`: the max num items for each modality per prompt.
    llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
              max_model_len=4096,
              max_num_seqs=5,
              limit_mm_per_prompt={"audio": audio_count})

    inputs = prepare_inputs(audio_count)

    sampling_params = SamplingParams(temperature=0.2,
                                     max_tokens=64,
                                     stop_token_ids=None)

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    audio_count = 2
    main(audio_count)
```
|
||||
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'.
|
||||
```
|
||||
|
||||
### Online Serving on Single NPU
|
||||
|
||||
Currently, vLLM's OpenAI-compatible server doesn't support audio inputs. Find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).
|
||||
192
docs/source/tutorials/single_npu_multimodal.md
Normal file
@@ -0,0 +1,192 @@
|
||||
# Single NPU (Qwen2.5-VL 7B)
|
||||
|
||||
## Run vllm-ascend on Single NPU
|
||||
|
||||
### Offline Inference on Single NPU
|
||||
|
||||
Run docker container:
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
docker run --rm \
|
||||
--name vllm-ascend \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /root/.cache:/root/.cache \
|
||||
-p 8000:8000 \
|
||||
-it $IMAGE bash
|
||||
```
|
||||
|
||||
Set up environment variables:
|
||||
|
||||
```bash
|
||||
# Load model from ModelScope to speed up download
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
|
||||
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
|
||||
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
|
||||
```
|
||||
|
||||
:::{note}
|
||||
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
|
||||
:::
|
||||
|
||||
Install `qwen_vl_utils`, then run the following script to execute offline inference on a single NPU:
|
||||
|
||||
```bash
|
||||
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
|
||||
```

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

# Render the chat messages into a single prompt string
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Load the images referenced in the messages; video outputs are ignored here
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)
```
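
The `min_pixels` and `max_pixels` values in the script bound how much of the context an image may consume. The following is a minimal sketch of that arithmetic, assuming each visual token covers a 28×28 pixel area (an assumption suggested by the `28 * 28` factor above; check the model card for the exact rule). The helper name `visual_token_budget` is hypothetical and not part of any library.

```python
# Assumption: one visual token corresponds to a 28x28 pixel area.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def visual_token_budget(pixels: int) -> int:
    """Approximate number of visual tokens for an image with `pixels` pixels."""
    return pixels // PIXELS_PER_VISUAL_TOKEN


# Bounds taken from the script above
print(visual_token_budget(224 * 224))       # lower bound -> 64
print(visual_token_budget(1280 * 28 * 28))  # upper bound -> 1280
```

Under this assumption, a single image uses at most about 1280 of the 16384 tokens allowed by `max_model_len`, which leaves room for several images per prompt.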

If the script runs successfully, you will see output similar to the following:

```bash
The image displays a logo consisting of two main elements: a stylized geometric design and a pair of text elements.
1. **Geometric Design**: On the left side of the image, there is a blue geometric design that appears to be made up of interconnected shapes. These shapes resemble a network or a complex polygonal structure, possibly hinting at a technological or interconnected theme. The design is monochromatic and uses only blue as its color, which could be indicative of a specific brand or company.
2. **Text Elements**: To the right of the geometric design, there are two lines of text. The first line reads "TONGYI" in a sans-serif font, with the "YI" part possibly being capitalized. The second line reads "Qwen" in a similar sans-serif font, but in a smaller size.
The overall design is modern and minimalist, with a clear contrast between the geometric and textual elements. The use of blue for the geometric design could suggest themes of technology, connectivity, or innovation, which are common associations with the color blue in branding. The simplicity of the design makes it easily recognizable and memorable.
```

### Online Serving on Single NPU

Run docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max_model_len 16384 \
--max-num-batched-tokens 16384
```

:::{note}
Add the `--max_model_len` option to avoid the ValueError raised when the model's maximum sequence length (128000 for Qwen2.5-VL-7B-Instruct) is larger than the number of tokens that can be stored in the KV cache. The limit differs between NPU series because it depends on the HBM size, so adjust the value to suit your hardware.
:::
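
To see why the sequence length matters, here is a rough back-of-the-envelope sketch of KV-cache size per sequence. The layer count, KV-head count, and head dimension below are assumptions for a 7B-class Qwen2.5 backbone and should be checked against the model's `config.json`; the function name is hypothetical, and vLLM's actual accounting (block allocation, memory utilization settings) is more involved.

```python
# Assumed architecture values; verify against the model's config.json.
NUM_LAYERS = 28       # assumed number of transformer layers
NUM_KV_HEADS = 4      # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128        # assumed per-head dimension
BYTES_PER_VALUE = 2   # bfloat16, matching `--dtype bfloat16`


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to store keys and values for `num_tokens` tokens."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for length in (16384, 128000):
    print(f"{length} tokens -> {kv_cache_bytes(length) / 2**30:.2f} GiB")
```

Under these assumptions, a full-length sequence needs several times more cache than a 16384-token one, on top of the memory already taken by the model weights.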

If your service starts successfully, you will see output like the following:

```bash
INFO: Started server process [2736]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustrate?"}
    ]}
    ]
    }'
```

If you query the server successfully, you will see a response like the following on the client:

```bash
{"id":"chatcmpl-f04fb20e79bb40b39b8ed7fdf5bd613a","object":"chat.completion","created":1741749149,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The text in the illustration reads \"TONGYI Qwen.\"","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":74,"total_tokens":89,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null}
```
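
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:8000`, as in the `-p 8000:8000` mapping above. The helper names (`build_payload`, `extract_answer`, `ask`) are hypothetical and only illustrate the request and response shapes shown in the curl example.

```python
import json
import urllib.request

# Assumed server address; adjust if you mapped a different port.
API_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(image_url: str, question: str) -> dict:
    """Build the same chat request body as the curl example."""
    return {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ]},
        ],
    }


def extract_answer(response: dict) -> str:
    """Pull the assistant's text out of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def ask(image_url: str, question: str) -> str:
    """Send the request to the running server and return the answer text."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(image_url, question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_answer(json.load(reply))
```

Calling `ask(...)` with the image URL and question from the curl example should return the `content` field of the response shown above, provided the server is running.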

Logs of the vLLM server:

```bash
INFO 03-12 11:16:50 logger.py:39] Received request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\nWhat is the text in the illustrate?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16353, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-12 11:16:50 engine.py:280] Added request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb.
INFO: 127.0.0.1:54004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
99
docs/source/tutorials/single_npu_qwen3_embedding.md
Normal file
@@ -0,0 +1,99 @@
# Single NPU (Qwen3-Embedding-8B)

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only vLLM Ascend 0.9.2rc1 and later versions support the model.

## Run docker container

Taking the Qwen3-Embedding-8B model as an example, first run the docker container with the following command:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

Set up environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

### Online Inference

```bash
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Embedding-8B",
  "messages": [
    {"role": "user", "content": "Hello"}
  ]
}'
```

### Offline Inference

```python
import torch
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


if __name__ == "__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'

    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-Embedding-8B",
                task="embed",
                distributed_executor_backend="mp")

    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    # Rows correspond to queries, columns to documents
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())
```

If the script runs successfully, you will see output similar to the following:

```bash
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
```
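
In the printed matrix, each row corresponds to a query and each column to a document, so the largest value in a row marks the best-matching document. The script uses a plain dot product, which equals cosine similarity only if the embeddings are unit-length (an assumption about the model's pooling configuration). The pure-Python sketch below normalizes explicitly; the helper names `cosine_scores` and `best_match` are hypothetical.

```python
import math


def cosine_scores(queries, documents):
    """Return a matrix of cosine similarities: rows are queries, columns are documents."""
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    unit_queries = [normalize(v) for v in queries]
    unit_documents = [normalize(v) for v in documents]
    return [
        [sum(a * b for a, b in zip(q, d)) for d in unit_documents]
        for q in unit_queries
    ]


def best_match(scores):
    """For each query row, return the index of the highest-scoring document."""
    return [row.index(max(row)) for row in scores]


# Score matrix printed by the script above
printed_scores = [
    [0.7477798461914062, 0.07548339664936066],
    [0.0886271521449089, 0.6311039924621582],
]
print(best_match(printed_scores))  # [0, 1]
```

Here the first query matches the first document and the second query matches the second, as expected from the example texts.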
133
docs/source/tutorials/single_npu_qwen3_quantization.md
Normal file
@@ -0,0 +1,133 @@
# Single-NPU (Qwen3 8B W4A8)

## Run docker container

:::{note}
The W4A8 quantization feature is supported in v0.9.1rc2 and later.
:::

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

## Install modelslim and convert model

:::{note}
You can either convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8
:::

```bash
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit

cd msit/msmodelslim

# Install by running this script
bash install.sh
pip install accelerate

cd example/Qwen
# Original weight path; replace with your local model path
MODEL_PATH=/home/models/Qwen3-8B
# Path to save the converted weights; replace with your local path
SAVE_PATH=/home/models/Qwen3-8B-w4a8

python quant_qwen.py \
    --model_path $MODEL_PATH \
    --save_directory $SAVE_PATH \
    --device_type npu \
    --model_type qwen3 \
    --calib_file None \
    --anti_method m6 \
    --anti_calib_file ./calib_data/mix_dataset.json \
    --w_bit 4 \
    --a_bit 8 \
    --is_lowbit True \
    --open_outlier False \
    --group_size 256 \
    --is_dynamic True \
    --trust_remote_code True \
    --w_method HQQ
```
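
For a rough sense of what `--w_bit 4` with `--group_size 256` means for weight storage, the sketch below estimates the footprint. It assumes about 8 billion weights, that every weight is quantized, and that each group stores one 16-bit scale; these are illustrative assumptions, not a description of how modelslim lays out its files, and the function name is hypothetical.

```python
def approx_weight_gib(num_weights, weight_bits, group_size=None, scale_bits=16):
    """Estimate weight storage in GiB, optionally adding one scale per group."""
    bits_per_weight = weight_bits
    if group_size:
        # Assumed overhead: one scale value shared by each group of weights
        bits_per_weight += scale_bits / group_size
    return num_weights * bits_per_weight / 8 / 2**30


NUM_WEIGHTS = 8e9  # assumed parameter count for an 8B model

print(f"bf16 weights: {approx_weight_gib(NUM_WEIGHTS, 16):.1f} GiB")
print(f"4-bit weights, group size 256: {approx_weight_gib(NUM_WEIGHTS, 4, 256):.1f} GiB")
```

In practice some tensors (for example embeddings and normalization weights) typically stay in higher precision, so the real checkpoint is likely somewhat larger than this estimate.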

## Verify the quantized model
The converted model files look like this:

```bash
.
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- merges.txt
|-- quant_model_description.json
|-- quant_model_weight_w4a8_dynamic-00001-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic-00002-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic-00003-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic.safetensors.index.json
|-- README.md
|-- tokenizer.json
`-- tokenizer_config.json
```
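
Before starting the server, it can help to confirm that the conversion produced a complete set of files. The following is a minimal sketch with a hypothetical helper, `missing_files`; it assumes the index file follows the common safetensors layout with a `weight_map` that maps tensor names to shard file names.

```python
import json
from pathlib import Path

# File names taken from the listing above
REQUIRED_FILES = [
    "config.json",
    "quant_model_description.json",
    "quant_model_weight_w4a8_dynamic.safetensors.index.json",
]


def missing_files(model_dir: str) -> list[str]:
    """Return the expected files that are absent from `model_dir`."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]

    index_path = root / REQUIRED_FILES[-1]
    if index_path.is_file():
        # Assumption: the index has a "weight_map" of tensor name -> shard file
        index = json.loads(index_path.read_text())
        shards = set(index.get("weight_map", {}).values())
        missing += sorted(s for s in shards if not (root / s).is_file())
    return missing
```

For example, `missing_files("/home/models/Qwen3-8B-w4a8")` returns an empty list when all listed files and referenced shards are present.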

Run the following command to start the vLLM server with the quantized model:

```bash
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-w4a8",
        "prompt": "what is large language model?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.0
    }'
```

Run the following script to execute offline inference on a single NPU with the quantized model:

:::{note}
To enable quantization on Ascend, the quantization method must be `ascend`.
:::

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/home/models/Qwen3-8B-w4a8",
          max_model_len=4096,
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```