v0.10.1rc1

2025-09-09 09:40:35 +08:00
parent d6f6ef41fe
commit 9149384e03
432 changed files with 84698 additions and 1 deletions


@@ -0,0 +1,9 @@
# Performance
:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
optimization_and_tuning
:::


@@ -0,0 +1,183 @@
# Optimization and Tuning
This guide helps users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and more. Any feedback is welcome.
## Preparation
Run the container:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
--name performance-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
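Before starting the container, it can save time to confirm that every device node and driver path passed to `docker run` exists on the host. The following is a minimal sketch for a POSIX shell; the helper name `report_missing_paths` is hypothetical and not part of vllm-ascend.

```bash
# Hypothetical helper: print every argument that does not exist on the host.
# Returns non-zero if at least one path is missing.
report_missing_paths() {
    missing=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            printf 'missing: %s\n' "$path"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the paths used in the docker command above:
# report_missing_paths "$DEVICE" /dev/davinci_manager /dev/devmm_svm /dev/hisi_hdc \
#     /usr/local/dcmi /usr/local/bin/npu-smi /usr/local/Ascend/driver/lib64/ \
#     /usr/local/Ascend/driver/version.info /etc/ascend_install.info
```

Run it on the host, not inside the container, since the check concerns the paths that will be mounted.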
Configure your environment:
```{code-block} bash
:substitutions:
# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list
# Install os packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y
```
Install vllm and vllm-ascend:
```{code-block} bash
:substitutions:
# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pytest
# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=true
```
Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vllm, vllm-ascend and mindie-turbo are installed correctly.
:::{note}
Install vllm and vllm-ascend only after your Python configuration is complete, because these packages build binary files with the Python in the current environment. If you install vllm, vllm-ascend and mindie-turbo before completing section 1.1, the binary files will not use the optimized Python.
:::
## Optimizations
### 1. Compilation Optimization
#### 1.1. Install optimized `python`
Python has supported **LTO** and **PGO** optimization since version `3.6`; both can be enabled at compile time. For convenience, we provide prebuilt, compilation-optimized `python` packages. You can also reproduce the `python` build for your specific scenario by following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html).
```{code-block} bash
:substitutions:
mkdir -p /workspace/tmp
cd /workspace/tmp
# Download prebuilt lib and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz
# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf /usr/local/python/bin/python3 /usr/bin/python
ln -sf /usr/local/python/bin/python3 /usr/bin/python3
ln -sf /usr/local/python/bin/python3.11 /usr/bin/python3.11
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip3
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip
export PATH=/usr/bin:/usr/local/python/bin:$PATH
```
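After switching interpreters, you may want to check how the active `python3` was configured. The sketch below reads the recorded `./configure` arguments through the standard `sysconfig` module and looks for CPython's `--enable-optimizations` (PGO) and `--with-lto` (LTO) flags. The helper names are hypothetical. A prebuilt package may have been configured differently (for example through compiler flags), so a missing flag is a hint rather than proof.

```bash
# Print the ./configure arguments the active python3 was built with.
python_configure_args() {
    python3 -c 'import sysconfig; print(sysconfig.get_config_var("CONFIG_ARGS") or "")'
}

# Hypothetical helper: succeed if the given argument string mentions both
# PGO (--enable-optimizations) and LTO (--with-lto).
has_pgo_and_lto() {
    case "$1" in
        *--enable-optimizations*) ;;
        *) return 1 ;;
    esac
    case "$1" in
        *--with-lto*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# if has_pgo_and_lto "$(python_configure_args)"; then echo "PGO and LTO enabled"; fi
```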
### 2. OS Optimization
#### 2.1. jemalloc
**jemalloc** is a memory allocator that improves performance in multi-threaded scenarios and can reduce memory fragmentation. jemalloc uses a thread-local memory manager for allocations, which avoids lock contention between threads and can substantially improve performance.
```{code-block} bash
:substitutions:
# Install jemalloc
sudo apt update
sudo apt install libjemalloc2
# Configure jemalloc
export LD_PRELOAD="/usr/lib/$(uname -i)-linux-gnu/libjemalloc.so.2${LD_PRELOAD:+:$LD_PRELOAD}"
```
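`LD_PRELOAD` takes a list of libraries separated by colons or spaces, and libraries listed earlier are loaded first. When several libraries are combined, building the value explicitly avoids stray separators. A minimal sketch, in which the helper name `prepend_preload` is hypothetical:

```bash
# Hypothetical helper: prepend a library to an LD_PRELOAD-style list,
# adding the ":" separator only when the list is non-empty.
prepend_preload() {
    lib="$1"
    current="${2-}"
    if [ -n "$current" ]; then
        printf '%s:%s\n' "$lib" "$current"
    else
        printf '%s\n' "$lib"
    fi
}

# Example: only export the variable if the library file really exists.
# jemalloc_lib="/usr/lib/$(uname -i)-linux-gnu/libjemalloc.so.2"
# [ -f "$jemalloc_lib" ] && export LD_PRELOAD="$(prepend_preload "$jemalloc_lib" "${LD_PRELOAD-}")"
```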
#### 2.2. Tcmalloc
**Tcmalloc (Thread-Caching Malloc)** is a general-purpose memory allocator that improves overall performance while keeping latency low. It does so by introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
```{code-block} bash
:substitutions:
# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev
# Get the location of libtcmalloc.so*
find /usr -name "libtcmalloc.so*"
# Preload tcmalloc so that it takes priority over the default allocator
# <path> is the location of libtcmalloc.so returned by the command above
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"
# Verify your configuration
# The path of libtcmalloc.so will be contained in the result if your configuration is valid
ldd `which python`
```
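`ldd` reports what the dynamic loader would resolve for a binary. To confirm that an allocator is actually mapped into a process that is already running, you can inspect `/proc/<pid>/maps` on Linux. A minimal sketch, in which the helper name `is_lib_mapped` is hypothetical:

```bash
# Hypothetical helper: succeed if the memory map of process PID mentions NAME.
# Linux only. The optional third argument points at a saved copy of a maps
# file instead of /proc/<pid>/maps.
is_lib_mapped() {
    pid="$1"
    name="$2"
    maps_file="${3:-/proc/$pid/maps}"
    grep -q -F -- "$name" "$maps_file"
}

# Example (replace <pid> with the PID of your running server process):
# is_lib_mapped <pid> libtcmalloc && echo "tcmalloc is loaded"
```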
### 3. `torch_npu` Optimization
Some performance tuning features in `torch_npu` are controlled by environment variables. Some features and their related environment variables are shown below.
Memory optimization:
```{code-block} bash
:substitutions:
# Upper limit (in MB) on the size of memory blocks that may be split. Setting this parameter prevents large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"
# When operators on the communication stream have dependencies, memory is normally released for reuse only after all of them have finished. Multi-stream reuse releases the memory on the communication stream early so that the computing stream can reuse it.
# Note: this export overwrites the value set above, so choose the setting that fits your workload.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
```
Schedule optimization:
```{code-block} bash
:substitutions:
# Optimize the operator dispatch queue. This affects peak memory usage and may degrade performance when memory is tight.
export TASK_QUEUE_ENABLE=2
# This can greatly improve performance for CPU-bound models while keeping performance unchanged for NPU-bound models.
export CPU_AFFINITY_CONF=1
```
### 4. CANN Optimization
#### 4.1. HCCL Optimization
HCCL provides several performance tuning features that are controlled by environment variables.
You can configure HCCL to use "AIV" mode by setting the environment variable shown below. In "AIV" mode, communication is scheduled directly by the AI Vector Core over RoCE instead of by the AI CPU.
```{code-block} bash
:substitutions:
export HCCL_OP_EXPANSION_MODE="AIV"
```
In addition, HCCL offers further tuning options for specific scenarios:
- `HCCL_INTRA_ROCE_ENABLE`: Uses the RDMA link instead of the SDMA link as the mesh interconnect between two 8Ps. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Configures the traffic class of the RDMA network card. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Configures the service level of the RDMA network card. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Controls the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
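Because all of these switches are plain environment variables, it is easy to lose track of which ones are active in a given shell. The sketch below prints the current state of the variables named in this guide so that it can be recorded next to a benchmark result. The helper name `show_tuning_env` is hypothetical.

```bash
# Hypothetical helper: report the tuning variables from this guide.
# printenv only sees exported variables, which is what child processes
# such as python will actually inherit.
show_tuning_env() {
    for name in PYTORCH_NPU_ALLOC_CONF TASK_QUEUE_ENABLE CPU_AFFINITY_CONF \
                HCCL_OP_EXPANSION_MODE HCCL_INTRA_ROCE_ENABLE HCCL_RDMA_TC \
                HCCL_RDMA_SL HCCL_BUFFSIZE; do
        if value=$(printenv "$name"); then
            printf '%s=%s\n' "$name" "$value"
        else
            printf '%s is unset\n' "$name"
        fi
    done
}

# Example:
# show_tuning_env > tuning_env.txt
```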


@@ -0,0 +1,194 @@
# Performance Benchmark
This document describes the benchmark methodology for vllm-ascend, which evaluates performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project.
**Benchmark Coverage**: We measure offline end-to-end latency and throughput, as well as online serving at fixed QPS. For more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
## 1. Run docker container
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
## 2. Install dependencies
```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```
## 3. (Optional) Prepare model weights
To speed up the run, we recommend downloading the model in advance:
```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```
You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
```json
[
{
"test_name": "latency_llama8B_tp1",
"parameters": {
"model": "your local model path",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
}
]
```
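If you have several test files to update, the replacement can be scripted. The sketch below rewrites every `"model"` value in one file with `sed`. The helper name `set_model_path` is hypothetical, and the approach assumes the `"model": "..."` pair sits on a single line, as in the example above.

```bash
# Hypothetical helper: rewrite every "model" value in a benchmark JSON file.
# Assumes the new path contains no "|", "&" or "\" characters, which are
# special in the sed replacement below.
set_model_path() {
    file="$1"
    new_path="$2"
    tmp=$(mktemp) || return 1
    sed 's|"model": *"[^"]*"|"model": "'"$new_path"'"|' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (the local path is a placeholder):
# for f in benchmarks/tests/*.json; do set_model_path "$f" /path/to/local/model; done
```

Keys such as `"test_name"` are left untouched because the pattern only matches the exact key `"model"`.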
## 4. Run benchmark script
Run benchmark script:
```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```
After about 10 minutes, the output looks like this:
```bash
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 212.77
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 0.94
Output token throughput (tok/s): 204.66
Total Token throughput (tok/s): 405.16
---------------Time to First Token----------------
Mean TTFT (ms): 104.14
Median TTFT (ms): 102.22
P99 TTFT (ms): 153.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.78
Median TPOT (ms): 38.70
P99 TPOT (ms): 48.03
---------------Inter-token Latency----------------
Mean ITL (ms): 38.46
Median ITL (ms): 36.96
P99 ITL (ms): 75.03
==================================================
qps 4:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.55
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 2.76
Output token throughput (tok/s): 600.24
Total Token throughput (tok/s): 1188.27
---------------Time to First Token----------------
Mean TTFT (ms): 115.62
Median TTFT (ms): 109.39
P99 TTFT (ms): 169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.48
Median TPOT (ms): 52.40
P99 TPOT (ms): 69.41
---------------Inter-token Latency----------------
Mean ITL (ms): 50.47
Median ITL (ms): 43.95
P99 ITL (ms): 130.29
==================================================
qps 16:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 47.82
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.18
Output token throughput (tok/s): 910.62
Total Token throughput (tok/s): 1802.70
---------------Time to First Token----------------
Mean TTFT (ms): 128.50
Median TTFT (ms): 128.36
P99 TTFT (ms): 187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 83.60
Median TPOT (ms): 77.85
P99 TPOT (ms): 165.90
---------------Inter-token Latency----------------
Mean ITL (ms): 65.72
Median ITL (ms): 54.84
P99 ITL (ms): 289.63
==================================================
qps inf:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 41.26
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1055.44
Total Token throughput (tok/s): 2089.40
---------------Time to First Token----------------
Mean TTFT (ms): 3394.37
Median TTFT (ms): 3359.93
P99 TTFT (ms): 3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.28
Median TPOT (ms): 64.19
P99 TPOT (ms): 97.66
---------------Inter-token Latency----------------
Mean ITL (ms): 56.62
Median ITL (ms): 55.69
P99 ITL (ms): 82.90
==================================================
offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds
throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```
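To compare one metric across the QPS settings, the console output can be filtered with `awk`. The sketch below assumes the output was saved to a log file and follows the layout shown above (a `qps <value>:` line followed by `<metric>: <number>` lines). The helper name `extract_metric` is hypothetical.

```bash
# Hypothetical helper: read benchmark output on stdin and print
# "<qps> <value>" for every occurrence of the given metric label.
extract_metric() {
    awk -v metric="$1" '
        # Remember the most recent "qps N:" header.
        /^qps [^:]+:$/ { qps = $2; sub(/:$/, "", qps) }
        # A metric line starts with the label followed by a colon;
        # the value is the last field on the line.
        index($0, metric ":") == 1 { print qps, $NF }
    '
}

# Example (benchmark.log is an assumed file name):
# extract_metric "Mean TTFT (ms)" < benchmark.log
```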
The result JSON files are written to `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
```


@@ -0,0 +1,40 @@
# Profile Execute Duration
When analyzing a complete inference run, you usually need the execution duration of each stage (including pre/post-processing, model forward, etc.). This is typically measured by calling `torch.npu.synchronize()` and reading CPU timestamps, but the forced host/device synchronization adds performance overhead.
**To reduce this overhead, this feature uses the NPU event timestamp mechanism to observe device execution time asynchronously.**
## Usage
* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**The key inference stages (including pre-processing, model forward pass, etc.) are already instrumented for execution-duration profiling. Run the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
## Example Output
```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```
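With many log lines like these, a per-stage average is often easier to read than the raw output. The sketch below assumes the line format shown above, where each stage appears as `[stage name]:<number>ms`, and averages every stage found on lines containing `Profile execute duration`. The helper name `summarize_durations` is hypothetical.

```bash
# Hypothetical helper: read log lines on stdin and print the mean duration
# of every "[stage]:<number>ms" entry on profiling lines.
summarize_durations() {
    awk '
        /Profile execute duration/ {
            line = $0
            # Consume one "[stage name]:12.34ms" item per iteration.
            while (match(line, /\[[^]]+\]:[0-9.]+ms/)) {
                item = substr(line, RSTART, RLENGTH)
                split_at = index(item, "]:")
                name = substr(item, 2, split_at - 2)
                value = substr(item, split_at + 2)
                sub(/ms$/, "", value)
                sum[name] += value
                count[name]++
                line = substr(line, RSTART + RLENGTH)
            }
        }
        END {
            for (name in sum)
                printf "%s: mean %.2f ms over %d samples\n", name, sum[name] / count[name], count[name]
        }
    '
}

# Example (output.log is an assumed file name):
# summarize_durations < output.log
```

The `[Decode]` tag is skipped automatically because it is not followed by a number. The order of the printed stages is not guaranteed.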