[v0.11.0][Doc] Update doc (#3852)

### What this PR does / why we need it?
Update doc


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
This commit is contained in:
zhangxinyuehfad
2025-10-29 11:32:12 +08:00
committed by GitHub
parent 6188450269
commit 75de3fa172
49 changed files with 724 additions and 701 deletions

View File

@@ -1,6 +1,6 @@
# Optimization and Tuning
This guide aims to help users to improve vllm-ascend performance on system level. It includes OS configuration, library optimization, deploy guide and so on. Any feedback is welcome.
This guide aims to help users to improve vllm-ascend performance on system level. It includes OS configuration, library optimization, deployment guide and so on. Any feedback is welcome.
## Preparation
@@ -57,10 +57,10 @@ pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pyt
VLLM_USE_MODELSCOPE=true
```
Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vllm, vllm-ascend and mindie-turbo is installed correctly.
Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vLLM, vllm-ascend, and MindIE Turbo are installed correctly.
:::{note}
Make sure your vllm and vllm-ascend are installed after your python configuration completed, because these packages will build binary files using the python in current environment. If you install vllm, vllm-ascend and mindie-turbo before chapter 1.1, the binary files will not use the optimized python.
Make sure your vLLM and vllm-ascend are installed after your python configuration is completed, because these packages will build binary files using python in current environment. If you install vLLM, vllm-ascend, and MindIE Turbo before completing section 1.1, the binary files will not use the optimized python.
:::
## Optimizations
@@ -69,7 +69,7 @@ Make sure your vllm and vllm-ascend are installed after your python configuratio
#### 1.1. Install optimized `python`
Python supports **LTO** and **PGO** optimization starting from version `3.6` and above, which can be enabled at compile time. And we have offered compilation optimized `python` packages directly to users for the sake of convenience. You can also reproduce the `python` build follow this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenarios.
Python supports **LTO** and **PGO** optimization starting from version `3.6` and above, which can be enabled at compile time. And we have offered optimized `python` packages directly to users for the sake of convenience. You can also reproduce the `python` built following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenarios.
```{code-block} bash
:substitutions:
@@ -101,7 +101,7 @@ export PATH=/usr/bin:/usr/local/python/bin:$PATH
#### 2.1. jemalloc
**jemalloc** is a memory allocator that improves performance for multi-threads scenario and can reduce memory fragment. jemalloc use thread local memory manager to allocate variables, which can avoid lock competition between multi-threads and can hugely optimize performance.
**jemalloc** is a memory allocator that improves performance for multi-thread scenarios and can reduce memory fragmentation. jemalloc uses local thread memory manager to allocate variables, which can avoid lock competition between threads and can hugely optimize performance.
```{code-block} bash
:substitutions:
@@ -144,18 +144,18 @@ Memory optimization:
```{code-block} bash
:substitutions:
# Upper limit of memory block splitting allowed (MB), Setting this parameter can prevent large memory blocks from being split.
# Upper limit of memory block splitting allowed (MB): Setting this parameter can prevent large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"
# When operators on the communication stream have dependencies, they all need to be ended before being released for reuse. The logic of multi-stream reuse is to release the memory on the communication stream in advance so that the computing stream can be reused.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
```
Schedule optimization:
Scheduling optimization:
```{code-block} bash
:substitutions:
# Optimize operator delivery queue, this will affect the memory peak value, and may degrade if the memory is tight.
# Optimize operator delivery queue. This will affect the memory peak value, and may degrade if the memory is tight.
export TASK_QUEUE_ENABLE=2
# This will greatly improve the CPU bottleneck model and ensure the same performance for the NPU bottleneck model.
@@ -168,7 +168,7 @@ export CPU_AFFINITY_CONF=1
There are some performance tuning features in HCCL, which are controlled by environment variables.
You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, the communication is scheduled by AI vector core directly with ROCE, instead of being scheduled by AI cpu.
You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, the communication is scheduled by AI vector core directly with RoCE, instead of being scheduled by AI CPU.
```{code-block} bash
:substitutions:
@@ -177,7 +177,7 @@ export HCCL_OP_EXPANSION_MODE="AIV"
Plus, there are more features for performance optimization in specific scenarios, which are shown below.
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA network card, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA network card, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).

View File

@@ -1,7 +1,7 @@
# Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
**Benchmark Coverage**: We measure offline e2e latency and throughput, and fixed-QPS online serving benchmarks, for more details see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
## 1. Run docker container
@@ -37,7 +37,7 @@ pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/si
pip install -r benchmarks/requirements-bench.txt
```
## 3. (Optional)Prepare model weights
## 3. (Optional) Prepare model weights
For faster running speed, we recommend downloading the model in advance
```bash
@@ -68,7 +68,7 @@ Run benchmark script:
bash benchmarks/scripts/run-performance-benchmarks.sh
```
After about 10 mins, the output is as shown below:
After about 10 mins, the output is shown below:
```bash
online serving:
@@ -180,7 +180,7 @@ Total num prompt tokens: 42659
Total num output tokens: 43545
```
The result json files are generated into the path `benchmark/results`
The result json files are generated into the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
```bash

View File

@@ -9,7 +9,7 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py