[Doc] Add patch doc (#1414)
1. Format the developer guide content to make it clearer.
2. Add the patch doc for the developer guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
8
docs/source/developer_guide/performance/index.md
Normal file
@@ -0,0 +1,8 @@
# Performance

:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
:::
187
docs/source/developer_guide/performance/performance_benchmark.md
Normal file
@@ -0,0 +1,187 @@
# Performance Benchmark

This document details the benchmark methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vllm project.

**Benchmark Coverage**: We measure offline end-to-end latency and throughput, as well as fixed-QPS online serving benchmarks. For more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

## 1. Run docker container
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
  --name vllm-ascend \
  --device $DEVICE \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
  -it $IMAGE \
  /bin/bash
```

## 2. Install dependencies

```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```

## 3. (Optional) Prepare model weights

To speed up the benchmark run, we recommend downloading the model in advance. By default, modelscope caches downloads under `~/.cache/modelscope`, so the `/root/.cache` volume mount above keeps them across container restarts:

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```
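
If you prefer to script the download, the modelscope Python SDK offers an equivalent; this small sketch prints the resolved local cache path:

```python
# Sketch: download (or reuse the cached copy of) the model via the modelscope SDK.
from modelscope import snapshot_download

local_path = snapshot_download("LLM-Research/Meta-Llama-3.1-8B-Instruct")
print(local_path)  # a directory under ~/.cache/modelscope by default
```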

You can also replace all model paths in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:

```json
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "your local model path",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
```
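
For example, a small script like the following can point every test at a local copy (a sketch, assuming each file under `benchmarks/tests` follows the list-of-tests layout shown above; `LOCAL_MODEL_PATH` is a placeholder):

```python
# Sketch: rewrite the "model" parameter of every benchmark test config.
import glob
import json

LOCAL_MODEL_PATH = "/root/.cache/my-models/Meta-Llama-3.1-8B-Instruct"  # placeholder

for path in glob.glob("benchmarks/tests/*.json"):
    with open(path) as f:
        tests = json.load(f)
    for test in tests:
        # Only touch entries that actually carry a model parameter.
        if "model" in test.get("parameters", {}):
            test["parameters"]["model"] = LOCAL_MODEL_PATH
    with open(path, "w") as f:
        json.dump(tests, f, indent=2)
```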

## 4. Run benchmark script

Run the benchmark script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```

After about 10 minutes, the output looks like this:

```bash
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  212.77
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              0.94
Output token throughput (tok/s):         204.66
Total Token throughput (tok/s):          405.16
---------------Time to First Token----------------
Mean TTFT (ms):                          104.14
Median TTFT (ms):                        102.22
P99 TTFT (ms):                           153.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.78
Median TPOT (ms):                        38.70
P99 TPOT (ms):                           48.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.46
Median ITL (ms):                         36.96
P99 ITL (ms):                            75.03
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  72.55
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              2.76
Output token throughput (tok/s):         600.24
Total Token throughput (tok/s):          1188.27
---------------Time to First Token----------------
Mean TTFT (ms):                          115.62
Median TTFT (ms):                        109.39
P99 TTFT (ms):                           169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.48
Median TPOT (ms):                        52.40
P99 TPOT (ms):                           69.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.47
Median ITL (ms):                         43.95
P99 ITL (ms):                            130.29
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  47.82
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              4.18
Output token throughput (tok/s):         910.62
Total Token throughput (tok/s):          1802.70
---------------Time to First Token----------------
Mean TTFT (ms):                          128.50
Median TTFT (ms):                        128.36
P99 TTFT (ms):                           187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.60
Median TPOT (ms):                        77.85
P99 TPOT (ms):                           165.90
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.72
Median ITL (ms):                         54.84
P99 ITL (ms):                            289.63
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  41.26
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              4.85
Output token throughput (tok/s):         1055.44
Total Token throughput (tok/s):          2089.40
---------------Time to First Token----------------
Mean TTFT (ms):                          3394.37
Median TTFT (ms):                        3359.93
P99 TTFT (ms):                           3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.28
Median TPOT (ms):                        64.19
P99 TPOT (ms):                           97.66
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.62
Median ITL (ms):                         55.69
P99 ITL (ms):                            82.90
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```

The result JSON files are generated in `benchmark/results`. These files contain detailed benchmarking results for further analysis.

```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
```
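
Since the files are plain JSON, a few lines of Python are enough to collect the headline numbers (a sketch; the exact key names vary between the latency, throughput, and serving outputs):

```python
# Sketch: peek at every generated result file.
import glob
import json

for path in sorted(glob.glob("benchmark/results/*.json")):
    with open(path) as f:
        data = json.load(f)
    print(path)
    if isinstance(data, dict):
        # Print the first few metrics of each result object.
        for key in list(data)[:5]:
            print(f"  {key}: {data[key]}")
```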
39
docs/source/developer_guide/performance/profile_execute_duration.md
Normal file
@@ -0,0 +1,39 @@

# Profile Execute Duration

The execution duration of each stage (including pre/post-processing, model forward, etc.) usually needs to be captured during a complete inference process. Typically this is done by calling `torch.npu.synchronize()` and taking CPU timestamps, which adds host/device synchronization overhead.
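
For reference, the synchronous pattern described above looks roughly like this (a sketch; `run_one_stage` is a placeholder for any stage of the pipeline):

```python
# Sketch of the blocking approach this feature avoids: every measurement
# forces the host to wait for the device, twice per stage.
import time

import torch
import torch_npu  # noqa: F401  # registers the torch.npu backend


def run_one_stage():
    pass  # placeholder for e.g. the model forward pass


torch.npu.synchronize()               # drain previously queued kernels
start = time.perf_counter()
run_one_stage()
torch.npu.synchronize()               # block until the stage really finishes
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"stage took {elapsed_ms:.2f} ms")
```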

**To reduce this overhead, this feature uses the NPU event timestamp mechanism to observe device execution time asynchronously.**

## Usage
* Set the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously wherever you need to measure execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to retrieve and print the execution durations of all observed stages; see the sketch after this list.
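
Putting the two APIs together, the instrumentation pattern looks roughly like this (a sketch only: the `vllm_ascend.utils` import path and the context-manager usage of `capture_async` are assumptions based on the descriptions above; the stage functions are placeholders):

```python
# Sketch of the observation-point pattern; not the exact vllm-ascend code.
from vllm_ascend.utils import ProfileExecuteDuration  # assumed import path

profiler = ProfileExecuteDuration()

with profiler.capture_async("prepare input"):   # non-blocking: records NPU events
    inputs = prepare_inputs(batch)              # placeholder stage
with profiler.capture_async("forward"):
    outputs = model(inputs)                     # placeholder stage
with profiler.capture_async("post process"):
    tokens = post_process(outputs)              # placeholder stage

# Later, at a safe point (e.g. the end of a step), block once to
# collect and print every captured duration:
profiler.pop_captured_sync()
```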

**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Run the script as follows:**

```bash
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```

## Example Output

```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```