# Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.

**Benchmark Coverage**: We measure offline e2e latency and throughput, and fixed-QPS online serving benchmarks, for more details see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

## 1. Run docker container

```{code-block} bash
   :substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```

## 2. Install dependencies

```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```

## 3. (Optional)Prepare model weights
For faster running speed, we recommend downloading the model in advance：

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```

You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:

```bash
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "your local model path",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
```

## 4. Run benchmark script
Run benchmark script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```

After about 10 mins, the output is as shown below:

```bash
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  212.77    
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              0.94      
Output token throughput (tok/s):         204.66    
Total Token throughput (tok/s):          405.16    
---------------Time to First Token----------------
Mean TTFT (ms):                          104.14    
Median TTFT (ms):                        102.22    
P99 TTFT (ms):                           153.82    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.78     
Median TPOT (ms):                        38.70     
P99 TPOT (ms):                           48.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.46     
Median ITL (ms):                         36.96     
P99 ITL (ms):                            75.03     
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  72.55     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              2.76      
Output token throughput (tok/s):         600.24    
Total Token throughput (tok/s):          1188.27   
---------------Time to First Token----------------
Mean TTFT (ms):                          115.62    
Median TTFT (ms):                        109.39    
P99 TTFT (ms):                           169.03    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.48     
Median TPOT (ms):                        52.40     
P99 TPOT (ms):                           69.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.47     
Median ITL (ms):                         43.95     
P99 ITL (ms):                            130.29    
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  47.82     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              4.18      
Output token throughput (tok/s):         910.62    
Total Token throughput (tok/s):          1802.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          128.50    
Median TTFT (ms):                        128.36    
P99 TTFT (ms):                           187.87    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.60     
Median TPOT (ms):                        77.85     
P99 TPOT (ms):                           165.90    
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.72     
Median ITL (ms):                         54.84     
P99 ITL (ms):                            289.63    
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  41.26     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              4.85      
Output token throughput (tok/s):         1055.44   
Total Token throughput (tok/s):          2089.40   
---------------Time to First Token----------------
Mean TTFT (ms):                          3394.37   
Median TTFT (ms):                        3359.93   
P99 TTFT (ms):                           3540.93   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.28     
Median TPOT (ms):                        64.19     
P99 TPOT (ms):                           97.66     
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.62     
Median ITL (ms):                         55.69     
P99 ITL (ms):                            82.90     
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens:  42659
Total num output tokens:  43545
```

The result json files are generated into the path `benchmark/results`
These files contain detailed benchmarking results for further analysis.

```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
```