diff --git a/benchmarks/requirements-bench.txt b/benchmarks/requirements-bench.txt
index e9af75e..b3f3c06 100644
--- a/benchmarks/requirements-bench.txt
+++ b/benchmarks/requirements-bench.txt
@@ -1,2 +1,3 @@
 pandas
-datasets
\ No newline at end of file
+datasets
+modelscope
\ No newline at end of file
diff --git a/docs/source/developer_guide/evaluation/index.md b/docs/source/developer_guide/evaluation/index.md
index b0f325f..68ffab8 100644
--- a/docs/source/developer_guide/evaluation/index.md
+++ b/docs/source/developer_guide/evaluation/index.md
@@ -7,3 +7,9 @@ using_opencompass
 using_lm_eval
 using_evalscope
 :::
+
+:::{toctree}
+:caption: Performance
+:maxdepth: 1
+performance_benchmark
+:::
\ No newline at end of file
diff --git a/docs/source/developer_guide/evaluation/performance_benchmark.md b/docs/source/developer_guide/evaluation/performance_benchmark.md
new file mode 100644
index 0000000..98daabe
--- /dev/null
+++ b/docs/source/developer_guide/evaluation/performance_benchmark.md
@@ -0,0 +1,187 @@
+# Performance Benchmark
+This document describes the benchmark methodology for vllm-ascend, which evaluates performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project.
+
+**Benchmark Coverage**: We measure offline end-to-end latency and throughput, as well as online serving performance at fixed QPS. For more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
+
+## 1. Run docker container
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+/bin/bash
+```
+
+## 2. Install dependencies
+```bash
+cd /workspace/vllm-ascend
+pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+pip install -r benchmarks/requirements-bench.txt
+```
+
+## 3. (Optional) Prepare model weights
+To shorten the benchmark run, we recommend downloading the model in advance:
+```bash
+modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
+```
+
+You can also replace the model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
+```json
+[
+  {
+    "test_name": "latency_llama8B_tp1",
+    "parameters": {
+      "model": "your local model path",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  }
+]
+```
+
+## 4. Run benchmark script
+Run the benchmark script:
+```bash
+bash benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+After about 10 minutes, the output looks like this:
+```bash
+online serving:
+qps 1:
+============ Serving Benchmark Result ============
+Successful requests: 200
+Benchmark duration (s): 212.77
+Total input tokens: 42659
+Total generated tokens: 43545
+Request throughput (req/s): 0.94
+Output token throughput (tok/s): 204.66
+Total Token throughput (tok/s): 405.16
+---------------Time to First Token----------------
+Mean TTFT (ms): 104.14
+Median TTFT (ms): 102.22
+P99 TTFT (ms): 153.82
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms): 38.78
+Median TPOT (ms): 38.70
+P99 TPOT (ms): 48.03
+---------------Inter-token Latency----------------
+Mean ITL (ms): 38.46
+Median ITL (ms): 36.96
+P99 ITL (ms): 75.03
+==================================================
+
+qps 4:
+============ Serving Benchmark Result ============
+Successful requests: 200
+Benchmark duration (s): 72.55
+Total input tokens: 42659
+Total generated tokens: 43545
+Request throughput (req/s): 2.76
+Output token throughput (tok/s): 600.24
+Total Token throughput (tok/s): 1188.27
+---------------Time to First Token----------------
+Mean TTFT (ms): 115.62
+Median TTFT (ms): 109.39
+P99 TTFT (ms): 169.03
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms): 51.48
+Median TPOT (ms): 52.40
+P99 TPOT (ms): 69.41
+---------------Inter-token Latency----------------
+Mean ITL (ms): 50.47
+Median ITL (ms): 43.95
+P99 ITL (ms): 130.29
+==================================================
+
+qps 16:
+============ Serving Benchmark Result ============
+Successful requests: 200
+Benchmark duration (s): 47.82
+Total input tokens: 42659
+Total generated tokens: 43545
+Request throughput (req/s): 4.18
+Output token throughput (tok/s): 910.62
+Total Token throughput (tok/s): 1802.70
+---------------Time to First Token----------------
+Mean TTFT (ms): 128.50
+Median TTFT (ms): 128.36
+P99 TTFT (ms): 187.87
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms): 83.60
+Median TPOT (ms): 77.85
+P99 TPOT (ms): 165.90
+---------------Inter-token Latency----------------
+Mean ITL (ms): 65.72
+Median ITL (ms): 54.84
+P99 ITL (ms): 289.63
+==================================================
+
+qps inf:
+============ Serving Benchmark Result ============
+Successful requests: 200
+Benchmark duration (s): 41.26
+Total input tokens: 42659
+Total generated tokens: 43545
+Request throughput (req/s): 4.85
+Output token throughput (tok/s): 1055.44
+Total Token throughput (tok/s): 2089.40
+---------------Time to First Token----------------
+Mean TTFT (ms): 3394.37
+Median TTFT (ms): 3359.93
+P99 TTFT (ms): 3540.93
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms): 66.28
+Median TPOT (ms): 64.19
+P99 TPOT (ms): 97.66
+---------------Inter-token Latency----------------
+Mean ITL (ms): 56.62
+Median ITL (ms): 55.69
+P99 ITL (ms): 82.90
+==================================================
+
+offline:
+latency:
+Avg latency: 4.944929537673791 seconds
+10% percentile latency: 4.894104263186454 seconds
+25% percentile latency: 4.909652255475521 seconds
+50% percentile latency: 4.932477846741676 seconds
+75% percentile latency: 4.9608619548380375 seconds
+90% percentile latency: 5.035418218374252 seconds
+99% percentile latency: 5.052476694583893 seconds
+
+throughput:
+Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
+Total num prompt tokens: 42659
+Total num output tokens: 43545
+```
+The result JSON files are written to `benchmark/results`.
+These files contain detailed benchmarking results for further analysis.
+
+```bash
+.
+|-- latency_llama8B_tp1.json
+|-- serving_llama8B_tp1_qps_1.json
+|-- serving_llama8B_tp1_qps_16.json
+|-- serving_llama8B_tp1_qps_4.json
+|-- serving_llama8B_tp1_qps_inf.json
+`-- throughput_llama8B_tp1.json
+```
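
Note on reading the serving output in the new doc page: the three throughput lines appear to follow directly from the request and token counts divided by the benchmark duration. The sketch below checks this against the `qps 1` numbers reported in the sample output. The function name `derive_throughput` is hypothetical and is not part of the benchmark scripts; small differences from the printed values are expected because the duration is shown rounded to two decimals.

```python
def derive_throughput(requests, input_tokens, output_tokens, duration_s):
    """Recompute throughput figures from raw counts and wall-clock duration.

    Hypothetical helper for illustration only; it is not part of the
    vllm-ascend or vLLM benchmark scripts.
    """
    return {
        "request_throughput": requests / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Values copied from the "qps 1" block of the sample output.
qps1 = derive_throughput(
    requests=200,
    input_tokens=42659,
    output_tokens=43545,
    duration_s=212.77,
)

# Figures printed by the benchmark for the same run, for comparison.
reported = {
    "request_throughput": 0.94,
    "output_token_throughput": 204.66,
    "total_token_throughput": 405.16,
}

for name, value in qps1.items():
    print(f"{name}: derived {value:.2f}, reported {reported[name]:.2f}")
```

The same check can be repeated for the other QPS settings by substituting their durations, since the request and token counts are identical across runs.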