[Doc] Refact benchmark doc (#5173)
### What this PR does / why we need it?
Refactor the outdated benchmark documentation.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -25,7 +25,6 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
@@ -38,158 +37,203 @@ pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/si
pip install -r benchmarks/requirements-bench.txt
```

## 3. (Optional) Prepare model weights
For faster running speed, we recommend downloading the model in advance:
## 3. Run basic benchmarks
This section introduces how to perform performance testing using the benchmark suite built into vLLM.

### 3.1 Dataset
vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).

<style>
th {
  min-width: 0 !important;
}
</style>

| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
| RandomForReranking | ✅ | ✅ | `synthetic` |
| Prefix Repetition | ✅ | ✅ | `synthetic` |
| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
| HuggingFace-MMVU | ✅ | ✅ | `yale-nlp/MMVU` |
| HuggingFace-InstructCoder | ✅ | ✅ | `likaixin/InstructCoder` |
| HuggingFace-AIMO | ✅ | ✅ | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
| HuggingFace-MTBench | ✅ | ✅ | `philschmid/mt-bench` |
| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
| Custom | ✅ | ✅ | Local file: `data.jsonl` |

:::{note}
The datasets mentioned above are all links to datasets on Hugging Face.
For these datasets, set `dataset-name` to `hf`.
For a local `dataset-path`, set `hf-name` to the dataset's Hugging Face ID, for example:

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
```

You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
:::
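
As a minimal sketch of the note above, the three flags for a locally stored Hugging Face dataset can be assembled by a small helper. The function name `hf_dataset_args` is hypothetical and not part of vLLM; only the flag names and the example path and ID are taken from the note.

```shell
# hf_dataset_args <local_path> <hf_id>
# Hypothetical helper (not part of vLLM): prints the dataset flags described in the note.
hf_dataset_args() {
  printf '%s\n' "--dataset-name hf --dataset-path $1 --hf-name $2"
}

# Example with the values used in the note.
hf_dataset_args /datasets/VisionArena-Chat/ lmarena-ai/VisionArena-Chat
```

The printed flags can then be appended to a `vllm bench serve` invocation.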

### 3.2 Run basic benchmark

#### 3.2.1 Online serving

First start serving your model:

```bash
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "your local model path",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
```
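
The server can take a while to load the model, so it may help to wait for it before starting a benchmark. The sketch below polls an arbitrary command until it succeeds. The helper name `wait_until` and the retry values are illustrative, and the commented usage assumes the server listens on `localhost:8000` and exposes a `/health` endpoint.

```shell
# wait_until <max_attempts> <sleep_seconds> <command...>
# Hypothetical helper: re-runs <command> until it succeeds or the attempts run out.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0          # the probe succeeded
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1              # gave up after <max_attempts> tries
}

# Assumed usage (host, port, and endpoint are assumptions, adjust to your setup):
# wait_until 60 5 curl -sf http://localhost:8000/health && echo "server is ready"
```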

## 4. Run benchmark script
Run benchmark script:
Then run the benchmarking script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10
```
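
To compare several load levels, the same command could be repeated at different request rates. The following is a dry-run sketch that only prints the commands. It assumes the `--request-rate` option of `vllm bench serve`, which does not appear in the command above, so check `vllm bench serve --help` for your version. The rates 1, 4, 16, and inf mirror the QPS labels used in the earlier sample results; `DATASET` and `bench_command` are placeholder names.

```shell
# Dry run: print one `vllm bench serve` command per request rate instead of executing it.
# DATASET is a placeholder; point it at your ShareGPT JSON file.
DATASET="${DATASET:-ShareGPT_V3_unfiltered_cleaned_split.json}"

bench_command() {
  # $1: request rate in requests per second, or "inf" to send all requests at once
  printf '%s\n' "vllm bench serve --backend vllm --model Qwen/Qwen3-8B --endpoint /v1/completions --dataset-name sharegpt --dataset-path $DATASET --num-prompts 10 --request-rate $1"
}

for rate in 1 4 16 inf; do
  bench_command "$rate"
done
```

Piping the printed lines to `sh` would execute them one after another against a running server.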

After about 10 minutes, the output is shown below:
If successful, you will see the following output:

```bash
online serving:
qps 1:
```shell
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 212.77
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 0.94
Output token throughput (tok/s): 204.66
Total Token throughput (tok/s): 405.16
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 19.92
Total input tokens: 1374
Total generated tokens: 2663
Request throughput (req/s): 0.50
Output token throughput (tok/s): 133.67
Peak output token throughput (tok/s): 312.00
Peak concurrent requests: 10.00
Total Token throughput (tok/s): 202.64
---------------Time to First Token----------------
Mean TTFT (ms): 104.14
Median TTFT (ms): 102.22
P99 TTFT (ms): 153.82
Mean TTFT (ms): 127.10
Median TTFT (ms): 136.29
P99 TTFT (ms): 137.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.78
Median TPOT (ms): 38.70
P99 TPOT (ms): 48.03
Mean TPOT (ms): 25.85
Median TPOT (ms): 25.78
P99 TPOT (ms): 26.64
---------------Inter-token Latency----------------
Mean ITL (ms): 38.46
Median ITL (ms): 36.96
P99 ITL (ms): 75.03
Mean ITL (ms): 25.78
Median ITL (ms): 25.74
P99 ITL (ms): 28.85
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.55
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 2.76
Output token throughput (tok/s): 600.24
Total Token throughput (tok/s): 1188.27
---------------Time to First Token----------------
Mean TTFT (ms): 115.62
Median TTFT (ms): 109.39
P99 TTFT (ms): 169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.48
Median TPOT (ms): 52.40
P99 TPOT (ms): 69.41
---------------Inter-token Latency----------------
Mean ITL (ms): 50.47
Median ITL (ms): 43.95
P99 ITL (ms): 130.29
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 47.82
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.18
Output token throughput (tok/s): 910.62
Total Token throughput (tok/s): 1802.70
---------------Time to First Token----------------
Mean TTFT (ms): 128.50
Median TTFT (ms): 128.36
P99 TTFT (ms): 187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 83.60
Median TPOT (ms): 77.85
P99 TPOT (ms): 165.90
---------------Inter-token Latency----------------
Mean ITL (ms): 65.72
Median ITL (ms): 54.84
P99 ITL (ms): 289.63
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 41.26
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1055.44
Total Token throughput (tok/s): 2089.40
---------------Time to First Token----------------
Mean TTFT (ms): 3394.37
Median TTFT (ms): 3359.93
P99 TTFT (ms): 3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.28
Median TPOT (ms): 64.19
P99 TPOT (ms): 97.66
---------------Inter-token Latency----------------
Mean ITL (ms): 56.62
Median ITL (ms): 55.69
P99 ITL (ms): 82.90
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```
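
The throughput lines in a report can be cross-checked against its counts. The sketch below assumes each throughput figure is a count divided by the benchmark duration, and uses the values from the 10-request report above. The helper name `rate` is illustrative. Results should land close to the reported figures; small differences in the last digit are expected because the printed duration is itself rounded.

```shell
# rate <count> <duration_seconds>
# Prints count per second, rounded to two decimals.
rate() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

# Values taken from the 10-request serving report.
rate 10 19.92      # compare with "Request throughput (req/s)"
rate 2663 19.92    # compare with "Output token throughput (tok/s)"
```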

The result json files are generated into the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
#### 3.2.2 Offline Throughput Benchmark

```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
export VLLM_USE_MODELSCOPE=True
vllm bench throughput \
  --model Qwen/Qwen3-8B \
  --dataset-name random \
  --input-len 128 \
  --output-len 128
```

If successful, you will see the following output:

```shell
Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens: 1280
Total num output tokens: 1280
```
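
With the `random` dataset and fixed lengths, the token totals in the report can be sanity-checked by hand. The sketch assumes 10 prompts, as the progress bar in the sample output shows (`10/10`), and 128 tokens per prompt from `--input-len` and `--output-len`. The helper name `expected_tokens` is illustrative.

```shell
# expected_tokens <num_prompts> <tokens_per_prompt>
# Prints the total token count for fixed-length prompts or outputs.
expected_tokens() {
  echo $(( $1 * $2 ))
}

expected_tokens 10 128   # prints 1280, matching both totals in the sample output
```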

#### 3.2.4 Multi-Modal Benchmark

```shell
export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
  --limit-mm-per-prompt '{"image": 1}' \
  --allowed-local-media-path /path/to/sharegpt4v/images
```

```shell
export HF_ENDPOINT="https://hf-mirror.com"
vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
  --backend "openai-chat" \
  --dataset-name hf \
  --hf-split train \
  --endpoint "/v1/chat/completions" \
  --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
  --num-prompts 10 \
  --no-stream
```

```shell
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 4.89
Total input tokens: 7191
Total generated tokens: 951
Request throughput (req/s): 2.05
Output token throughput (tok/s): 194.63
Peak output token throughput (tok/s): 290.00
Peak concurrent requests: 10.00
Total Token throughput (tok/s): 1666.35
---------------Time to First Token----------------
Mean TTFT (ms): 722.22
Median TTFT (ms): 589.81
P99 TTFT (ms): 1377.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.13
Median TPOT (ms): 34.58
P99 TPOT (ms): 124.72
---------------Inter-token Latency----------------
Mean ITL (ms): 33.14
Median ITL (ms): 28.01
P99 ITL (ms): 182.28
==================================================
```
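
When comparing runs, it can be convenient to pull a single metric out of a saved report. The sketch below assumes the report was saved to a text file in the `name: value` layout shown above. The helper name `metric_value` and the file name `report.txt` are hypothetical.

```shell
# metric_value <metric name>
# Reads a benchmark report on stdin and prints the value of the first matching metric line.
metric_value() {
  awk -v name="$1" -F': *' '$1 == name { print $2; exit }'
}

# Assumed usage with a report saved to a file:
# metric_value "Median TTFT (ms)" < report.txt
```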

#### 3.2.5 Embedding Benchmark

```shell
vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
```
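
Before benchmarking, a single request can confirm that the embeddings endpoint responds. The sketch below only builds the JSON body, using the `model` and `input` fields of the OpenAI-style embeddings API. The helper name `embedding_payload` is hypothetical, and the commented `curl` call assumes the server listens on `localhost:8000`.

```shell
# embedding_payload <model> <text>
# Builds a minimal JSON body for POST /v1/embeddings. No JSON escaping is done,
# so <text> must not contain double quotes or backslashes.
embedding_payload() {
  printf '{"model": "%s", "input": "%s"}\n' "$1" "$2"
}

embedding_payload Qwen/Qwen3-Embedding-8B "hello world"

# Assumed usage against a running server (host and port are assumptions):
# curl -s http://localhost:8000/v1/embeddings \
#   -H 'Content-Type: application/json' \
#   -d "$(embedding_payload Qwen/Qwen3-Embedding-8B 'hello world')"
```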

```shell
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=true
vllm bench serve \
  --model Qwen/Qwen3-Embedding-8B \
  --backend openai-embeddings \
  --endpoint /v1/embeddings \
  --dataset-name sharegpt \
  --num-prompts 10 \
  --dataset-path <your dataset path>/datasets/ShareGPT_V3_unfiltered_cleaned_split.json
```

```shell
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 0.18
Total input tokens: 1372
Request throughput (req/s): 56.32
Total Token throughput (tok/s): 7726.76
----------------End-to-end Latency----------------
Mean E2EL (ms): 154.06
Median E2EL (ms): 165.57
P99 E2EL (ms): 166.66
==================================================
```
