[Doc] Refact benchmark doc (#5173)

### What this PR does / why we need it? Refactor some outdated doc - vLLM version: v0.12.0 - vLLM main: ad32e3e19c Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-18 22:26:13 +08:00
parent 6cb76ecd02
commit 7d32371b7e
1 changed files with 177 additions and 133 deletions
--- a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
+++ b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
@@ -25,7 +25,6 @@ docker run --rm \
 -v /root/.cache:/root/.cache \
 -p 8000:8000 \
 -e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
 -it $IMAGE \
 /bin/bash
 ```
@@ -38,158 +37,203 @@ pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/si
 pip install -r benchmarks/requirements-bench.txt
 ```

-## 3. (Optional) Prepare model weights
-For faster running speed, we recommend downloading the model in advance：
+## 3. Run basic benchmarks
+This section describes how to run performance tests with vLLM's built-in benchmark suite.
+
+### 3.1 Dataset
+vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).
+
+<style>
+th {
+  min-width: 0 !important;
+}
+</style>
+
+| Dataset | Online | Offline | Data Path |
+|---------|--------|---------|-----------|
+| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
+| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
+| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
+| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
+| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
+| Random | ✅ | ✅ | `synthetic` |
+| RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
+| RandomForReranking | ✅ | ✅ | `synthetic` |
+| Prefix Repetition | ✅ | ✅ | `synthetic` |
+| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
+| HuggingFace-MMVU | ✅ | ✅ | `yale-nlp/MMVU` |
+| HuggingFace-InstructCoder | ✅ | ✅ | `likaixin/InstructCoder` |
+| HuggingFace-AIMO | ✅ | ✅ | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
+| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
+| HuggingFace-MTBench | ✅ | ✅ | `philschmid/mt-bench` |
+| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
+| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
+| Custom | ✅ | ✅ | Local file: `data.jsonl` |
+
+:::{note}
+The Data Path values of the `HuggingFace-*` rows above are Hugging Face dataset IDs.
+For these datasets, set `--dataset-name` to `hf`.
+When `--dataset-path` points to a local copy, set `--hf-name` to the dataset's Hugging Face ID, for example:

 ```bash
-modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
+--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
 ```

-You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
+:::
+
+### 3.2 Run basic benchmark
+
+#### 3.2.1 Online serving
+
+First start serving your model:

 ```bash
-[
-  {
-    "test_name": "latency_llama8B_tp1",
-    "parameters": {
-      "model": "your local model path",
-      "tensor_parallel_size": 1,
-      "load_format": "dummy",
-      "num_iters_warmup": 5,
-      "num_iters": 15
-    }
-  }
-]
+VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
 ```

-## 4. Run benchmark script
-Run benchmark script:
+Then run the benchmarking script:

 ```bash
-bash benchmarks/scripts/run-performance-benchmarks.sh
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+export VLLM_USE_MODELSCOPE=True
+vllm bench serve \
+  --backend vllm \
+  --model Qwen/Qwen3-8B \
+  --endpoint /v1/completions \
+  --dataset-name sharegpt \
+  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
+  --num-prompts 10
 ```

-After about 10 mins, the output is shown below:
+If successful, you will see the following output:

-```bash
-online serving:
-qps 1:
+```shell
 ============ Serving Benchmark Result ============
-Successful requests:                     200       
-Benchmark duration (s):                  212.77    
-Total input tokens:                      42659     
-Total generated tokens:                  43545     
-Request throughput (req/s):              0.94      
-Output token throughput (tok/s):         204.66    
-Total Token throughput (tok/s):          405.16    
+Successful requests:                     10        
+Failed requests:                         0         
+Benchmark duration (s):                  19.92     
+Total input tokens:                      1374      
+Total generated tokens:                  2663      
+Request throughput (req/s):              0.50      
+Output token throughput (tok/s):         133.67    
+Peak output token throughput (tok/s):    312.00    
+Peak concurrent requests:                10.00     
+Total Token throughput (tok/s):          202.64    
 ---------------Time to First Token----------------
-Mean TTFT (ms):                          104.14    
-Median TTFT (ms):                        102.22    
-P99 TTFT (ms):                           153.82    
+Mean TTFT (ms):                          127.10    
+Median TTFT (ms):                        136.29    
+P99 TTFT (ms):                           137.83    
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          38.78     
-Median TPOT (ms):                        38.70     
-P99 TPOT (ms):                           48.03     
+Mean TPOT (ms):                          25.85     
+Median TPOT (ms):                        25.78     
+P99 TPOT (ms):                           26.64     
 ---------------Inter-token Latency----------------
-Mean ITL (ms):                           38.46     
-Median ITL (ms):                         36.96     
-P99 ITL (ms):                            75.03     
+Mean ITL (ms):                           25.78     
+Median ITL (ms):                         25.74     
+P99 ITL (ms):                            28.85     
 ==================================================
-
-qps 4:
-============ Serving Benchmark Result ============
-Successful requests:                     200       
-Benchmark duration (s):                  72.55     
-Total input tokens:                      42659     
-Total generated tokens:                  43545     
-Request throughput (req/s):              2.76      
-Output token throughput (tok/s):         600.24    
-Total Token throughput (tok/s):          1188.27   
---------------Time to First Token----------------
-Mean TTFT (ms):                          115.62    
-Median TTFT (ms):                        109.39    
-P99 TTFT (ms):                           169.03    
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          51.48     
-Median TPOT (ms):                        52.40     
-P99 TPOT (ms):                           69.41     
---------------Inter-token Latency----------------
-Mean ITL (ms):                           50.47     
-Median ITL (ms):                         43.95     
-P99 ITL (ms):                            130.29    
-==================================================
-
-qps 16:
-============ Serving Benchmark Result ============
-Successful requests:                     200       
-Benchmark duration (s):                  47.82     
-Total input tokens:                      42659     
-Total generated tokens:                  43545     
-Request throughput (req/s):              4.18      
-Output token throughput (tok/s):         910.62    
-Total Token throughput (tok/s):          1802.70   
---------------Time to First Token----------------
-Mean TTFT (ms):                          128.50    
-Median TTFT (ms):                        128.36    
-P99 TTFT (ms):                           187.87    
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          83.60     
-Median TPOT (ms):                        77.85     
-P99 TPOT (ms):                           165.90    
---------------Inter-token Latency----------------
-Mean ITL (ms):                           65.72     
-Median ITL (ms):                         54.84     
-P99 ITL (ms):                            289.63    
-==================================================
-
-qps inf:
-============ Serving Benchmark Result ============
-Successful requests:                     200       
-Benchmark duration (s):                  41.26     
-Total input tokens:                      42659     
-Total generated tokens:                  43545     
-Request throughput (req/s):              4.85      
-Output token throughput (tok/s):         1055.44   
-Total Token throughput (tok/s):          2089.40   
---------------Time to First Token----------------
-Mean TTFT (ms):                          3394.37   
-Median TTFT (ms):                        3359.93   
-P99 TTFT (ms):                           3540.93   
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          66.28     
-Median TPOT (ms):                        64.19     
-P99 TPOT (ms):                           97.66     
---------------Inter-token Latency----------------
-Mean ITL (ms):                           56.62     
-Median ITL (ms):                         55.69     
-P99 ITL (ms):                            82.90     
-==================================================
-
-offline:
-latency:
-Avg latency: 4.944929537673791 seconds
-10% percentile latency: 4.894104263186454 seconds
-25% percentile latency: 4.909652255475521 seconds
-50% percentile latency: 4.932477846741676 seconds
-75% percentile latency: 4.9608619548380375 seconds
-90% percentile latency: 5.035418218374252 seconds
-99% percentile latency: 5.052476694583893 seconds
-
-throughput:
-Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
-Total num prompt tokens:  42659
-Total num output tokens:  43545
 ```

-The result json files are generated into the path `benchmark/results`.
-These files contain detailed benchmarking results for further analysis.
+#### 3.2.2 Offline Throughput Benchmark

 ```bash
-.
-|-- latency_llama8B_tp1.json
-|-- serving_llama8B_tp1_qps_1.json
-|-- serving_llama8B_tp1_qps_16.json
-|-- serving_llama8B_tp1_qps_4.json
-|-- serving_llama8B_tp1_qps_inf.json
-`-- throughput_llama8B_tp1.json
+export VLLM_USE_MODELSCOPE=True
+vllm bench throughput \
+  --model Qwen/Qwen3-8B \
+  --dataset-name random \
+  --input-len 128 \
+  --output-len 128 \
+  --num-prompts 10
+```
+
+If successful, you will see the following output:
+
+```shell
+Processed prompts: 100%|█| 10/10 [00:03<00:00,  2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
+Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
+Total num prompt tokens:  1280
+Total num output tokens:  1280
+```
+
+#### 3.2.3 Multi-Modal Benchmark
+
+```shell
+export VLLM_USE_MODELSCOPE=True
+vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
+  --dtype bfloat16 \
+  --limit-mm-per-prompt '{"image": 1}' \
+  --allowed-local-media-path /path/to/sharegpt4v/images
+```
+
+```shell
+export HF_ENDPOINT="https://hf-mirror.com"
+vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
+--backend "openai-chat" \
+--dataset-name hf \
+--hf-split train \
+--endpoint "/v1/chat/completions" \
+--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
+--num-prompts 10 \
+--no-stream
+```
+
+```shell
+============ Serving Benchmark Result ============
+Successful requests:                     10        
+Failed requests:                         0         
+Benchmark duration (s):                  4.89      
+Total input tokens:                      7191      
+Total generated tokens:                  951       
+Request throughput (req/s):              2.05      
+Output token throughput (tok/s):         194.63    
+Peak output token throughput (tok/s):    290.00    
+Peak concurrent requests:                10.00     
+Total Token throughput (tok/s):          1666.35   
+---------------Time to First Token----------------
+Mean TTFT (ms):                          722.22    
+Median TTFT (ms):                        589.81    
+P99 TTFT (ms):                           1377.02   
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          44.13     
+Median TPOT (ms):                        34.58     
+P99 TPOT (ms):                           124.72    
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           33.14     
+Median ITL (ms):                         28.01     
+P99 ITL (ms):                            182.28    
+==================================================
+```
+
+#### 3.2.4 Embedding Benchmark
+
+```shell
+vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
+```
+
+```shell
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+export VLLM_USE_MODELSCOPE=True
+vllm bench serve \
+  --model Qwen/Qwen3-Embedding-8B \
+  --backend openai-embeddings \
+  --endpoint /v1/embeddings \
+  --dataset-name sharegpt \
+  --num-prompts 10 \
+  --dataset-path <your dataset path>/datasets/ShareGPT_V3_unfiltered_cleaned_split.json
+```
+
+```shell
+============ Serving Benchmark Result ============
+Successful requests:                     10        
+Failed requests:                         0         
+Benchmark duration (s):                  0.18      
+Total input tokens:                      1372      
+Request throughput (req/s):              56.32     
+Total Token throughput (tok/s):          7726.76   
+----------------End-to-end Latency----------------
+Mean E2EL (ms):                          154.06    
+Median E2EL (ms):                        165.57    
+P99 E2EL (ms):                           166.66    
+==================================================
 ```
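
The serving benchmark prints its results as `label: value` lines, so individual metrics can be pulled out of a saved log for comparison across runs. The following is a minimal sketch, not part of vLLM: the `bench_metric` helper and the `sample_log` file are hypothetical names introduced here for illustration, and the sample lines are copied from the online serving result shown in this doc.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vLLM): print the value of one metric line
# from saved `vllm bench serve` output.
# Usage: bench_metric "<metric label>" <log file>
bench_metric() {
  # Keep only lines that begin with "<label>:" and print their last field.
  awk -v label="$1" 'index($0, label ":") == 1 { print $NF }' "$2"
}

# Self-contained demo: a few lines copied from the sample result above.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
============ Serving Benchmark Result ============
Successful requests:                     10
Output token throughput (tok/s):         133.67
Peak output token throughput (tok/s):    312.00
Mean TTFT (ms):                          127.10
EOF

bench_metric "Mean TTFT (ms)" "$sample_log"                   # prints 127.10
bench_metric "Output token throughput (tok/s)" "$sample_log"  # prints 133.67
```

Matching on the start of the line keeps `Output token throughput` from also matching the `Peak output token throughput` line. For a real run, one option is to capture the benchmark output to a file first (for example by piping it through `tee`) and pass that file as the second argument.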