forked from EngineX-Ascend/enginex-ascend-910-vllm
v0.10.1rc1
175 benchmarks/README.md Normal file
@@ -0,0 +1,175 @@
# Introduction

This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.

# Overview

**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the full list of supported devices), with different models (coming soon).

- Latency tests
  - Input length: 32 tokens.
  - Output length: 128 tokens.
  - Batch size: fixed (8).
  - Models: Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: end-to-end latency (mean, median, p99).

- Throughput tests
  - Input length: randomly sample 200 prompts from the ShareGPT dataset (with fixed random seed).
  - Output length: the corresponding output length of these 200 prompts.
  - Batch size: dynamically determined by vllm to achieve maximum throughput.
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput.

- Serving tests
  - Input length: randomly sample 200 prompts from the ShareGPT dataset (with fixed random seed).
  - Output length: the corresponding output length of these 200 prompts.
  - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed); see the sketch after this list.
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput, TTFT (time to first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
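As a rough illustration (a minimal sketch, not the benchmark's actual code), arrival times at a given QPS can be drawn from a Poisson process with a fixed seed:

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative arrival times (in seconds) for a Poisson process at the given QPS."""
    rng = np.random.default_rng(seed)
    # Inter-arrival gaps of a Poisson process are exponentially distributed
    # with mean 1/qps; their running sum gives the arrival timestamps.
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

print(poisson_arrival_times(5, qps=4.0))
```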
**Benchmarking Duration**: about 800 seconds for a single model.

# Quick Use

## Prerequisites

Before running the benchmarks, ensure the following:

- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.

- Install the necessary dependencies for benchmarks:

  ```shell
  pip install -r benchmarks/requirements-bench.txt
  ```

- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights for the given model instead of downloading the weights from the internet, which can greatly reduce the benchmark time.

- To run a customized benchmark, feel free to add your own models and parameters to the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files. Let's take `Qwen2.5-VL-7B-Instruct` as an example:
```json
[
    {
        "test_name": "serving_qwen2_5vl_7B_tp1",
        "qps_list": [
            1,
            4,
            16,
            "inf"
        ],
        "server_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "trust_remote_code": "",
            "max_model_len": 16384
        },
        "client_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "backend": "openai-chat",
            "dataset_name": "hf",
            "hf_split": "train",
            "endpoint": "/v1/chat/completions",
            "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
            "num_prompts": 200
        }
    }
]
```
This JSON is parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
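For illustration only, here is a minimal Python sketch of this key-to-flag mapping (the helper name `json_to_args` is hypothetical; the real conversion is the `json2args` shell function in `benchmarks/scripts/run-performance-benchmarks.sh`):

```python
def json_to_args(params: dict) -> str:
    """Hypothetical mimic of the script's json2args: '_' -> '-', each pair -> '--key value'."""
    parts = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        # Keys with empty-string values (e.g. trust_remote_code) act as boolean switches.
        parts.append(flag if value == "" else f"{flag} {value}")
    return " ".join(parts)

server = {"model": "Qwen/Qwen2.5-VL-7B-Instruct", "tensor_parallel_size": 1, "trust_remote_code": ""}
print(json_to_args(server))
# --model Qwen/Qwen2.5-VL-7B-Instruct --tensor-parallel-size 1 --trust-remote-code
```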
- **Test Overview**
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): the test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).

- **Server Parameters**
  - Model: Qwen/Qwen2.5-VL-7B-Instruct
  - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
  - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
  - disable_log_stats: disables logging of performance statistics.
  - disable_log_requests: disables logging of individual requests.
  - Trust Remote Code: enabled (allows execution of model-specific custom code)
  - Max Model Length: 16,384 tokens (maximum context length supported by the model)

- **Client Parameters**
  - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
  - Backend: openai-chat (the client uses the OpenAI-compatible chat API format)
  - Dataset Source: Hugging Face (hf)
  - Dataset Split: train
  - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent; see the example request after this list)
  - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
  - Number of Prompts: 200 (the total number of prompts used during the test)
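For reference, a minimal sketch of a single request to this endpoint (it assumes a server already running on localhost:8000; the payload fields follow the OpenAI chat format, and the prompt text is made up):

```python
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "Describe this image in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```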
## Run benchmarks

### Use benchmark script

The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

```shell
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:

```shell
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis; a quick way to inspect them is sketched below.
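As a rough example (key names such as `request_throughput` and `avg_latency` are taken from the column mappings in `benchmarks/scripts/convert_json_to_markdown.py`; adjust to the keys actually present in your files):

```python
import json
from pathlib import Path

# Print a couple of representative metrics from each result file, when present.
for path in sorted(Path("benchmarks/results").glob("*.json")):
    result = json.loads(path.read_text())
    metrics = {k: result[k] for k in ("request_throughput", "avg_latency") if k in result}
    print(path.stem, metrics)
```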
### Use benchmark cli

For more flexible and customized use, a benchmark cli is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:

#### Online serving

1. Launch the server:

   ```shell
   vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
   ```
2. Run performance tests using the cli:

   ```shell
   vllm bench serve --model Qwen2.5-VL-7B-Instruct \
     --endpoint-type "openai-chat" --dataset-name hf \
     --hf-split train --endpoint "/v1/chat/completions" \
     --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
     --num-prompts 200 \
     --request-rate 16
   ```
#### Offline

- **Throughput**

  ```shell
  vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
    --dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 --backend vllm
  ```

- **Latency**

  ```shell
  vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
    --load-format dummy --num-iters-warmup 5 --num-iters 15
  ```
158 benchmarks/ops/ben_vocabparallelembedding.py Normal file
@@ -0,0 +1,158 @@
from typing import Tuple

import numpy as np
import pytest
import torch
import torch_npu  # noqa: F401
import vllm  # noqa: F401

import vllm_ascend.platform  # noqa: F401


def benchmark_npu(fn, num_iterations=100, num_warmup_iterations=50):
    """
    Benchmark function for NPU operations

    Args:
        fn: Function to benchmark
        num_iterations: Number of timing iterations
        num_warmup_iterations: Number of warmup iterations

    Returns:
        float: Minimum elapsed time in seconds
    """
    start = torch.npu.Event(enable_timing=True)
    end = torch.npu.Event(enable_timing=True)
    times = np.zeros(num_iterations + num_warmup_iterations)

    # Run iterations
    for i in range(num_warmup_iterations + num_iterations):
        with torch.no_grad():
            start.record()
            fn()  # Execute the function
            end.record()
        torch.npu.synchronize()
        times[i] = start.elapsed_time(end)

    # Remove warmup iterations and convert to seconds
    times = times[num_warmup_iterations:]
    elapsed_time = np.amin(times) / 1000
    return elapsed_time


def get_masked_input_and_mask_ref(
    input_: torch.Tensor,
    org_vocab_start_index: int,
    org_vocab_end_index: int,
    num_org_vocab_padding: int,
    added_vocab_start_index: int,
    added_vocab_end_index: int,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Reference implementation for verification"""
    org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
    added_vocab_mask = (input_ >= added_vocab_start_index) & (
        input_ < added_vocab_end_index
    )
    added_offset = (
        added_vocab_start_index
        - (org_vocab_end_index - org_vocab_start_index)
        - num_org_vocab_padding
    )
    valid_offset = (org_vocab_start_index * org_vocab_mask) + (
        added_offset * added_vocab_mask
    )
    vocab_mask = org_vocab_mask | added_vocab_mask
    masked_input = vocab_mask * (input_ - valid_offset)
    return masked_input, ~vocab_mask


DTYPES = [torch.int32]
SHAPES = [(3, 4, 5)]
DEVICES = [f"npu:{0}"]
SEEDS = [0]


@pytest.mark.parametrize("shape", SHAPES)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("device", DEVICES)
@pytest.mark.parametrize("seed", SEEDS)
@torch.inference_mode()
def test_get_masked_input_and_mask(
    shape: Tuple[int, ...],
    dtype: torch.dtype,
    device: str,
    seed: int,
) -> None:
    # Set random seed and device
    torch.manual_seed(seed)
    torch.set_default_device(device)

    # Generate random input tensor
    input_tensor = torch.randint(0, 1000, shape, dtype=dtype)

    # Test parameters
    test_case = {
        "org_start": 100,
        "org_end": 200,
        "padding": 0,
        "added_start": 300,
        "added_end": 400,
    }

    # Define reference function
    def ref_fn():
        return get_masked_input_and_mask_ref(
            input_tensor,
            test_case["org_start"],
            test_case["org_end"],
            test_case["padding"],
            test_case["added_start"],
            test_case["added_end"],
        )

    # Define custom function
    def custom_fn():
        return torch.ops._C.get_masked_input_and_mask(
            input_tensor,
            test_case["org_start"],
            test_case["org_end"],
            test_case["padding"],
            test_case["added_start"],
            test_case["added_end"],
        )

    # Get results for correctness testing
    ref_masked_input, ref_mask = ref_fn()
    custom_masked_input, custom_mask = custom_fn()

    # Benchmark both implementations
    ref_time = benchmark_npu(ref_fn)
    custom_time = benchmark_npu(custom_fn)

    # Print performance results
    print("\nPerformance Results:")
    print(f"Reference implementation: {ref_time * 1000:.3f} ms")
    print(f"Custom implementation: {custom_time * 1000:.3f} ms")
    print(f"Speedup: {ref_time / custom_time:.2f}x")

    # Compare results for correctness
    ref_masked_input = ref_masked_input.to(dtype)
    print("\nResults comparison:")
    print("custom_masked_input:", custom_masked_input)
    print("ref_masked_input:", ref_masked_input)
    print("custom_mask:", custom_mask)
    print("ref_mask:", ref_mask)
    torch.testing.assert_close(
        custom_masked_input,
        ref_masked_input,
        rtol=1e-5,
        atol=1e-5,
        msg=f"Masked input mismatch for case: {test_case}",
    )
    torch.testing.assert_close(
        custom_mask,
        ref_mask,
        rtol=1e-5,
        atol=1e-5,
        msg=f"Mask mismatch for case: {test_case}",
    )
4 benchmarks/requirements-bench.txt Normal file
@@ -0,0 +1,4 @@
pandas
datasets
modelscope
tabulate
188 benchmarks/scripts/convert_json_to_markdown.py Normal file
@@ -0,0 +1,188 @@
import argparse
import json
import os
from pathlib import Path

import pandas as pd
from tabulate import tabulate

CUR_PATH = Path(__file__).parent.resolve()
# latency results and the keys that will be printed into markdown
latency_results = []
latency_column_mapping = {
    "test_name": "Test name",
    "avg_latency": "Mean latency (ms)",
    "P50": "Median latency (ms)",
    "P99": "P99 latency (ms)",
}

# throughput tests and the keys that will be printed into markdown
throughput_results = []
throughput_results_column_mapping = {
    "test_name": "Test name",
    "num_requests": "Num of reqs",
    "total_num_tokens": "Total num of tokens",
    "elapsed_time": "Elapsed time (s)",
    "requests_per_second": "Tput (req/s)",
    "tokens_per_second": "Tput (tok/s)",
}

# serving results and the keys that will be printed into markdown
serving_results = []
serving_column_mapping = {
    "test_name": "Test name",
    "request_rate": "Request rate (req/s)",
    "request_throughput": "Tput (req/s)",
    "output_throughput": "Output Tput (tok/s)",
    "median_ttft_ms": "TTFT (ms)",
    "median_tpot_ms": "TPOT (ms)",
    "median_itl_ms": "ITL (ms)",
}


def read_markdown(file):
    if os.path.exists(file):
        with open(file) as f:
            return f.read() + "\n"
    else:
        return f"{file} not found.\n"


def results_to_json(latency, throughput, serving):
    return json.dumps(
        {
            "latency": latency.to_dict(),
            "throughput": throughput.to_dict(),
            "serving": serving.to_dict(),
        }
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process the results of the benchmark tests."
    )
    parser.add_argument(
        "--results_folder",
        type=str,
        default="../results/",
        help="The folder where the benchmark results are stored.",
    )
    parser.add_argument(
        "--output_folder",
        type=str,
        default="../results/",
        help="The folder where the markdown report is written.",
    )
    parser.add_argument(
        "--markdown_template",
        type=str,
        default="./perf_result_template.md",
        help="The template file for the markdown report.",
    )
    parser.add_argument(
        "--tag", default="main", help="Tag to be used for release message."
    )
    parser.add_argument(
        "--commit_id", default="", help="Commit ID to be used for release message."
    )

    args = parser.parse_args()
    results_folder = (CUR_PATH / args.results_folder).resolve()
    output_folder = (CUR_PATH / args.output_folder).resolve()
    markdown_template = (CUR_PATH / args.markdown_template).resolve()

    # collect results
    for test_file in results_folder.glob("*.json"):
        with open(test_file) as f:
            raw_result = json.loads(f.read())

        if "serving" in str(test_file):
            # this result is generated via `benchmark_serving.py`

            # update the test name of this result
            raw_result.update({"test_name": test_file.stem})

            # add the result to raw_result
            serving_results.append(raw_result)
            continue

        elif "latency" in str(test_file):
            # this result is generated via `benchmark_latency.py`

            # update the test name of this result
            raw_result.update({"test_name": test_file.stem})

            # get different percentiles
            for perc in [10, 25, 50, 75, 90, 99]:
                # Multiply by 1000 to convert the time unit from s to ms
                raw_result.update(
                    {f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
                )
            raw_result["avg_latency"] = raw_result["avg_latency"] * 1000

            # add the result to raw_result
            latency_results.append(raw_result)
            continue

        elif "throughput" in str(test_file):
            # this result is generated via `benchmark_throughput.py`

            # update the test name of this result
            raw_result.update({"test_name": test_file.stem})

            # add the result to raw_result
            throughput_results.append(raw_result)
            continue

        print(f"Skipping {test_file}")
    serving_results.sort(key=lambda x: (len(x["test_name"]), x["test_name"]))

    latency_results = pd.DataFrame.from_dict(latency_results)
    serving_results = pd.DataFrame.from_dict(serving_results)
    throughput_results = pd.DataFrame.from_dict(throughput_results)

    raw_results_json = results_to_json(
        latency_results, throughput_results, serving_results
    )

    # remap the keys for visualization purposes
    if not latency_results.empty:
        latency_results = latency_results[list(latency_column_mapping.keys())].rename(
            columns=latency_column_mapping
        )
    if not serving_results.empty:
        serving_results = serving_results[list(serving_column_mapping.keys())].rename(
            columns=serving_column_mapping
        )
    if not throughput_results.empty:
        throughput_results = throughput_results[
            list(throughput_results_column_mapping.keys())
        ].rename(columns=throughput_results_column_mapping)

    processed_results_json = results_to_json(
        latency_results, throughput_results, serving_results
    )

    # get markdown tables
    latency_md_table = tabulate(
        latency_results, headers="keys", tablefmt="pipe", showindex=False
    )
    serving_md_table = tabulate(
        serving_results, headers="keys", tablefmt="pipe", showindex=False
    )
    throughput_md_table = tabulate(
        throughput_results, headers="keys", tablefmt="pipe", showindex=False
    )

    # document the result
    print(output_folder)
    with open(output_folder / "benchmark_results.md", "w") as f:
        results = read_markdown(markdown_template)
        results = results.format(
            latency_tests_markdown_table=latency_md_table,
            throughput_tests_markdown_table=throughput_md_table,
            serving_tests_markdown_table=serving_md_table,
            benchmarking_results_in_json_string=processed_results_json,
        )
        f.write(results)
31 benchmarks/scripts/perf_result_template.md Normal file
@@ -0,0 +1,31 @@
## Online serving tests

- Input length: randomly sample 200 prompts from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput, TTFT (median time to first token), ITL (median inter-token latency), TPOT (median time per output token).

{serving_tests_markdown_table}

## Offline tests

### Latency tests

- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: end-to-end latency.

{latency_tests_markdown_table}

### Throughput tests

- Input length: randomly sample 200 prompts from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput.

{throughput_tests_markdown_table}
321 benchmarks/scripts/run-performance-benchmarks.sh Normal file
@@ -0,0 +1,321 @@
#!/bin/bash
set -e

check_npus() {
    # shellcheck disable=SC2155
    declare -g npu_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | tr -d ' ')

    if [[ -z "$npu_count" || "$npu_count" -eq 0 ]]; then
        echo "Need at least 1 NPU to run benchmarking."
        exit 1
    else
        echo "Found NPU count: $npu_count"
    fi

    npu_type=$(npu-smi info | grep -E "^\| [0-9]+" | awk -F '|' '{print $2}' | awk '{$1=$1;print}' | awk '{print $2}')

    echo "NPU type is: $npu_type"
}

ensure_sharegpt_downloaded() {
    local FILE="/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json"
    local DIR
    DIR=$(dirname "$FILE")

    if [ ! -f "$FILE" ]; then
        echo "$FILE not found, downloading from hf-mirror ..."
        mkdir -p "$DIR"
        # with `set -e` in effect, check the wget result directly
        if ! wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json; then
            echo "Download failed!" >&2
            return 1
        fi
        echo "Download completed and saved to $FILE"
    else
        echo "$FILE already exists."
    fi
}

json2args() {
    # transforms the JSON string into command line args; '_' is replaced with '-'
    # example:
    # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
    # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
    local json_string=$1
    local args
    args=$(
        echo "$json_string" | jq -r '
            to_entries |
            map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
            join(" ")
        '
    )
    echo "$args"
}

wait_for_server() {
    local waited=0
    local timeout_sec=1200

    while (( waited < timeout_sec )); do
        if curl -s -X GET localhost:8000/health > /dev/null; then
            return 0
        fi
        echo "Waiting for vllm server to start..."
        sleep 1
        ((waited++))
    done

    echo "Timeout waiting for server"
    return 1
}

get_cur_npu_id() {
    npu-smi info -l | awk -F ':' '/NPU ID/ {print $2+0; exit}'
}

kill_npu_processes() {
    ps -aux
    lsof -t -i:8000 | xargs -r kill -9
    pgrep python3 | xargs -r kill -9

    sleep 4
    rm -rf ~/.config/vllm
}

update_json_field() {
    local json_file="$1"
    local field_name="$2"
    local field_value="$3"

    jq --arg value "$field_value" \
       --arg key "$field_name" \
       '.[$key] = $value' "$json_file" > "${json_file}.tmp" && \
    mv "${json_file}.tmp" "$json_file"
}

run_latency_tests() {
    # run latency tests using `benchmark_latency.py`
    # $1: a json file specifying latency test cases

    local latency_test_file
    latency_test_file=$1

    # Iterate over latency tests
    jq -c '.[]' "$latency_test_file" | while read -r params; do
        # get the test name
        test_name=$(echo "$params" | jq -r '.test_name')
        if [[ ! "$test_name" =~ ^latency_ ]]; then
            echo "In latency-tests.json, test_name must start with \"latency_\"."
            exit 1
        fi

        # if TEST_SELECTOR is set, only run the test cases that match the selector
        if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
            echo "Skip test case $test_name."
            continue
        fi

        # get arguments
        latency_params=$(echo "$params" | jq -r '.parameters')
        latency_args=$(json2args "$latency_params")

        latency_command="vllm bench latency \
            --output-json $RESULTS_FOLDER/${test_name}.json \
            $latency_args"

        echo "Running test case $test_name"
        echo "Latency command: $latency_command"

        # run the benchmark
        eval "$latency_command"
        # echo model_name to result file
        model_name=$(echo "$latency_params" | jq -r '.model')
        update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
        kill_npu_processes
    done
}

run_throughput_tests() {
    # run throughput tests using `benchmark_throughput.py`
    # $1: a json file specifying throughput test cases

    local throughput_test_file
    throughput_test_file=$1

    # Iterate over throughput tests
    jq -c '.[]' "$throughput_test_file" | while read -r params; do
        # get the test name
        test_name=$(echo "$params" | jq -r '.test_name')
        if [[ ! "$test_name" =~ ^throughput_ ]]; then
            echo "In throughput-tests.json, test_name must start with \"throughput_\"."
            exit 1
        fi

        # if TEST_SELECTOR is set, only run the test cases that match the selector
        if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
            echo "Skip test case $test_name."
            continue
        fi

        # get arguments
        throughput_params=$(echo "$params" | jq -r '.parameters')
        throughput_args=$(json2args "$throughput_params")

        throughput_command="vllm bench throughput \
            --output-json $RESULTS_FOLDER/${test_name}.json \
            $throughput_args"

        echo "Running test case $test_name"
        echo "Throughput command: $throughput_command"

        # run the benchmark
        eval "$throughput_command"
        # echo model_name to result file
        model_name=$(echo "$throughput_params" | jq -r '.model')
        update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
        kill_npu_processes
    done
}

run_serving_tests() {
    # run serving tests using `benchmark_serving.py`
    # $1: a json file specifying serving test cases

    local serving_test_file
    serving_test_file=$1

    # Iterate over serving tests
    jq -c '.[]' "$serving_test_file" | while read -r params; do
        # get the test name
        test_name=$(echo "$params" | jq -r '.test_name')
        if [[ ! "$test_name" =~ ^serving_ ]]; then
            echo "In serving-tests.json, test_name must start with \"serving_\"."
            exit 1
        fi

        # if TEST_SELECTOR is set, only run the test cases that match the selector
        if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
            echo "Skip test case $test_name."
            continue
        fi

        # get client and server arguments
        server_params=$(echo "$params" | jq -r '.server_parameters')
        client_params=$(echo "$params" | jq -r '.client_parameters')
        server_args=$(json2args "$server_params")
        client_args=$(json2args "$client_params")
        qps_list=$(echo "$params" | jq -r '.qps_list')
        qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
        echo "Running over qps list $qps_list"

        # check if server model and client model are aligned
        server_model=$(echo "$server_params" | jq -r '.model')
        client_model=$(echo "$client_params" | jq -r '.model')
        if [[ $server_model != "$client_model" ]]; then
            echo "Server model and client model must be the same. Skip testcase $test_name."
            continue
        fi

        server_command="python3 \
            -m vllm.entrypoints.openai.api_server \
            $server_args"

        # run the server
        echo "Running test case $test_name"
        echo "Server command: $server_command"
        bash -c "$server_command" &
        server_pid=$!

        # wait until the server is alive
        if wait_for_server; then
            echo ""
            echo "vllm server is up and running."
        else
            echo ""
            echo "vllm failed to start within the timeout period."
        fi

        # iterate over different QPS
        for qps in $qps_list; do
            # remove the surrounding single quote from qps
            if [[ "$qps" == *"inf"* ]]; then
                echo "qps was $qps"
                qps="inf"
                echo "now qps is $qps"
            fi

            new_test_name=$test_name"_qps_"$qps

            client_command="vllm bench serve \
                --save-result \
                --result-dir $RESULTS_FOLDER \
                --result-filename ${new_test_name}.json \
                --request-rate $qps \
                $client_args"

            echo "Running test case $test_name with qps $qps"
            echo "Client command: $client_command"

            bash -c "$client_command"
        done

        # clean up
        kill -9 $server_pid
        kill_npu_processes
    done
}

cleanup() {
    rm -rf ./vllm_benchmarks
}

cleanup_on_error() {
    echo "An error occurred. Cleaning up results folder..."
    rm -rf $RESULTS_FOLDER
}

main() {
    START_TIME=$(date +%s)
    check_npus

    # dependencies
    (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
    (which jq) || (apt-get update && apt-get -y install jq)
    (which lsof) || (apt-get update && apt-get install -y lsof)

    # get the current IP address, required by benchmark_serving.py
    # shellcheck disable=SC2155
    export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
    # turn off the reporting of the status of each request, to clean up the terminal output
    export VLLM_LOG_LEVEL="WARNING"

    # set env
    export VLLM_USE_MODELSCOPE=True

    # prepare for benchmarking
    cd benchmarks || exit 1
    trap cleanup EXIT

    QUICK_BENCHMARK_ROOT=./

    declare -g RESULTS_FOLDER=results
    mkdir -p $RESULTS_FOLDER

    trap cleanup_on_error ERR
    ensure_sharegpt_downloaded
    # benchmarks
    run_serving_tests $QUICK_BENCHMARK_ROOT/tests/serving-tests.json
    run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
    run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json

    END_TIME=$(date +%s)
    ELAPSED_TIME=$((END_TIME - START_TIME))
    echo "Total execution time: $ELAPSED_TIME seconds"
}

main "$@"
23 benchmarks/tests/latency-tests.json Normal file
@@ -0,0 +1,23 @@
[
    {
        "test_name": "latency_qwen3_8B_tp1",
        "parameters": {
            "model": "Qwen/Qwen3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "max_model_len": 16384,
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    },
    {
        "test_name": "latency_qwen2_5_7B_tp1",
        "parameters": {
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]
77 benchmarks/tests/serving-tests.json Normal file
@@ -0,0 +1,77 @@
[
    {
        "test_name": "serving_qwen2_5vl_7B_tp1",
        "qps_list": [
            1,
            4,
            16,
            "inf"
        ],
        "server_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "trust_remote_code": "",
            "max_model_len": 16384
        },
        "client_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "endpoint_type": "openai-chat",
            "dataset_name": "hf",
            "hf_split": "train",
            "endpoint": "/v1/chat/completions",
            "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_qwen3_8B_tp1",
        "qps_list": [
            1,
            4,
            16,
            "inf"
        ],
        "server_parameters": {
            "model": "Qwen/Qwen3-8B",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "Qwen/Qwen3-8B",
            "endpoint_type": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_qwen2_5_7B_tp1",
        "qps_list": [
            1,
            4,
            16,
            "inf"
        ],
        "server_parameters": {
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "endpoint_type": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    }
]
38 benchmarks/tests/throughput-tests.json Normal file
@@ -0,0 +1,38 @@
[
    {
        "test_name": "throughput_qwen3_8B_tp1",
        "parameters": {
            "model": "Qwen/Qwen3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200,
            "backend": "vllm"
        }
    },
    {
        "test_name": "throughput_qwen2_5vl_7B_tp1",
        "parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "tensor_parallel_size": 1,
            "backend": "vllm-chat",
            "dataset_name": "hf",
            "hf_split": "train",
            "max_model_len": 16384,
            "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
            "num_prompts": 200
        }
    },
    {
        "test_name": "throughput_qwen2_5_7B_tp1",
        "parameters": {
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200,
            "backend": "vllm"
        }
    }
]