v0.10.1rc1

2025-09-09 09:40:35 +08:00
parent d6f6ef41fe
commit 9149384e03
432 changed files with 84698 additions and 1 deletions
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -0,0 +1,175 @@
+# Introduction
+This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
+
+# Overview
+**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with support for more models coming soon.
+- Latency tests
+    - Input length: 32 tokens.
+    - Output length: 128 tokens.
+    - Batch size: fixed (8).
+    - Models: Qwen2.5-7B-Instruct, Qwen3-8B.
+    - Evaluation metrics: end-to-end latency (mean, median, p99).
+
+- Throughput tests
+    - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
+    - Output length: the corresponding output length of these 200 prompts.
+    - Batch size: dynamically determined by vllm to achieve maximum throughput.
+    - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
+    - Evaluation metrics: throughput.
+- Serving tests
+    - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
+    - Output length: the corresponding output length of these 200 prompts.
+    - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
+    - **Average QPS (queries per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
+    - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
+    - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
+
+**Benchmarking Duration**: about 800 seconds for a single model.
+
+# Quick Use
+## Prerequisites
+Before running the benchmarks, ensure the following:
+
+- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
+
+- Install necessary dependencies for benchmarks:
+  
+  ```shell
+  pip install -r benchmarks/requirements-bench.txt
+  ```
+  
+- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. It constructs random weights based on the specified model without downloading the weights from the internet, which can greatly reduce the benchmark time.
+- To customize a benchmark run, add your own models and parameters to the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files. Take `Qwen2.5-VL-7B-Instruct` as an example:
+
+  ```json
+  [
+  {
+    "test_name": "serving_qwen2_5vl_7B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "trust_remote_code": "",
+      "max_model_len": 16384
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "endpoint_type": "openai-chat",
+      "dataset_name": "hf",
+      "hf_split": "train",
+      "endpoint": "/v1/chat/completions",
+      "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
+      "num_prompts": 200
+    }
+  }
+  ]
+  ```
+  
+The benchmark script parses this JSON into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. For details on the available parameters, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
+
+  - **Test Overview**
+     - Test Name: serving_qwen2_5vl_7B_tp1
+
+     - Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
+
+  - Server Parameters
+     - Model: Qwen/Qwen2.5-VL-7B-Instruct
+
+     - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
+
+     - Swap Space: 16 GiB (CPU swap space per device, used to hold data swapped out of device memory)
+
+     - disable_log_stats: disables logging of performance statistics.
+
+     - disable_log_requests: disables logging of individual requests.
+
+     - Trust Remote Code: enabled (allows execution of model-specific custom code)
+
+     - Max Model Length: 16,384 tokens (maximum context length supported by the model)
+
+  - Client Parameters
+
+     - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
+
+     - Endpoint Type: openai-chat (the client uses the OpenAI-compatible chat API format; passed to `vllm bench serve` as `--endpoint-type`)
+
+     - Dataset Source: Hugging Face (hf)
+
+     - Dataset Split: train
+
+     - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
+
+     - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
+
+     - Number of Prompts: 200 (the total number of prompts used during the test)
+
+## Run benchmarks
+
+### Use benchmark script
+The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command from the vllm-ascend root directory:
+
+```shell
+bash benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:
+
+```shell
+.
+|-- serving_qwen2_5_7B_tp1_qps_1.json
+|-- serving_qwen2_5_7B_tp1_qps_16.json
+|-- serving_qwen2_5_7B_tp1_qps_4.json
+|-- serving_qwen2_5_7B_tp1_qps_inf.json
+|-- latency_qwen2_5_7B_tp1.json
+|-- throughput_qwen2_5_7B_tp1.json
+```
+
+These files contain detailed benchmarking results for further analysis.
+
+### Use benchmark cli
+
+For more flexible and customized use, a benchmark CLI is also provided to run online and offline benchmarks.
+Similarly, take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
+#### Online serving
+1. Launch the server:
+
+    ```shell
+    vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
+    ```
+
+2. Run performance tests using the CLI:
+  
+    ```shell
+    vllm bench serve --model Qwen2.5-VL-7B-Instruct \
+    --endpoint-type "openai-chat" --dataset-name hf \
+    --hf-split train --endpoint "/v1/chat/completions" \
+    --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
+    --num-prompts 200 \
+    --request-rate 16
+    ```
+
+#### Offline
+- **Throughput**
+
+  ```shell
+  vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
+  --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
+  --dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
+  --num-prompts 200 --backend vllm
+  ```
+
+- **Latency**
+  
+  ```shell
+  vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
+  --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
+  --load-format dummy --num-iters-warmup 5 --num-iters 15
+  ```
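The serving tests described in the README above draw request arrival times from a Poisson process with a fixed random seed. A minimal sketch of such a schedule follows, assuming exponentially distributed gaps between requests. The helper name `poisson_arrival_times` and its parameters are hypothetical and are not part of vllm or vllm-ascend.

```python
import math
import random


def poisson_arrival_times(num_requests, qps, seed=0):
    """Return arrival offsets (seconds) for num_requests requests.

    Hypothetical helper for illustration only; not the vllm implementation.
    """
    if math.isinf(qps):
        # QPS = inf: every request arrives at once.
        return [0.0] * num_requests
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    arrivals, now = [], 0.0
    for _ in range(num_requests):
        # Gaps in a Poisson process are exponential with mean 1 / qps.
        now += rng.expovariate(qps)
        arrivals.append(now)
    return arrivals


if __name__ == "__main__":
    schedule = poisson_arrival_times(200, qps=4, seed=0)
    print(f"last of {len(schedule)} requests arrives after {schedule[-1]:.1f} s")
```

Because the seed is fixed, two runs at the same QPS replay the same arrival pattern, which keeps results comparable across commits.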
--- a/benchmarks/ops/ben_vocabparallelembedding.py
+++ b/benchmarks/ops/ben_vocabparallelembedding.py
@@ -0,0 +1,158 @@
+from typing import Tuple
+
+import numpy as np
+import pytest
+import torch
+import torch_npu  # noqa: F401
+import vllm  # noqa: F401
+
+import vllm_ascend.platform  # noqa: F401
+
+
+def benchmark_npu(fn, num_iterations=100, num_warmup_iterations=50):
+    """
+    Benchmark function for NPU operations
+
+    Args:
+        fn: Function to benchmark
+        num_iterations: Number of timing iterations
+        num_warmup_iterations: Number of warmup iterations
+
+    Returns:
+        float: Minimum elapsed time in seconds
+    """
+    start = torch.npu.Event(enable_timing=True)
+    end = torch.npu.Event(enable_timing=True)
+    times = np.zeros(num_iterations + num_warmup_iterations)
+
+    # Run iterations
+    for i in range(num_warmup_iterations + num_iterations):
+        with torch.no_grad():
+            start.record()
+            fn()  # Execute the function
+            end.record()
+        torch.npu.synchronize()
+        times[i] = start.elapsed_time(end)
+
+    # Remove warmup iterations and convert to seconds
+    times = times[num_warmup_iterations:]
+    elapsed_time = np.amin(times) / 1000
+    return elapsed_time
+
+
+def get_masked_input_and_mask_ref(
+    input_: torch.Tensor,
+    org_vocab_start_index: int,
+    org_vocab_end_index: int,
+    num_org_vocab_padding: int,
+    added_vocab_start_index: int,
+    added_vocab_end_index: int,
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """Reference implementation for verification"""
+    org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
+    added_vocab_mask = (input_ >= added_vocab_start_index) & (
+        input_ < added_vocab_end_index
+    )
+    added_offset = (
+        added_vocab_start_index
+        - (org_vocab_end_index - org_vocab_start_index)
+        - num_org_vocab_padding
+    )
+    valid_offset = (org_vocab_start_index * org_vocab_mask) + (
+        added_offset * added_vocab_mask
+    )
+    vocab_mask = org_vocab_mask | added_vocab_mask
+    masked_input = vocab_mask * (input_ - valid_offset)
+    return masked_input, ~vocab_mask
+
+
+DTYPES = [torch.int32]
+SHAPES = [(3, 4, 5)]
+DEVICES = ["npu:0"]
+SEEDS = [0]
+
+
+@pytest.mark.parametrize("shape", SHAPES)
+@pytest.mark.parametrize("dtype", DTYPES)
+@pytest.mark.parametrize("device", DEVICES)
+@pytest.mark.parametrize("seed", SEEDS)
+@torch.inference_mode()
+def test_get_masked_input_and_mask(
+    shape: Tuple[int, ...],
+    dtype: torch.dtype,
+    device: str,
+    seed: int,
+) -> None:
+    # Set random seed and device
+    torch.manual_seed(seed)
+    torch.set_default_device(device)
+
+    # Generate random input tensor
+    input_tensor = torch.randint(0, 1000, shape, dtype=dtype)
+
+    # Test parameters
+    test_case = {
+        "org_start": 100,
+        "org_end": 200,
+        "padding": 0,
+        "added_start": 300,
+        "added_end": 400,
+    }
+
+    # Define reference function
+    def ref_fn():
+        return get_masked_input_and_mask_ref(
+            input_tensor,
+            test_case["org_start"],
+            test_case["org_end"],
+            test_case["padding"],
+            test_case["added_start"],
+            test_case["added_end"],
+        )
+
+    # Define custom function
+    def custom_fn():
+        return torch.ops._C.get_masked_input_and_mask(
+            input_tensor,
+            test_case["org_start"],
+            test_case["org_end"],
+            test_case["padding"],
+            test_case["added_start"],
+            test_case["added_end"],
+        )
+
+    # Get results for correctness testing
+    ref_masked_input, ref_mask = ref_fn()
+    custom_masked_input, custom_mask = custom_fn()
+
+    # Benchmark both implementations
+    ref_time = benchmark_npu(ref_fn)
+    custom_time = benchmark_npu(custom_fn)
+
+    # Print performance results
+    print("\nPerformance Results:")
+    print(f"Reference implementation: {ref_time * 1000:.3f} ms")
+    print(f"Custom implementation: {custom_time * 1000:.3f} ms")
+    print(f"Speedup: {ref_time / custom_time:.2f}x")
+
+    # Compare results for correctness
+    ref_masked_input = ref_masked_input.to(dtype)
+    print("\nResults comparison:")
+    print("custom_masked_input:", custom_masked_input)
+    print("ref_masked_input:", ref_masked_input)
+    print("custom_mask:", custom_mask)
+    print("ref_mask:", ref_mask)
+    torch.testing.assert_close(
+        custom_masked_input,
+        ref_masked_input,
+        rtol=1e-5,
+        atol=1e-5,
+        msg=f"Masked input mismatch for case: {test_case}",
+    )
+    torch.testing.assert_close(
+        custom_mask,
+        ref_mask,
+        rtol=1e-5,
+        atol=1e-5,
+        msg=f"Mask mismatch for case: {test_case}",
+    )
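The reference implementation `get_masked_input_and_mask_ref` above can be hard to follow in tensor form. The sketch below applies the same arithmetic to a single token id in plain Python, assuming the original and added vocabulary ranges do not overlap. The function name `mask_token` is hypothetical and exists only for this illustration.

```python
def mask_token(token, org_start, org_end, padding, added_start, added_end):
    """Scalar version of the reference masking logic (illustration only).

    Returns (local_index, is_masked_out), assuming the two ranges are disjoint.
    """
    # Offset that places added-vocab tokens right after the padded original shard.
    added_offset = added_start - (org_end - org_start) - padding
    if org_start <= token < org_end:
        return token - org_start, False
    if added_start <= token < added_end:
        return token - added_offset, False
    # Token belongs to another shard: index 0 and the mask flag set.
    return 0, True


if __name__ == "__main__":
    # Same ranges as test_case in the benchmark above.
    for tok in (50, 150, 350):
        print(tok, mask_token(tok, 100, 200, 0, 300, 400))
```

With the ranges used in the test (original 100 to 200, added 300 to 400, no padding), token 150 maps to local index 50, token 350 maps to 150, and token 50 is masked out.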
--- a/benchmarks/requirements-bench.txt
+++ b/benchmarks/requirements-bench.txt
@@ -0,0 +1,4 @@
+pandas
+datasets
+modelscope
+tabulate
--- a/benchmarks/scripts/convert_json_to_markdown.py
+++ b/benchmarks/scripts/convert_json_to_markdown.py
@@ -0,0 +1,188 @@
+import argparse
+import json
+import os
+from pathlib import Path
+
+import pandas as pd
+from tabulate import tabulate
+
+CUR_PATH = Path(__file__).parent.resolve()
+# latency results and the keys that will be printed into markdown
+latency_results = []
+latency_column_mapping = {
+    "test_name": "Test name",
+    "avg_latency": "Mean latency (ms)",
+    "P50": "Median latency (ms)",
+    "P99": "P99 latency (ms)",
+}
+
+# throughput tests and the keys that will be printed into markdown
+throughput_results = []
+throughput_results_column_mapping = {
+    "test_name": "Test name",
+    "num_requests": "Num of reqs",
+    "total_num_tokens": "Total num of tokens",
+    "elapsed_time": "Elapsed time (s)",
+    "requests_per_second": "Tput (req/s)",
+    "tokens_per_second": "Tput (tok/s)",
+}
+
+# serving results and the keys that will be printed into markdown
+serving_results = []
+serving_column_mapping = {
+    "test_name": "Test name",
+    "request_rate": "Request rate (req/s)",
+    "request_throughput": "Tput (req/s)",
+    "output_throughput": "Output Tput (tok/s)",
+    "median_ttft_ms": "TTFT (ms)",
+    "median_tpot_ms": "TPOT (ms)",
+    "median_itl_ms": "ITL (ms)",
+}
+
+
+def read_markdown(file):
+    if os.path.exists(file):
+        with open(file) as f:
+            return f.read() + "\n"
+    else:
+        return f"{file} not found.\n"
+
+
+def results_to_json(latency, throughput, serving):
+    return json.dumps(
+        {
+            "latency": latency.to_dict(),
+            "throughput": throughput.to_dict(),
+            "serving": serving.to_dict(),
+        }
+    )
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Process the results of the benchmark tests."
+    )
+    parser.add_argument(
+        "--results_folder",
+        type=str,
+        default="../results/",
+        help="The folder where the benchmark results are stored.",
+    )
+    parser.add_argument(
+        "--output_folder",
+        type=str,
+        default="../results/",
+        help="The folder where the markdown report is written.",
+    )
+    parser.add_argument(
+        "--markdown_template",
+        type=str,
+        default="./perf_result_template.md",
+        help="The template file for the markdown report.",
+    )
+    parser.add_argument(
+        "--tag", default="main", help="Tag to be used for release message."
+    )
+    parser.add_argument(
+        "--commit_id", default="", help="Commit ID to be used for release message."
+    )
+
+    args = parser.parse_args()
+    results_folder = (CUR_PATH / args.results_folder).resolve()
+    output_folder = (CUR_PATH / args.output_folder).resolve()
+    markdown_template = (CUR_PATH / args.markdown_template).resolve()
+
+    # collect results
+    for test_file in results_folder.glob("*.json"):
+        with open(test_file) as f:
+            raw_result = json.loads(f.read())
+
+        if "serving" in test_file.name:
+            # this result is generated via `benchmark_serving.py`
+
+            # update the test name of this result
+            raw_result.update({"test_name": test_file.stem})
+
+            # add the result to serving_results
+            serving_results.append(raw_result)
+            continue
+
+        elif "latency" in test_file.name:
+            # this result is generated via `benchmark_latency.py`
+
+            # update the test name of this result
+            raw_result.update({"test_name": test_file.stem})
+
+            # get different percentiles
+            for perc in [10, 25, 50, 75, 90, 99]:
+                # Multiply 1000 to convert the time unit from s to ms
+                raw_result.update(
+                    {f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
+                )
+            raw_result["avg_latency"] = raw_result["avg_latency"] * 1000
+
+            # add the result to latency_results
+            latency_results.append(raw_result)
+            continue
+
+        elif "throughput" in test_file.name:
+            # this result is generated via `benchmark_throughput.py`
+
+            # update the test name of this result
+            raw_result.update({"test_name": test_file.stem})
+
+            # add the result to throughput_results
+            throughput_results.append(raw_result)
+            continue
+
+        print(f"Skipping {test_file}")
+    serving_results.sort(key=lambda x: (len(x["test_name"]), x["test_name"]))
+
+    latency_results = pd.DataFrame.from_dict(latency_results)
+    serving_results = pd.DataFrame.from_dict(serving_results)
+    throughput_results = pd.DataFrame.from_dict(throughput_results)
+
+    raw_results_json = results_to_json(
+        latency_results, throughput_results, serving_results
+    )
+
+    # remapping the key, for visualization purpose
+    if not latency_results.empty:
+        latency_results = latency_results[list(latency_column_mapping.keys())].rename(
+            columns=latency_column_mapping
+        )
+    if not serving_results.empty:
+        serving_results = serving_results[list(serving_column_mapping.keys())].rename(
+            columns=serving_column_mapping
+        )
+    if not throughput_results.empty:
+        throughput_results = throughput_results[
+            list(throughput_results_column_mapping.keys())
+        ].rename(columns=throughput_results_column_mapping)
+
+    processed_results_json = results_to_json(
+        latency_results, throughput_results, serving_results
+    )
+
+    # get markdown tables
+    latency_md_table = tabulate(
+        latency_results, headers="keys", tablefmt="pipe", showindex=False
+    )
+    serving_md_table = tabulate(
+        serving_results, headers="keys", tablefmt="pipe", showindex=False
+    )
+    throughput_md_table = tabulate(
+        throughput_results, headers="keys", tablefmt="pipe", showindex=False
+    )
+
+    # document the result
+    print(output_folder)
+    with open(output_folder / "benchmark_results.md", "w") as f:
+        results = read_markdown(markdown_template)
+        results = results.format(
+            latency_tests_markdown_table=latency_md_table,
+            throughput_tests_markdown_table=throughput_md_table,
+            serving_tests_markdown_table=serving_md_table,
+            benchmarking_results_in_json_string=processed_results_json,
+        )
+        f.write(results)
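The script above converts latency results from seconds to milliseconds, keeps only the mapped columns, and renders them as a pipe-style markdown table. The sketch below reproduces those steps without pandas or tabulate, as a dependency-free stand-in rather than the script's actual code path. The helpers `normalize_latency` and `to_pipe_table` are hypothetical, and the sample numbers are made up.

```python
LATENCY_COLUMNS = {
    "test_name": "Test name",
    "avg_latency": "Mean latency (ms)",
    "P50": "Median latency (ms)",
    "P99": "P99 latency (ms)",
}


def normalize_latency(raw, test_name):
    """Convert a raw latency result from seconds to milliseconds."""
    row = {"test_name": test_name, "avg_latency": raw["avg_latency"] * 1000}
    for perc, seconds in raw["percentiles"].items():
        row[f"P{perc}"] = seconds * 1000
    return row


def to_pipe_table(rows, column_mapping):
    """Render the mapped columns of rows as a pipe-style markdown table."""
    headers = list(column_mapping.values())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        cells = []
        for key in column_mapping:
            value = row[key]
            cells.append(f"{value:.1f}" if isinstance(value, float) else str(value))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample result, in seconds, shaped like the fields the script reads.
    raw = {"avg_latency": 0.25, "percentiles": {"50": 0.25, "99": 0.5}}
    print(to_pipe_table([normalize_latency(raw, "latency_demo")], LATENCY_COLUMNS))
```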
--- a/benchmarks/scripts/perf_result_template.md
+++ b/benchmarks/scripts/perf_result_template.md
@@ -0,0 +1,31 @@
+## Online serving tests
+
+- Input length: randomly sample 200 prompts from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets (with fixed random seed).
+- Output length: the corresponding output length of these 200 prompts.
+- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
+- **Average QPS (queries per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
+- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
+- Evaluation metrics: throughput, TTFT (median time to the first token), ITL (median inter-token latency), TPOT (median time per output token).
+
+{serving_tests_markdown_table}
+
+## Offline tests
+### Latency tests
+
+- Input length: 32 tokens.
+- Output length: 128 tokens.
+- Batch size: fixed (8).
+- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-7B-Instruct
+- Evaluation metrics: end-to-end latency.
+
+{latency_tests_markdown_table}
+
+### Throughput tests
+
+- Input length: randomly sample 200 prompts from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets (with fixed random seed).
+- Output length: the corresponding output length of these 200 prompts.
+- Batch size: dynamically determined by vllm to achieve maximum throughput.
+- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
+- Evaluation metrics: throughput.
+
+{throughput_tests_markdown_table}
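The template above is filled in with Python's `str.format`, so each `{name}` placeholder must match a keyword argument. A minimal sketch of that substitution follows, using a shortened stand-in template and a hypothetical `render_report` helper.

```python
# Shortened stand-in for perf_result_template.md (illustration only).
TEMPLATE = (
    "## Online serving tests\n\n{serving_tests_markdown_table}\n\n"
    "### Latency tests\n\n{latency_tests_markdown_table}\n\n"
    "### Throughput tests\n\n{throughput_tests_markdown_table}\n"
)


def render_report(template, serving, latency, throughput):
    """Fill the three table placeholders of the report template."""
    return template.format(
        serving_tests_markdown_table=serving,
        latency_tests_markdown_table=latency,
        throughput_tests_markdown_table=throughput,
        # Keyword arguments without a matching placeholder are ignored.
        benchmarking_results_in_json_string="{}",
    )


if __name__ == "__main__":
    print(render_report(TEMPLATE, "| s |", "| l |", "| t |"))
```

Two properties of `str.format` matter when editing the template: a placeholder with no matching keyword raises `KeyError`, and literal braces in the markdown must be written as `{{` and `}}`.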
--- a/benchmarks/scripts/run-performance-benchmarks.sh
+++ b/benchmarks/scripts/run-performance-benchmarks.sh
@@ -0,0 +1,321 @@
+#!/bin/bash
+set -e
+
+check_npus() {
+  # shellcheck disable=SC2155
+  declare -g npu_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | tr -d ' ')
+  
+  if [[ -z "$npu_count" || "$npu_count" -eq 0 ]]; then
+    echo "Need at least 1 NPU to run benchmarking."
+    exit 1
+  else
+    echo "Found NPU count: $npu_count"
+  fi
+
+  npu_type=$(npu-smi info | grep -E "^\| [0-9]+" | awk -F '|' '{print $2}' | awk '{$1=$1;print}' | awk '{print $2}')
+
+  echo "NPU type is: $npu_type"
+}
+
+ensure_sharegpt_downloaded() {
+  local FILE="/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json"
+  local DIR
+  DIR=$(dirname "$FILE")
+
+  if [ ! -f "$FILE" ]; then
+    echo "$FILE not found, downloading from hf-mirror ..."
+    mkdir -p "$DIR"
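+    # Test wget directly: under `set -e` a separate `$?` check would never run.
+    if ! wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json; then
+      echo "Download failed!" >&2
+      # Remove the partial file so the next run retries the download.
+      rm -f "$FILE"
+      return 1
+    fi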
+    echo "Download completed and saved to $FILE"
+  else
+    echo "$FILE already exists."
+  fi
+}
+
+json2args() {
+  # transforms the JSON string to command line args, and '_' is replaced with '-'
+  # example:
+  # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
+  # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
+  local json_string=$1
+  local args
+  args=$(
+    echo "$json_string" | jq -r '
+      to_entries |
+      map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
+      join(" ")
+    '
+  )
+  echo "$args"
+}
+
+wait_for_server() {
+  local waited=0
+  local timeout_sec=1200
+
+  while (( waited < timeout_sec )); do
+    if curl -s -X GET localhost:8000/health > /dev/null; then
+      return 0
+    fi
+    echo "Waiting for vllm server to start..."
+    sleep 1
+    ((waited++))
+  done
+
+  echo "Timeout waiting for server"
+  return 1
+}
+
+get_cur_npu_id() {
+    npu-smi info -l | awk -F ':' '/NPU ID/ {print $2+0; exit}'
+}
+
+kill_npu_processes() {
+  ps -aux
+  lsof -t -i:8000 | xargs -r kill -9
+  pgrep python3 | xargs -r kill -9
+  
+  sleep 4
+  rm -rf ~/.config/vllm
+
+}
+
+update_json_field() {
+  local json_file="$1"
+  local field_name="$2"
+  local field_value="$3"
+
+  jq --arg value "$field_value" \
+     --arg key "$field_name" \
+     '.[$key] = $value' "$json_file" > "${json_file}.tmp" && \
+     mv "${json_file}.tmp" "$json_file"
+}
+
+run_latency_tests() {
+  # run latency tests using `vllm bench latency`
+  # $1: a json file specifying latency test cases
+
+  local latency_test_file
+  latency_test_file=$1
+
+  # Iterate over latency tests
+  jq -c '.[]' "$latency_test_file" | while read -r params; do
+    # get the test name and make sure it carries the expected prefix
+    test_name=$(echo "$params" | jq -r '.test_name')
+    if [[ ! "$test_name" =~ ^latency_ ]]; then
+      echo "In latency-tests.json, test_name must start with \"latency_\"."
+      exit 1
+    fi
+
+    # if TEST_SELECTOR is set, only run the test cases that match the selector
+    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+      echo "Skip test case $test_name."
+      continue
+    fi
+
+    # get arguments
+    latency_params=$(echo "$params" | jq -r '.parameters')
+    latency_args=$(json2args "$latency_params")
+
+    latency_command="vllm bench latency \
+      --output-json $RESULTS_FOLDER/${test_name}.json \
+      $latency_args"
+
+    echo "Running test case $test_name"
+    echo "Latency command: $latency_command"
+
+    # run the benchmark
+    eval "$latency_command"
+    # echo model_name to result file
+    model_name=$(echo "$latency_params" | jq -r '.model')
+    update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
+    kill_npu_processes
+
+  done
+}
+
+run_throughput_tests() {
+  # run throughput tests using `vllm bench throughput`
+  # $1: a json file specifying throughput test cases
+
+  local throughput_test_file
+  throughput_test_file=$1
+
+  # Iterate over throughput tests
+  jq -c '.[]' "$throughput_test_file" | while read -r params; do
+    # get the test name and make sure it carries the expected prefix
+    test_name=$(echo "$params" | jq -r '.test_name')
+    if [[ ! "$test_name" =~ ^throughput_ ]]; then
+      echo "In throughput-tests.json, test_name must start with \"throughput_\"."
+      exit 1
+    fi
+
+    # if TEST_SELECTOR is set, only run the test cases that match the selector
+    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+      echo "Skip test case $test_name."
+      continue
+    fi
+
+    # get arguments
+    throughput_params=$(echo "$params" | jq -r '.parameters')
+    throughput_args=$(json2args "$throughput_params")
+
+    throughput_command="vllm bench throughput \
+      --output-json $RESULTS_FOLDER/${test_name}.json \
+      $throughput_args"
+
+    echo "Running test case $test_name"
+    echo "Throughput command: $throughput_command"
+
+    # run the benchmark
+    eval "$throughput_command"
+    # echo model_name to result file
+    model_name=$(echo "$throughput_params" | jq -r '.model')
+    update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
+    kill_npu_processes
+
+  done
+}
+
+run_serving_tests() {
+  # run serving tests using `vllm bench serve`
+  # $1: a json file specifying serving test cases
+
+  local serving_test_file
+  serving_test_file=$1
+
+  # Iterate over serving tests
+  jq -c '.[]' "$serving_test_file" | while read -r params; do
+    # get the test name and make sure it carries the expected prefix
+    test_name=$(echo "$params" | jq -r '.test_name')
+    if [[ ! "$test_name" =~ ^serving_ ]]; then
+      echo "In serving-tests.json, test_name must start with \"serving_\"."
+      exit 1
+    fi
+
+    # if TEST_SELECTOR is set, only run the test cases that match the selector
+    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+      echo "Skip test case $test_name."
+      continue
+    fi
+
+    # get client and server arguments
+    server_params=$(echo "$params" | jq -r '.server_parameters')
+    client_params=$(echo "$params" | jq -r '.client_parameters')
+    server_args=$(json2args "$server_params")
+    client_args=$(json2args "$client_params")
+    qps_list=$(echo "$params" | jq -r '.qps_list')
+    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
+    echo "Running over qps list $qps_list"
+
+    # check if server model and client model is aligned
+    server_model=$(echo "$server_params" | jq -r '.model')
+    client_model=$(echo "$client_params" | jq -r '.model')
+    if [[ $server_model != "$client_model" ]]; then
+      echo "Server model and client model must be the same. Skip testcase $test_name."
+      continue
+    fi
+
+    server_command="python3 \
+      -m vllm.entrypoints.openai.api_server \
+      $server_args"
+
+    # run the server
+    echo "Running test case $test_name"
+    echo "Server command: $server_command"
+    bash -c "$server_command" &
+    server_pid=$!
+
+    # wait until the server is alive
+    if wait_for_server; then
+      echo ""
+      echo "vllm server is up and running."
+    else
+      echo ""
+      echo "vllm failed to start within the timeout period."
+    fi
+
+    # iterate over different QPS
+    for qps in $qps_list; do
+      # remove the surrounding single quote from qps
+      if [[ "$qps" == *"inf"* ]]; then
+        echo "qps was $qps"
+        qps="inf"
+        echo "now qps is $qps"
+      fi
+
+      new_test_name=$test_name"_qps_"$qps
+
+      client_command="vllm bench serve \
+        --save-result \
+        --result-dir $RESULTS_FOLDER \
+        --result-filename ${new_test_name}.json \
+        --request-rate $qps \
+        $client_args"
+
+      echo "Running test case $test_name with qps $qps"
+      echo "Client command: $client_command"
+
+      bash -c "$client_command"
+    done
+
+    # clean up
+    kill -9 $server_pid
+    kill_npu_processes
+  done
+}
+
+cleanup() {
+  rm -rf ./vllm_benchmarks
+}
+
+cleanup_on_error() {
+  echo "An error occurred. Cleaning up results folder..."
+  rm -rf "$RESULTS_FOLDER"
+}
+
+main() {
+  START_TIME=$(date +%s)
+  check_npus
+  
+  # dependencies
+  (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
+  (which jq) || (apt-get update && apt-get -y install jq)
+  (which lsof) || (apt-get update && apt-get install -y lsof)
+
+  # get the current IP address, required by benchmark_serving.py
+  # shellcheck disable=SC2155
+  export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
+  # turn off the reporting of the status of each request, to clean up the terminal output
+  export VLLM_LOG_LEVEL="WARNING"
+  
+  # set env
+  export VLLM_USE_MODELSCOPE=True
+
+  # prepare for benchmarking
+  cd benchmarks || exit 1
+  trap cleanup EXIT
+
+  QUICK_BENCHMARK_ROOT=./
+
+  declare -g RESULTS_FOLDER=results
+  mkdir -p "$RESULTS_FOLDER"
+
+  trap cleanup_on_error ERR
+  ensure_sharegpt_downloaded
+  # benchmarks
+  run_serving_tests $QUICK_BENCHMARK_ROOT/tests/serving-tests.json
+  run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
+  run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json
+
+  END_TIME=$(date +%s)
+  ELAPSED_TIME=$((END_TIME - START_TIME))
+  echo "Total execution time: $ELAPSED_TIME seconds"
+
+}
+
+main "$@"
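The `json2args` function in the script above turns a JSON object into command-line flags with `jq`. The Python sketch below mirrors that transformation so the mapping is easier to inspect. It is an illustration of the same rule (replace `_` with `-`, append the value), not a replacement for the shell helper.

```python
import json
import shlex


def json2args(params):
    """Mirror the jq expression: '--' + key with '_' replaced by '-' + ' ' + value."""
    parts = []
    for key, value in params.items():
        # jq's tostring leaves strings as-is and JSON-encodes everything else.
        text = value if isinstance(value, str) else json.dumps(value)
        parts.append(f"--{key.replace('_', '-')} {text}")
    return " ".join(parts)


if __name__ == "__main__":
    server_params = {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "tensor_parallel_size": 1,
        "disable_log_stats": "",
        "max_model_len": 16384,
    }
    # shlex.split shows the tokens a shell would see after word splitting.
    print(shlex.split(json2args(server_params)))
```

An empty string value such as `"disable_log_stats": ""` yields a bare flag, because the trailing space disappears when the shell splits the command into words.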
--- a/benchmarks/tests/latency-tests.json
+++ b/benchmarks/tests/latency-tests.json
@@ -0,0 +1,23 @@
+[
+  {
+    "test_name": "latency_qwen3_8B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "max_model_len": 16384,
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  },
+  {
+    "test_name": "latency_qwen2_5_7B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  }
+]
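The benchmark script requires every entry in this file to have a `test_name` starting with `latency_`, and it skips entries that do not match the optional `TEST_SELECTOR` pattern. The sketch below applies the same two rules in Python. The helper `select_tests` is hypothetical, and `re.search` only approximates bash's `=~`, since the two regex dialects differ in places.

```python
import json
import re


def select_tests(test_cases, prefix, selector=""):
    """Return the test names that pass the prefix check and the selector."""
    selected = []
    for case in test_cases:
        name = case["test_name"]
        if not name.startswith(prefix):
            raise ValueError(f"test_name must start with {prefix!r}: {name}")
        # An empty selector keeps every test, like an unset TEST_SELECTOR.
        if selector and not re.search(selector, name):
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    cases = json.loads(
        '[{"test_name": "latency_qwen3_8B_tp1"},'
        ' {"test_name": "latency_qwen2_5_7B_tp1"}]'
    )
    print(select_tests(cases, "latency_", selector="qwen3"))
```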
--- a/benchmarks/tests/serving-tests.json
+++ b/benchmarks/tests/serving-tests.json
@@ -0,0 +1,77 @@
+[
+  {
+    "test_name": "serving_qwen2_5vl_7B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "trust_remote_code": "",
+      "max_model_len": 16384
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "endpoint_type": "openai-chat",
+      "dataset_name": "hf",
+      "hf_split": "train",
+      "endpoint": "/v1/chat/completions",
+      "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
+      "num_prompts": 200
+    }
+  },
+  {
+    "test_name": "serving_qwen3_8B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "load_format": "dummy"
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "endpoint_type": "vllm",
+      "dataset_name": "sharegpt",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200
+    }
+  },
+  {
+    "test_name": "serving_qwen2_5_7B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "load_format": "dummy"
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "endpoint_type": "vllm",
+      "dataset_name": "sharegpt",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200
+    }
+  }
+]
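Each serving test case is run once per entry in its `qps_list`, and the script names each result file `<test_name>_qps_<qps>.json`. A small sketch of that expansion follows. The helper names `result_filenames` and `request_rate` are hypothetical.

```python
def result_filenames(test_name, qps_list):
    """Build one result file name per QPS value, as the script does."""
    return [f"{test_name}_qps_{qps}.json" for qps in qps_list]


def request_rate(qps):
    """Interpret a qps_list entry as a number; the string "inf" becomes infinity."""
    return float(qps)


if __name__ == "__main__":
    for name in result_filenames("serving_qwen2_5_7B_tp1", [1, 4, 16, "inf"]):
        print(name)
```

The generated names match the example listing in the README, such as `serving_qwen2_5_7B_tp1_qps_inf.json`.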
--- a/benchmarks/tests/throughput-tests.json
+++ b/benchmarks/tests/throughput-tests.json
@@ -0,0 +1,38 @@
+[
+  {
+    "test_name": "throughput_qwen3_8B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200,
+      "backend": "vllm"
+    }
+  },
+  {
+    "test_name": "throughput_qwen2_5vl_7B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "backend": "vllm-chat",
+      "dataset_name": "hf",
+      "hf_split": "train",
+      "max_model_len": 16384,
+      "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
+      "num_prompts": 200
+    }
+  },
+  {
+    "test_name": "throughput_qwen2_5_7B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200,
+      "backend": "vllm"
+    }
+  }
+]
+