# Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (more coming soon).
- Latency tests
  - Input length: 32 tokens.
  - Output length: 128 tokens.
  - Batch size: fixed (8).
  - Models: Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: end-to-end latency (mean, median, p99).
- Throughput tests
  - Input length: randomly sample 200 prompts from the ShareGPT dataset (with a fixed random seed).
  - Output length: the corresponding output length of these 200 prompts.
  - Batch size: dynamically determined by vllm to achieve maximum throughput.
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput.
- Serving tests
  - Input length: randomly sample 200 prompts from the ShareGPT dataset (with a fixed random seed).
  - Output length: the corresponding output length of these 200 prompts.
  - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  - **Average QPS (queries per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is determined by a random Poisson process (with a fixed random seed). A rough CLI sketch of such a fixed-QPS run follows this list.
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput, TTFT (time to first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
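
For intuition, here is a sketch of what a single fixed-QPS serving run looks like as a standalone CLI invocation. This is an illustration only, not the exact command the benchmark harness issues; flags such as `--request-rate` and `--seed` are assumed to be available in your vllm version.

```shell
# Sketch of one fixed-QPS serving run (illustrative; not the harness's exact command).
# Assumes a vllm server is already running locally (see "Online serving" below).
# --request-rate sets the average arrival rate; requests are spaced by a Poisson process.
# --seed pins the random seed so prompt sampling and arrival times are reproducible.
vllm bench serve --model Qwen/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 200 \
  --request-rate 4 \
  --seed 42
```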
**Benchmarking Duration**: about 800 seconds for a single model.
# Quick Use
## Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install necessary dependencies for benchmarks:
```shell
pip install -r benchmarks/requirements-bench.txt
```
- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights for the given model instead of downloading them from the internet, which greatly reduces benchmark time.
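As a quick standalone check that dummy weights work for your model (a sketch outside the benchmark harness; the model name and `--max-model-len` value here are arbitrary), you can start a server with randomly initialized weights:
```shell
# Illustrative only: serve with randomly initialized ("dummy") weights,
# so no checkpoint is downloaded from the internet.
vllm serve Qwen/Qwen2.5-7B-Instruct --load-format dummy --max-model-len 4096
```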
- To run customized benchmarks, feel free to add your own models and parameters to the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files. Let's take `Qwen2.5-VL-7B-Instruct` as an example:
```json
[
  {
    "test_name": "serving_qwen2_5vl_7B_tp1",
    "qps_list": [
      1,
      4,
      16,
      "inf"
    ],
    "server_parameters": {
      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
      "tensor_parallel_size": 1,
      "swap_space": 16,
      "disable_log_stats": "",
      "disable_log_requests": "",
      "trust_remote_code": "",
      "max_model_len": 16384
    },
    "client_parameters": {
      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
      "backend": "openai-chat",
      "dataset_name": "hf",
      "hf_split": "train",
      "endpoint": "/v1/chat/completions",
      "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
      "num_prompts": 200
    }
  }
]
```
This JSON file is parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
- **Test Overview**
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): the test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- **Server Parameters**
  - Model: Qwen/Qwen2.5-VL-7B-Instruct
  - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
  - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
  - disable_log_stats: disables logging of performance statistics
  - disable_log_requests: disables logging of individual requests
  - Trust Remote Code: enabled (allows execution of model-specific custom code)
  - Max Model Length: 16,384 tokens (maximum context length supported by the model)
- **Client Parameters**
  - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
  - Backend: openai-chat (the client uses the OpenAI-compatible chat API format)
  - Dataset Source: Hugging Face (hf)
  - Dataset Split: train
  - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
  - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
  - Number of Prompts: 200 (the total number of prompts used during the test)
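
Roughly speaking, the benchmark script turns the `server_parameters` above into a server launch command along these lines. This is an illustrative approximation, not the exact command the script builds; in particular, empty-string values are assumed to become boolean flags:

```shell
# Approximate equivalent of the server_parameters above (illustrative only).
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --tensor-parallel-size 1 \
  --swap-space 16 \
  --disable-log-stats \
  --disable-log-requests \
  --trust-remote-code \
  --max-model-len 16384
```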
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command from the vllm-ascend root directory:
```shell
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:
```shell
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
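For a quick look at the headline serving metrics, a small `jq` query can help. The field names used below (`request_throughput`, `mean_ttft_ms`, `p99_ttft_ms`, `mean_itl_ms`) are assumptions based on typical vllm serving-benchmark output; inspect your result files to confirm them first.
```shell
# Sketch: extract a few headline metrics from one serving result file.
# Field names are assumed; run `jq 'keys'` on the file first to confirm.
jq '{request_throughput, mean_ttft_ms, p99_ttft_ms, mean_itl_ms}' \
  benchmarks/results/serving_qwen2_5_7B_tp1_qps_4.json
```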
### Use benchmark cli
For more flexible and customized use, the benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
#### Online serving
1. Launch the server:
```shell
vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
```
2. Run performance tests using the CLI:
```shell
vllm bench serve --model Qwen2.5-VL-7B-Instruct \
--endpoint-type "openai-chat" --dataset-name hf \
--hf-split train --endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 200 \
--request-rate 16
```
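To roughly reproduce the `qps_list` from the JSON configuration with the CLI, you can repeat the run at each request rate, for example with a small loop. This is a sketch; the result-saving flags (`--save-result`, `--result-filename`) and the output file naming are assumptions, so adjust them to your vllm version and needs.
```shell
# Sketch: sweep the QPS levels from the JSON config (1, 4, 16, inf).
for qps in 1 4 16 inf; do
  vllm bench serve --model Qwen2.5-VL-7B-Instruct \
    --endpoint-type "openai-chat" --dataset-name hf \
    --hf-split train --endpoint "/v1/chat/completions" \
    --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
    --num-prompts 200 \
    --request-rate "$qps" \
    --save-result --result-filename "serving_qwen2_5vl_7B_tp1_qps_${qps}.json"
done
```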
#### Offline
- **Throughput**
```shell
vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
--dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 --backend vllm
```
- **Latency**
```shell
vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
--load-format dummy --num-iters-warmup 5 --num-iters 15
```