[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)

### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Clean up code

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main: 29c6fbe58c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Author: Li Wang
Date: 2025-07-25 22:16:10 +08:00
Committed by: GitHub
Parent: d629f0b2b5
Commit: bdfb065b5d
31 changed files with 215 additions and 64 deletions


@@ -26,7 +26,6 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
**Benchmarking Duration**: about 800 seconds per model.
# Quick Use
## Prerequisites
Before running the benchmarks, ensure the following:
@@ -34,11 +33,12 @@ Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install necessary dependencies for benchmarks:
```shell
pip install -r benchmarks/requirements-bench.txt
```
- For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights based on the given model without downloading them from the internet, which can greatly reduce benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests). Let's take `Qwen2.5-VL-7B-Instruct` as an example:
```json
@@ -72,27 +72,28 @@ Before running the benchmarks, ensure the following:
}
]
```
This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A sketch of how such an entry could be assembled programmatically follows the parameter list below.
- **Test Overview**
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct
  - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
  - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
  - disable_log_stats: disables logging of performance statistics.
  - disable_log_requests: disables logging of individual requests.
  - Trust Remote Code: enabled (allows execution of model-specific custom code)
  - Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
@@ -110,17 +111,18 @@ Before running the benchmarks, ensure the following:
  - Number of Prompts: 200 (the total number of prompts used during the test)
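
Putting these parameters together, here is a minimal sketch of how such a test entry could be generated programmatically. The top-level key names (`test_name`, `qps_list`, `server_parameters`, `client_parameters`) follow the description above, but treat the exact schema as an assumption to verify against the JSON files shipped in `benchmarks/tests`:

```python
import json

# Hypothetical reconstruction of one serving test entry; verify the exact
# key names against the JSON files shipped in benchmarks/tests.
test_case = {
    "test_name": "serving_qwen2_5vl_7B_tp1",
    # "inf" means infinite load, typically used for stress testing
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "tensor_parallel_size": 1,
        "swap_space": 16,
        "disable_log_stats": "",     # flag-style option: presence enables it
        "disable_log_requests": "",  # flag-style option: presence enables it
        "trust_remote_code": "",     # flag-style option: presence enables it
        "max_model_len": 16384,
    },
    "client_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "num_prompts": 200,
    },
}

with open("benchmarks/tests/my-serving-tests.json", "w") as f:
    json.dump([test_case], f, indent=2)  # the script expects a list of cases
```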
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command from the vllm-ascend root directory:
```shell
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the `benchmarks/results` folder. The output files may resemble the following:
```shell
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
@@ -129,6 +131,7 @@ Once the script completes, you can find the results in the benchmarks/results fo
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
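
Since every run writes one JSON file per test case, a small helper makes it easy to eyeball them side by side. A minimal sketch, assuming only that each result file is a flat JSON object whose numeric fields are the metrics of interest (the exact metric names differ between serving, throughput, and latency results):

```python
import json
from pathlib import Path

results_dir = Path("benchmarks/results")  # default output folder of the script
for path in sorted(results_dir.glob("*.json")):
    data = json.loads(path.read_text())
    # Keep scalar numeric fields only; nested lists (e.g. raw latencies) are skipped.
    metrics = {k: round(v, 2) for k, v in data.items()
               if isinstance(v, (int, float)) and not isinstance(v, bool)}
    print(f"{path.name}: {metrics}")
```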
### Use benchmark cli
@@ -137,30 +140,36 @@ For more flexible and customized use, benchmark cli is also provided to run onli
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
#### Online serving
1. Launch the server:
   ```shell
   vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
   ```
2. Run performance tests using the cli:
   ```shell
   vllm bench serve --model Qwen2.5-VL-7B-Instruct \
     --endpoint-type "openai-chat" --dataset-name hf \
     --hf-split train --endpoint "/v1/chat/completions" \
     --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
     --num-prompts 200 \
     --request-rate 16
   ```
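
Before kicking off the client, it can help to wait until the server is actually ready. A small sketch that polls the server's `/health` endpoint; the URL and default port 8000 are assumptions, so adjust them to your deployment:

```python
import time
import urllib.request

def wait_for_server(url: str = "http://localhost:8000/health",
                    timeout: float = 300.0) -> None:
    # Poll until the server answers 200 OK or the timeout expires.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(2)
    raise TimeoutError(f"server at {url} not ready after {timeout}s")

wait_for_server()
```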
#### Offline
- **Throughput**
  ```shell
  vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
    --dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 --backend vllm
  ```
- **Latency**
  ```shell
  vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
    --load-format dummy --num-iters-warmup 5 --num-iters 15
  ```
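
When iterating on configurations, it is useful to diff a new latency result against a baseline. A minimal sketch, assuming the latency JSON exposes an `avg_latency` field in seconds (a hypothetical field name; inspect your own output file and adjust):

```python
import json
import sys

def avg_latency(path: str) -> float:
    # "avg_latency" is an assumed field name; adjust if your output differs.
    with open(path) as f:
        return json.load(f)["avg_latency"]

base = avg_latency(sys.argv[1])  # e.g. results/latency_qwen2_5_7B_tp1.json
cand = avg_latency(sys.argv[2])  # a newer run to compare against
print(f"baseline={base:.3f}s candidate={cand:.3f}s "
      f"delta={100 * (cand / base - 1):+.1f}%")
```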


@@ -28,4 +28,4 @@
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput.
{throughput_tests_markdown_table}