[Lint]Style: reformat markdown files via markdownlint (#5884)

### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main: bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Commit 4811ba62e0 (parent 96edd4673f)
Author: SILONG ZENG
Date: 2026-01-15 09:06:01 +08:00
Committed by: GitHub
75 changed files with 711 additions and 308 deletions


@@ -1,8 +1,13 @@
-# Introduction
+# vLLM Ascend Benchmarks
+## Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
-# Overview
+## Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).
- Latency tests
  - Input length: 32 tokens.
  - Output length: 128 tokens.
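
As an aside (not part of the repository's scripts), a single latency run with the lengths above could be sketched with vLLM's bench CLI; the model name is a placeholder and the flag names are assumed from upstream vLLM:

```shell
# Hedged sketch: a latency measurement shaped like the test above.
# --load-format dummy uses randomly initialized weights, so no download is needed.
vllm bench latency \
  --model Qwen/Qwen2.5-7B-Instruct \
  --load-format dummy \
  --input-len 32 \
  --output-len 128 \
  --num-iters 10
```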
@@ -26,8 +31,10 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
**Benchmarking Duration**: about 800 seconds for a single model.
-# Quick Use
-## Prerequisites
+## Quick Use
+### Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
@@ -41,7 +48,7 @@ Before running the benchmarks, ensure the following:
- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`; this constructs random weights based on the passed model instead of downloading the weights from the internet, which can greatly reduce the benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests); let's take `Qwen2.5-VL-7B-Instruct` as an example:
-```shell
+```json
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
@@ -75,45 +82,46 @@ Before running the benchmarks, ensure the following:
This JSON will be parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more details on the parameters, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A sketch of how such a test case can be flattened into CLI flags follows the parameter list below.
- **Test Overview**
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct
  - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
  - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
  - disable_log_stats: disables logging of performance statistics.
  - disable_log_requests: disables logging of individual requests.
  - Trust Remote Code: enabled (allows execution of model-specific custom code)
  - Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
  - Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
  - Dataset Source: Hugging Face (hf)
  - Dataset Split: train
  - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
  - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
  - Number of Prompts: 200 (the total number of prompts used during the test)
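
To make the JSON-to-flags mapping concrete, here is a minimal sketch. It assumes a shell variable `test_case` holding one entry from the tests file and a `jq`-based helper; both are illustrative and are not the repository's exact script:

```shell
# Hedged sketch: flatten a test case's parameter objects into CLI-style flags.
# Assumes $test_case holds one JSON object from the tests file (illustrative only).
json2args() {
  # e.g. {"tensor_parallel_size": 1, "swap_space": 16} -> "--tensor-parallel-size 1 --swap-space 16"
  jq -r 'to_entries | map("--\(.key | gsub("_"; "-")) \(.value | tostring)") | join(" ")'
}

server_args=$(echo "$test_case" | jq '.server_parameters' | json2args)
client_args=$(echo "$test_case" | jq '.client_parameters' | json2args)

echo "server flags: $server_args"
echo "client flags: $client_args"
```

The repository's own script additionally handles details such as boolean flags; the helper above only illustrates the idea.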
-## Run benchmarks
+### Run benchmarks
-### Use benchmark script
+#### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:
```shell
@@ -134,11 +142,13 @@ Once the script completes, you can find the results in the benchmarks/results fo
These files contain detailed benchmarking results for further analysis.
-### Use benchmark cli
+#### Use benchmark cli
For more flexible and customized use, a benchmark cli is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
-#### Online serving
+##### Online serving
1. Launch the server:
```shell
@@ -156,7 +166,8 @@ Similarly, lets take `Qwen2.5-VL-7B-Instruct` benchmark as an example:
--request-rate 16
```
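
For reference, a client invocation consistent with the parameters described earlier might look like the sketch below. It uses upstream vLLM's `vllm bench serve` subcommand; the flag names are assumed from upstream vLLM and may differ from the exact command in this guide:

```shell
# Hedged sketch: an openai-chat serving benchmark against the server launched above,
# using the dataset, prompt count, and request rate described earlier.
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/vision-arena-bench-v0.1 \
  --hf-split train \
  --num-prompts 200 \
  --request-rate 16
```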
-#### Offline
+##### Offline
- **Throughput**
```shell