[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
Reformat markdown files via markdownlint.
- vLLM version: v0.13.0
- vLLM main: bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
@@ -1,8 +1,13 @@
# vLLM Ascend Benchmarks

## Introduction

This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.

## Overview

**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the full list of supported devices), with different models (coming soon).

- Latency tests
  - Input length: 32 tokens.
  - Output length: 128 tokens.
@@ -26,8 +31,10 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
**Benchmarking Duration**: about 800 seconds per model.

## Quick Use

### Prerequisites

Before running the benchmarks, ensure the following:

- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
@@ -41,7 +48,7 @@ Before running the benchmarks, ensure the following:
- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights for the given model instead of downloading the weights from the internet, which greatly reduces benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests). Let's take `Qwen2.5-VL-7B-Instruct` as an example:

```json
[
    {
        "test_name": "serving_qwen2_5vl_7B_tp1",
@@ -75,45 +82,46 @@ Before running the benchmarks, ensure the following:
This JSON is parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A consolidated sketch of the full entry follows the parameter list below.

- **Test Overview**
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).

- Server Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct
  - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
  - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
  - disable_log_stats: disables logging of performance statistics.
  - disable_log_requests: disables logging of individual requests.
  - Trust Remote Code: enabled (allows execution of model-specific custom code)
  - Max Model Length: 16,384 tokens (maximum context length supported by the model)

- Client Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
  - Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
  - Dataset Source: Hugging Face (hf)
  - Dataset Split: train
  - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
  - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
  - Number of Prompts: 200 (the total number of prompts used during the test)
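Putting this together, the full JSON entry might look like the sketch below. This is a hedged reconstruction from the parameter list above; the key names (`qps_list`, `server_parameters`, `client_parameters`) and the empty-string convention for boolean flags follow vllm's benchmark-suite JSON and may differ slightly from the actual file in [benchmarks/tests](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests).

```json
[
    {
        "test_name": "serving_qwen2_5vl_7B_tp1",
        "qps_list": [1, 4, 16, "inf"],
        "server_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "trust_remote_code": "",
            "max_model_len": 16384
        },
        "client_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "backend": "openai-chat",
            "dataset_name": "hf",
            "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
            "hf_split": "train",
            "endpoint": "/v1/chat/completions",
            "num_prompts": 200
        }
    }
]
```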
### Run benchmarks

#### Use benchmark script

The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

```shell
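# The command body is elided in this diff excerpt. As a hedged sketch, the
# benchmark entry point is assumed to live under benchmarks/scripts/
# (hypothetical path; verify against the repository layout):
bash ./benchmarks/scripts/run-performance-benchmarks.sh
```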
@@ -134,11 +142,13 @@ Once the script completes, you can find the results in the benchmarks/results fo
These files contain detailed benchmarking results for further analysis.

#### Use benchmark cli

For more flexible and customized use, a benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:

##### Online serving

1. Launch the server:

```shell
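# The server command is elided in this diff excerpt. A hedged sketch built
# from the server parameters listed above, using standard vllm serve flags:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --tensor-parallel-size 1 \
  --swap-space 16 \
  --max-model-len 16384 \
  --trust-remote-code
```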
@@ -156,7 +166,8 @@ Similarly, let’s take `Qwen2.5-VL-7B-Instruct` benchmark as an example:
2. Run the benchmark client (this hunk retains only the tail of the command):

   ```shell
     --request-rate 16
   ```
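For reference, the complete client invocation might look like the sketch below, assembled from the client parameters listed earlier. It assumes the `vllm bench serve` entry point and its usual flag spellings; treat it as a sketch rather than the exact command from this PR:

```shell
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/vision-arena-bench-v0.1 \
  --hf-split train \
  --num-prompts 200 \
  --request-rate 16
```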
##### Offline

- **Throughput**

  ```shell
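  # The command is elided in this diff excerpt. A hedged sketch of an offline
  # throughput run, reusing the latency-test input/output lengths from this
  # document (vllm bench throughput flags, assumed):
  vllm bench throughput \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --input-len 32 \
    --output-len 128 \
    --num-prompts 200
  ```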