[Lint]Style: reformat markdown files via markdownlint (#5884)

### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main: bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Commit 4811ba62e0 (parent 96edd4673f)
Author: SILONG ZENG
Date: 2026-01-15 09:06:01 +08:00
Committed by: GitHub
75 changed files with 711 additions and 308 deletions


@@ -1,8 +1,13 @@
-# Introduction
+# vLLM Ascend Benchmarks
+## Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
-# Overview
+## Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).
- Latency tests
  - Input length: 32 tokens.
  - Output length: 128 tokens.
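
As an aside (not part of the repository's scripts), a single latency run with the lengths above could be sketched with vLLM's bench CLI; the model name is a placeholder and the flag names are assumed from upstream vLLM:

```shell
# Hedged sketch: a latency measurement shaped like the test above.
# --load-format dummy uses randomly initialized weights, so no download is needed.
vllm bench latency \
  --model Qwen/Qwen2.5-7B-Instruct \
  --load-format dummy \
  --input-len 32 \
  --output-len 128 \
  --num-iters 10
```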
@@ -26,8 +31,10 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
**Benchmarking Duration**: about 800 seconds for a single model.
-# Quick Use
-## Prerequisites
+## Quick Use
+### Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
@@ -41,7 +48,7 @@ Before running the benchmarks, ensure the following:
- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`; this constructs random weights based on the passed model instead of downloading the weights from the internet, which can greatly reduce the benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests); let's take `Qwen2.5-VL-7B-Instruct` as an example:
-```shell
+```json
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
@@ -75,45 +82,46 @@ Before running the benchmarks, ensure the following:
This JSON will be parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more details on the parameters, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A sketch of how such a test case can be flattened into CLI flags follows the parameter list below.
- **Test Overview**
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct
  - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
  - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
  - disable_log_stats: disables logging of performance statistics.
  - disable_log_requests: disables logging of individual requests.
  - Trust Remote Code: enabled (allows execution of model-specific custom code)
  - Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
  - Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
  - Dataset Source: Hugging Face (hf)
  - Dataset Split: train
  - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
  - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
  - Number of Prompts: 200 (the total number of prompts used during the test)
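
To make the JSON-to-flags mapping concrete, here is a minimal sketch. It assumes a shell variable `test_case` holding one entry from the tests file and a `jq`-based helper; both are illustrative and are not the repository's exact script:

```shell
# Hedged sketch: flatten a test case's parameter objects into CLI-style flags.
# Assumes $test_case holds one JSON object from the tests file (illustrative only).
json2args() {
  # e.g. {"tensor_parallel_size": 1, "swap_space": 16} -> "--tensor-parallel-size 1 --swap-space 16"
  jq -r 'to_entries | map("--\(.key | gsub("_"; "-")) \(.value | tostring)") | join(" ")'
}

server_args=$(echo "$test_case" | jq '.server_parameters' | json2args)
client_args=$(echo "$test_case" | jq '.client_parameters' | json2args)

echo "server flags: $server_args"
echo "client flags: $client_args"
```

The repository's own script additionally handles details such as boolean flags; the helper above only illustrates the idea.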
-## Run benchmarks
+### Run benchmarks
-### Use benchmark script
+#### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:
```shell
@@ -134,11 +142,13 @@ Once the script completes, you can find the results in the benchmarks/results fo
These files contain detailed benchmarking results for further analysis.
-### Use benchmark cli
+#### Use benchmark cli
For more flexible and customized use, a benchmark cli is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
-#### Online serving
+##### Online serving
1. Launch the server:
```shell
@@ -156,7 +166,8 @@ Similarly, lets take `Qwen2.5-VL-7B-Instruct` benchmark as an example:
--request-rate 16
```
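
For reference, a client invocation consistent with the parameters described earlier might look like the sketch below. It uses upstream vLLM's `vllm bench serve` subcommand; the flag names are assumed from upstream vLLM and may differ from the exact command in this guide:

```shell
# Hedged sketch: an openai-chat serving benchmark against the server launched above,
# using the dataset, prompt count, and request rate described earlier.
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/vision-arena-bench-v0.1 \
  --hf-split train \
  --num-prompts 200 \
  --request-rate 16
```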
-#### Offline
+##### Offline
- **Throughput**
```shell