[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main: bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
This commit is contained in:
@@ -1,6 +1,7 @@
# Contributing

## Building and Testing

It's recommended to set up a local development environment to build vllm-ascend and run tests
before you submit a PR.
@@ -116,7 +116,7 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
- Step 1. Install LWS CRD resources

-See https://lws.sigs.k8s.io/docs/installation/ Which can be used as a reference
+See <https://lws.sigs.k8s.io/docs/installation/> Which can be used as a reference

- Step 2. Deploy the following yaml file `lws.yaml` as what you want
@@ -318,14 +318,14 @@ Since our script is Kubernetes-friendly, we need to actively pass in some cluste
`cluster_hosts: ["xxx.xxx.xxx.188", "xxx.xxx.xxx.212"]`

- Step 2. Install develop environment
- Install vllm-ascend develop packages on every cluster host

``` bash
cd /vllm-workspace/vllm-ascend
python3 -m pip install -r requirements-dev.txt
```

- Install AISBench on the first host (leader node) in cluster_hosts

``` bash
export AIS_BENCH_TAG="v3.0-20250930-master"
@@ -248,7 +248,7 @@ This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com
To run nightly multi-node test cases locally, refer to the `Running Locally` section of [Multi Node Test](./multi_node_test.md).

-#### E2E test example:
+#### E2E test example

- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
- Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
@@ -1,8 +1,11 @@
# Using AISBench

This document guides you to conduct accuracy testing using [AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench provides accuracy and performance evaluation for many datasets.

## Online Server

### 1. Start the vLLM server

You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
@@ -44,7 +47,7 @@ vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
The vLLM server has started successfully if you see logs as below:

-```
+```shell
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -220,7 +223,7 @@ ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_pro
After each dataset execution, you can get the result from saved files such as `outputs/default/20250628_151326`. An example is as follows:

-```
+```shell
20250628_151326/
├── configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
@@ -276,7 +279,7 @@ ais_bench --models vllm_api_stream_chat --datasets textvqa_gen_base64 --summariz
After execution, you can get the result from saved files. An example is as follows:

-```
+```shell
20251031_070226/
|-- configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
| `-- 20251031_070226_122485.py
@@ -34,7 +34,7 @@ vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
If the vLLM server is started successfully, you can see information shown below:

-```
+```shell
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -42,7 +42,7 @@ INFO: Application startup complete.
Once your server is started, you can query the model with input prompts in a new terminal:

-```
+```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
@@ -67,7 +67,7 @@ pip install gradio plotly evalscope
You can use `evalscope eval` to run GSM8K for accuracy testing:

-```
+```shell
evalscope eval \
  --model Qwen/Qwen2.5-7B-Instruct \
  --api-url http://localhost:8000/v1 \
@@ -101,7 +101,7 @@ pip install evalscope[perf] -U
You can use `evalscope perf` to run perf testing:

-```
+```shell
evalscope perf \
  --url "http://localhost:8000/v1/chat/completions" \
  --parallel 5 \
@@ -1,8 +1,11 @@
# Using lm-eval

This document guides you to conduct accuracy testing using [lm-eval][1].

## Online Server

### 1. Start the vLLM server

You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
@@ -34,7 +37,7 @@ vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
The vLLM server has started successfully if you see logs as below:

-```
+```shell
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -44,7 +47,7 @@ INFO: Application startup complete.
You can query the result with input prompts:

-```
+```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
@@ -71,7 +74,7 @@ curl http://localhost:8000/v1/completions \
The output format matches the following:

-```
+```json
{
  "id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
  "object": "text_completion",
@@ -108,7 +111,7 @@ pip install lm-eval[api]
Run the following command:

-```
+```shell
# Only test gsm8k dataset in this demo
lm_eval \
  --model local-completions \
@@ -119,7 +122,7 @@ lm_eval \
After 30 minutes, the output is as shown below:

-```
+```shell
The markdown format results is as below:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
@@ -130,6 +133,7 @@ The markdown format results is as below:
```

## Offline Server

### 1. Run docker container

You can run a docker container on a single NPU:
@@ -161,6 +165,7 @@ docker run --rm \
```

### 2. Run GSM8K using lm-eval for accuracy testing

Install lm-eval in the container:

```bash
@@ -170,7 +175,7 @@ pip install lm-eval
Run the following command:

-```
+```shell
# Only test gsm8k dataset in this demo
lm_eval \
  --model vllm \
@@ -181,7 +186,7 @@ lm_eval \
After 1 to 2 minutes, the output is shown below:

-```
+```shell
The markdown format results is as below:

Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
@@ -1,4 +1,5 @@
# Using OpenCompass

This document guides you to conduct accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).

## 1. Online Server
@@ -33,7 +34,7 @@ vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
The vLLM server has started successfully if you see information as below:

-```
+```shell
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -41,7 +42,7 @@ INFO: Application startup complete.
Once your server is started, you can query the model with input prompts in a new terminal.

-```
+```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
@@ -53,6 +54,7 @@ curl http://localhost:8000/v1/completions \
```

## 2. Run C-Eval using OpenCompass for accuracy testing

Install OpenCompass and configure the environment variables in the container:

```bash
@@ -107,13 +109,13 @@ models = [
Run the following command:

-```
+```shell
python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
```

After 1 to 2 minutes, the output is shown below:

-```
+```shell
The markdown format results is as below:

| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
@@ -4,7 +4,7 @@
In LLM inference, each token requires nearly a thousand operator executions, and when the host launches operators more slowly than the device executes them, inference becomes host-bound. In severe cases, the device is idle for more than half of the time. To solve this problem, we use graph mode in LLM inference.

-```
+```shell
eager mode:

host: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |
@@ -38,11 +38,12 @@ But in reality, graph mode is not that simple.
Because a graph can only replay the ops captured before, without re-doing tiling or checking the graph input, we need to ensure the consistency of the graph input. But the model input's shape depends on the requests scheduled by the Scheduler, so we can't guarantee this consistency.

Obviously, we can solve this problem by capturing the biggest shape and padding all of the model input to it. But this brings a lot of redundant computing and makes performance worse. So we can capture multiple graphs with different shapes and pad the model input to the nearest graph, which greatly reduces redundant computing. But when `max_num_batched_tokens` is very large, the number of graphs that need to be captured also becomes very large. We also know that when the input tensor's shape is large, the computing time is long, so graph mode is not necessary in that case. So all we need to do is:

1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold;
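The three rules above can be sketched in a few lines; the capture sizes and threshold below are made-up illustrative values, not what vllm-ascend actually configures.

```python
import bisect

# Illustrative values only; real sizes come from the vLLM config.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]  # sorted sizes of pre-captured graphs
THRESHOLD = 32                        # above this, fall back to eager mode

def select_execution(num_scheduled_tokens: int) -> tuple:
    """Return ("eager", n) or ("graph", padded_size) for one batch."""
    if num_scheduled_tokens > THRESHOLD:
        # Large batches are compute-bound, so graph replay gains little.
        return ("eager", num_scheduled_tokens)
    # Pad the input up to the nearest captured graph size.
    idx = bisect.bisect_left(CAPTURE_SIZES, num_scheduled_tokens)
    return ("graph", CAPTURE_SIZES[idx])
```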
-```
+```shell
| graph1 |
| graph2 |
| graph3 |
@@ -21,6 +21,7 @@ vLLM Ascend Currently supports Mooncake Store for KV Cache Pool. To enable Moonc
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).

## How it works?

The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture.

Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
@@ -28,6 +29,7 @@ Each connector implements a unified interface for storing, retrieving, and trans
When combined with vLLM’s Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.

### 1. Combining KV Cache Pool with HBM Prefix Caching

Prefix Caching with HBM is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
@@ -54,17 +56,22 @@ To Enable this feature, we need to setup both Mooncake Connector and Mooncake St
For details, please also refer to the Mooncake Connector Store Deployment Guide.

## How is MooncakestoreConnectorV1 Implemented?

**MooncakestoreConnectorV1** inherits the KV Connector V1 class in vLLM V1: by implementing the required methods defined in the KV connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.

MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes as well as other hashing-related designs. On top of this, we have also added our own designs, including a `KVTransferThread` that allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimizations such as removing the `LocalBuffer` in LMCache to avoid redundant data transfer.

The KV Connector methods that need to be implemented can be categorized into scheduler-side methods that are called in the V1 scheduler and worker-side methods that are called in the V1 worker, namely:

-### KV Connector Scheduler-Side Methods:
+### KV Connector Scheduler-Side Methods

`get_num_new_matched_tokens`: Get prefix cache hit in number of tokens through looking up into the KV pool.
`update_states_after_alloc`: Update KVConnector state after temporary buffer alloc.
`build_connector_meta`: Attach the connector metadata to the request object.
`request_finished`: Once a request is finished, determine whether request blocks should be freed now or will be sent asynchronously and freed later.

-### Connector Worker-Side Methods:
+### Connector Worker-Side Methods

`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform KV cache load operation that transfers KV cache from storage to device.
`wait_for_layer_load`: Optional; wait for layer load in the layerwise + async KV load scenario.
@@ -73,6 +80,7 @@ The KV Connector methods that need to be implemented can be categorized into sch
`get_finished`: Get requests that finished KV transfer, `done_sending` if `put` finished, `done_reciving` if `get` finished.
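Put together, the scheduler/worker split can be summarized with a hypothetical skeleton; the method names follow the lists above, while the signatures and bodies are simplified placeholders rather than the real KV connector V1 interface.

```python
class SketchKVConnector:
    """Hypothetical skeleton; not the actual vLLM base class."""

    # --- scheduler-side, called in the V1 scheduler ---
    def get_num_new_matched_tokens(self, request) -> int:
        return 0  # tokens hit in the KV pool; 0 means no prefix cache hit

    def update_states_after_alloc(self, request, blocks) -> None:
        pass  # record the temporary buffer allocation in connector state

    def build_connector_meta(self, scheduler_output) -> dict:
        return {}  # metadata attached to the request object

    def request_finished(self, request) -> bool:
        return True  # True: free blocks now; False: free after async send

    # --- worker-side, called in the V1 worker ---
    def register_kv_caches(self, kv_caches) -> None:
        pass  # register KV cache buffers used for transfer

    def start_load_kv(self, metadata) -> None:
        pass  # start moving KV cache from storage to device

    def get_finished(self) -> tuple:
        return set(), set()  # (done_sending, done_receiving) request ids
```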
## DFX

1. When looking up a key in the KV Pool, if we cannot find the key, there is no cache hit for this specific block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when we try to put a block into the KV Pool and fail, we do not put further blocks (subject to change).
@@ -1,13 +1,15 @@
# Prepare inputs for model forwarding

## Purpose

Information required to perform a model forward pass:

- the inputs
- the corresponding attention metadata of the inputs

The following diagram shows what we should prepare for model inference.
-```
+```shell
           +---------------+
inputs --> |               |
           |     model     | --> output
@@ -20,8 +22,11 @@ Therefore, as long as we have these two pieces of information mentioned above, w
This document will explain **how we obtain the inputs and their corresponding attention metadata**.

## Overview

### 1. Obtain inputs

The workflow of obtaining inputs:

1. Get `token positions`: relative position of each token within its request sequence.

2. Get `token indices`: index of each scheduled token in the token table.
@@ -33,7 +38,9 @@ At last, these `Token IDs` are required to be fed into a model, and also, `posit
**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.

### 2. Build inputs attention metadata

A model requires the following attention metadata during the forward pass:

- `query start location`: start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: length of each request including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: number of computed tokens for each request.
@@ -45,7 +52,9 @@ A model requires these attention metadata during the forward pass:
- `attention mask`: mask matrix applied to attention scores before softmax to control which tokens can attend to each other (usually a causal attention).
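For the request-level fields, a minimal sketch (with made-up per-request token counts) shows how `query start location` and `sequence length` fall out of the scheduled and computed token counts:

```python
from itertools import accumulate

def build_query_start_loc(num_scheduled):
    """Start/end offsets of each request within the flattened token batch."""
    return [0] + list(accumulate(num_scheduled))

def build_seq_lens(num_computed, num_scheduled):
    """Computed tokens plus newly scheduled tokens, per request."""
    return [c + s for c, s in zip(num_computed, num_scheduled)]

# Illustrative: 3 requests scheduling 3, 2 and 5 tokens, pure prefill.
print(build_query_start_loc([3, 2, 5]))      # [0, 3, 5, 10]
print(build_seq_lens([0, 0, 0], [3, 2, 5]))  # [3, 2, 5]
```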
## Before start

There are mainly three types of variables.

- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens
- request level: represents one attribute of each scheduled request, whose length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element)
- system level:
@@ -55,10 +64,11 @@ There are mainly three types of variables.
**Note**: Both of these two tables come from the `_update_states` method before **preparing inputs**. You can take a look if you need more inspiration.

### Tips

Simply put, a `token ID` is an **integer** (usually `int32`), which represents a token.
Example of `Token ID`:

-```
+```shell
| Token ID     | Token         |
|--------------|---------------|
| 0            | [PAD]         |
@@ -76,19 +86,24 @@ Example of `Token ID`:
```

## Go through details

Assumptions:

- maximum number of tokens that can be scheduled at once: 10
- `block size`: 2
- Totally schedule 3 requests. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the maximum token count that can be handled in one request sequence in a model).

These assumptions are configured when starting vLLM. They are not fixed, so you can set them manually.

### Step 1: All requests in the prefill phase
#### Obtain inputs

As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.

-##### 1. Get token positions:
+##### 1. Get token positions

First, determine which request each token belongs to: tokens 0–2 are assigned to **request_0**, tokens 3–4 to **request_1**, and tokens 5–9 to **request_2**. To represent this mapping, we use `request indices`, for example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`.

For each request, use **the number of computed tokens** + **the relative position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`) and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).
@@ -97,13 +112,15 @@ Note: there is more efficient way (using `request indices`) to create positions
Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
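The concatenation above is small enough to sketch directly (the helper name is ours, not from the codebase):

```python
def build_positions(num_computed, num_scheduled):
    """Relative position of every scheduled token within its request."""
    positions = []
    for computed, scheduled in zip(num_computed, num_scheduled):
        # number of computed tokens + relative position of each new token
        positions.extend(computed + i for i in range(scheduled))
    return positions

# Step 1 of the running example: 3, 2 and 5 tokens scheduled, none computed.
print(build_positions([0, 0, 0], [3, 2, 5]))  # [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]
```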
-##### 2. Get token indices:
+##### 2. Get token indices

The shape of the current **Token IDs table** is `(max num request, max model len)`.

Why are `T_3_5`, `T_3_6`, `T_3_7` in this table without being scheduled?

- We fill all Token IDs of one request sequence into this table at once, but we only retrieve the tokens scheduled this time. We then retrieve the remaining Token IDs next time.

-```
+```shell
| T_0_0 | T_0_1 | T_0_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
@@ -120,19 +137,22 @@ Let's say `M = max model len`. Then we can use `token positions` together with `
So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
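That arithmetic can be checked directly with `M = 12` from the assumptions above:

```python
M = 12  # max model len from the running example
request_indices = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
positions = [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]

# Flat index of each scheduled token into the (max num request, M) table.
token_indices = [p + r * M for p, r in zip(positions, request_indices)]
print(token_indices)  # [0, 1, 2, 12, 13, 24, 25, 26, 27, 28]
```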
##### 3. Retrieve the Token IDs

We use `token indices` to select the corresponding `Input IDs` from the token table. The pseudocode is as follows:

-```
+```shell
input_ids = token_table[token_indices]
```

As mentioned before, we refer to these `Token IDs` as `Input IDs`.

- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, T_3_3, T_3_4]`

#### Build inputs attention metadata

In the current **Block Table**, we use the first block (i.e. block_0) to mark the unused block. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.
-```
+```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 0 | 0 | 0 |
@@ -144,13 +164,14 @@ In the current **Block Table**, we use the first block (i.e. block_0) to mark th
The KV cache block in the device memory is like:

-```
+```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```

Let's say `K = max model len / block size = 6`, and we can get each token's `device block number`.

The workflow of achieving slot mapping:

1. Get `block table indices` using `K`, `positions` and `request indices`.

Purpose: For each token, it could be used to select `device block number` from `block table`.
@@ -168,6 +189,7 @@ The workflow of achieving slot mapping:
Purpose: we can use `slot mapping` to store Token IDs into token slots.

Details:

1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. So it equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to select `device block number` from `block table`.
2. (**Token level**) Use `block table indices` to select out `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`
3. (**Token level**) `block offsets` could be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
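The hunk cuts off before the final step, so the last line below (combining block number and offset into a slot) is our assumption about how `slot mapping` is formed; the intermediate values match the ones listed above.

```python
K = 6           # max model len / block size
BLOCK_SIZE = 2

# Flattened block table from the example (3 rows of 6 entries each).
block_table = [1, 2, 0, 0, 0, 0,
               3, 0, 0, 0, 0, 0,
               4, 5, 6, 0, 0, 0]

request_indices = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
positions = [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]

block_table_indices = [r * K + p // BLOCK_SIZE
                       for r, p in zip(request_indices, positions)]
block_numbers = [block_table[i] for i in block_table_indices]
block_offsets = [p % BLOCK_SIZE for p in positions]
# Assumed final step: slot = device block number * block size + offset.
slot_mapping = [n * BLOCK_SIZE + o for n, o in zip(block_numbers, block_offsets)]

print(block_table_indices)  # [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]
print(block_numbers)        # [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]
print(slot_mapping)         # [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
```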
@@ -185,9 +207,11 @@ Details:
- `attention mask`: For all requests that initiate a prefill process, we simply create only one mask matrix for reuse across different requests. The shape of this mask matrix is `5 * 5`:

### Step 2: Chunked prefill

In Step 2, we no longer provide explanations or perform calculations; instead, we directly present the final result.

#### Obtain inputs

Scheduled tokens of each request: `{'0': 1, '1': 1, '2': 3}`

1. `request indices`: `[0, 1, 2, 2, 2]`
@@ -195,7 +219,7 @@ Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`
Current **Token IDs table**:

-```
+```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
@@ -211,11 +235,12 @@ Current **Token IDs table**:
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`

#### Build inputs attention metadata

We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, as they need more space on the device to store the KV cache produced by token generation or chunked prefill.

Current **Block Table**:

-```
+```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 7 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 8 | 0 | 0 |
@@ -227,7 +252,7 @@ Current **Block Table**:
KV cache block in the device memory:

-```
+```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
@@ -237,6 +262,7 @@ KV cache block in the device memory:
4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`

Scheduled token count: `[1, 1, 3]`

- `query start location`: `[0, 1, 2, 5]`

- `sequence length`: `[4, 3, 8]`
@@ -254,6 +280,7 @@ Scheduled token count:`[1, 1, 3]`
Each token has a `1 * 8` vector, and there are 5 scheduled tokens.

## At last

If you understand step 1 and step 2, you will understand all the following steps.

Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, contributions are welcome.
@@ -20,6 +20,7 @@ Its main objective is to eliminate duplicated storage of the KV cache by shardin
DCP primarily influences the Decode logic, as well as the logic for chunked prefill and cached prefill.

## How to Use CP?

Please refer to the [context parallel user guide](../../user_guide/feature_guide/context_parallel.md) for detailed information.

## How It Works?
@@ -15,6 +15,7 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**
## Usage

vLLM Ascend currently supports two types of connectors for handling KV cache management:

- **MooncakeConnector**: D nodes pull KV cache from P nodes.
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
@@ -35,7 +36,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec


-#### Mooncake Connector:
+#### Mooncake Connector

1. The request is sent to the Proxy’s `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
@@ -43,7 +44,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.

-#### Mooncake Layerwise Connector:
+#### Mooncake Layerwise Connector

1. The request is sent to the Proxy’s `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
@@ -55,6 +56,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
### 3. Interface Design

Taking MooncakeConnector as an example, the system is organized into three primary classes:

- **MooncakeConnector**: Base class that provides core interfaces.
- **MooncakeConnectorScheduler**: Interface for scheduling the connectors within the engine core, responsible for managing KV cache transfer requirements and completion.
- **MooncakeConnectorWorker**: Interface for managing KV cache registration and transfer in worker processes.
@@ -1,18 +1,22 @@
# Expert Parallelism Load Balancer (EPLB)

## Why We Need EPLB?

When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.

To facilitate reproduction and deployment, vLLM Ascend ships the deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.



## How to Use EPLB?

Please refer to the EPLB section of the user guide for detailed information: [How to Use EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)

## How It Works?

**EPLB Module Architecture**

-```
+```shell
vllm_ascend
├── eplb
│   ├── adaptor
@@ -35,6 +39,7 @@ vllm_ascend
**1. Adaptor Module**
*Handles registration and adaptation for different MoE model types*

- `abstract_adaptor.py`
  Abstract base class defining unified registration interfaces for EPLB adapters
- `vllm_adaptor.py`
@@ -42,17 +47,18 @@ vllm_ascend
**2. Core Module**
*Implements core algorithms, updates, and asynchronous processing*

- **Policy Submodule**
  *Load balancing algorithms with factory pattern instantiation*
  - `policy_abstract.py`
    Abstract class for load balancing strategy interfaces
  - `policy_dynamic_ep.py`
    Default implementation of open-source EPLB paper algorithm
  - `policy_dynamic_ep_v2.py`
    Enhanced version optimizing expert swaps for low-bandwidth devices (e.g., A2)
  - `policy_flashlb.py`
    Threshold-based adjustment reducing operational costs through layer-wise fluctuation detection
  - `policy_factory.py`
    Strategy factory for automatic algorithm instantiation

- `eplb_device_transfer_loader.py`
@@ -63,12 +69,14 @@ vllm_ascend
  Asynchronous algorithm orchestration and result processing

**3. System Components**

- `eplb_updator.py`
  Central coordinator for load balancing during inference workflows
- `utils.py`
  General utilities for EPLB interface registration

*Key Optimizations:*

1. Maintained original structure while improving technical clarity
2. Standardized terminology
3. Enhanced algorithm differentiation through concise descriptors
@@ -76,14 +84,19 @@ vllm_ascend
5. Preserved file/class relationships while optimizing readability

### Default Algorithm

#### Hierarchical Load Balancing

When the number of server nodes evenly divides the number of expert groups, we use the hierarchical load balancing policy to leverage group-limited expert routing. We first pack the expert groups onto nodes evenly, ensuring balanced loads across different nodes. Then, we replicate the experts within each node. Finally, we pack the replicated experts onto individual NPUs to ensure load balancing across them. The hierarchical load balancing policy can be used in the prefilling stage with a smaller expert-parallel size.

#### Global Load Balancing

In other cases, we use the global load balancing policy, which replicates experts globally regardless of expert groups, and packs the replicated experts onto individual NPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.
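Both policies share the same replicate-then-pack skeleton. Below is a toy sketch of the global variant (illustrative only; the function name and greedy heuristic are not the actual vllm-ascend implementation, and it assumes `len(workload) <= num_replicas <= 2 * len(workload)`):

```python
import heapq

def global_balance(workload, num_replicas, num_npus):
    """Greedy sketch: replicate the hottest experts, then pack replicas onto NPUs."""
    num_experts = len(workload)
    # Give the extra replica slots to the heaviest experts, one extra each.
    counts = [1] * num_experts
    for _, e in heapq.nlargest(num_replicas - num_experts,
                               [(load, e) for e, load in enumerate(workload)]):
        counts[e] += 1
    # Each replica carries an equal share of its expert's load.
    replicas = []
    for e, load in enumerate(workload):
        replicas += [(load / counts[e], e)] * counts[e]
    replicas.sort(reverse=True)
    # Always place the next-heaviest replica on the least-loaded NPU.
    npu_heap = [(0.0, npu) for npu in range(num_npus)]
    heapq.heapify(npu_heap)
    placement = [[] for _ in range(num_npus)]
    for load, e in replicas:
        total, npu = heapq.heappop(npu_heap)
        placement[npu].append(e)
        heapq.heappush(npu_heap, (total + load, npu))
    return placement
```

With `workload=[10, 1, 1]`, 4 replicas, and 2 NPUs, the hot expert 0 ends up replicated on both NPUs.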
### Add a New EPLB Policy

If you want to add a new EPLB policy to vllm_ascend, follow these steps:

1. Inherit the `EplbPolicy` abstract class in `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters `current_expert_table` and `expert_workload` and the return value `newplacement` consistent.
   For example:

@@ -113,6 +126,7 @@ class RandomLoadBalance(EplbPolicy):
2. To add a new EPLB algorithm, register the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
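As a reference for step 1, here is a self-contained sketch of a trivial policy; the `EplbPolicy` stand-in below is hypothetical, so match the actual signatures in `policy_abstract.py` when implementing:

```python
import random
from abc import ABC, abstractmethod

class EplbPolicy(ABC):
    """Stand-in for the abstract class in policy_abstract.py (hypothetical)."""

    @abstractmethod
    def rebalance_experts(self, current_expert_table, expert_workload):
        """Return the new placement for the given load statistics."""

class RandomLoadBalance(EplbPolicy):
    """Toy policy: shuffle each layer's expert placement at random."""

    def rebalance_experts(self, current_expert_table, expert_workload):
        # Copy so the caller's table is left untouched.
        newplacement = [list(layer) for layer in current_expert_table]
        for layer in newplacement:
            random.shuffle(layer)
        return newplacement
```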
### Add a New MoE Model

**Implementation Guide for Model Integration**

1. **Adapter File Modification**
@@ -154,12 +168,17 @@ class RandomLoadBalance(EplbPolicy):
- Benchmark against baseline implementations (e.g., Qwen3-MoE)

*Key Implementation Notes:*

- Preserve existing interface contracts in abstract classes
- Use decorators for non-intrusive patch integration
- Leverage `eplb_utils.py` for shared expert mapping operations

## DFX

### Parameter Validation

#### Integer Parameters

All integer input parameters must explicitly specify their maximum and minimum values and be validated against their valid range. For example, `num_iterations_eplb_update` must be greater than 0:

```python
@@ -176,6 +195,7 @@ All integer input parameters must explicitly specify their maximum and minimum v
```
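A minimal sketch of such a range check (the bounds and helper name are illustrative, not the actual vllm-ascend code):

```python
# Hypothetical bounds; choose limits that match the real parameter's contract.
MIN_ITERATIONS = 1
MAX_ITERATIONS = 2**31 - 1

def validate_num_iterations_eplb_update(value: int) -> int:
    """Reject non-integers and out-of-range values, returning the value if valid."""
    if not isinstance(value, int):
        raise TypeError(f"num_iterations_eplb_update must be int, got {type(value).__name__}")
    if not (MIN_ITERATIONS <= value <= MAX_ITERATIONS):
        raise ValueError(
            f"num_iterations_eplb_update must be in [{MIN_ITERATIONS}, {MAX_ITERATIONS}], got {value}")
    return value
```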
#### File Path

The file path used by EPLB must be validated: check that the path itself is valid and that it carries appropriate read and write permissions. For example:

```python
@@ -203,20 +223,27 @@ The file path for EPLB must be checked for legality, such as whether the file pa
```
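A minimal sketch of such a path check (the helper name is hypothetical; the real checks in vllm-ascend may differ):

```python
import os

def validate_eplb_map_path(path: str) -> str:
    """Resolve the path and verify its directory exists and the file is accessible."""
    resolved = os.path.realpath(path)
    directory = os.path.dirname(resolved) or "."
    if not os.path.isdir(directory):
        raise ValueError(f"Directory does not exist: {directory}")
    # Only check permissions when the file already exists; otherwise it will be created.
    if os.path.exists(resolved) and not os.access(resolved, os.R_OK | os.W_OK):
        raise PermissionError(f"No read/write permission: {resolved}")
    return resolved
```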
### Function Specifications

#### Initialization Function

All EPLB parameters must be initialized with defaults during initialization, with parameter types and default values specified so they can be handled properly.

#### General Functions

All method arguments must specify parameter types and default values, and functions must include default return value handling for default arguments. It is recommended to wrap the function body in `try-except` blocks, specifying the type of exception captured and the failure handling (e.g., logging the exception or returning a failure status).

### Consistency

#### Expert Map

The expert map must be globally unique during initialization and updates. In a multi-node scenario, distributed communication should be used during initialization to verify that the expert maps are consistent across ranks. If they are inconsistent, the user should be told which ranks hold inconsistent maps.
During the update process, if only a few layers or the expert table of a certain rank has changed, the updated expert table must be synchronized with the EPLB context to ensure global consistency.
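The cross-rank check can be sketched by comparing per-rank digests of the expert map; in a real multi-node setup the digests would be exchanged with a collective such as `torch.distributed.all_gather_object` (the helper names below are illustrative):

```python
import hashlib
import json
from collections import Counter

def expert_map_digest(expert_map):
    """Stable digest of an expert map (any JSON-serializable structure)."""
    payload = json.dumps(expert_map, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def find_inconsistent_ranks(gathered_digests):
    """Given one digest per rank, report ranks that disagree with the majority."""
    majority, _ = Counter(gathered_digests).most_common(1)[0]
    return [rank for rank, d in enumerate(gathered_digests) if d != majority]
```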
#### Expert Weight

When updating expert weights, ensure that the memory allocated for the old expert weights has been released, or that the old expert weights are no longer in use.

## Limitation

Before using EPLB, add `export DYNAMIC_EPLB="true"` to the startup script.
Before collecting load data (or performance data), add `export EXPERT_MAP_RECORD="true"` to the startup script.
@@ -16,7 +16,7 @@ We should keep in mind that Patch is not the best way to make vLLM Ascend compat

In `vllm_ascend/patch`, you can see the code structure as follows:

```
```shell
vllm_ascend
├── patch
│   ├── platform
@@ -27,10 +27,10 @@ vllm_ascend
```

- **platform**: The patch code in this directory patches the code in the vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early, when vLLM is initialized.
  - For online mode, the vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the CLI args.
  - For offline mode, the vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory patches the code in the vLLM worker process. It's called by `vllm_ascend/worker/worker::NPUWorker::__init__` when the vLLM worker process is initialized.
  - For both online and offline mode, the vLLM engine core process calls the worker patch in `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.

## How to write a patch

@@ -54,7 +54,7 @@ Before writing a patch, following the principle above, we should patch the least
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_distributed` into `vllm_ascend/patch/platform/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:

```
```python
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `<The target patch module in vLLM>`
@@ -71,5 +71,6 @@ Before writing a patch, following the principle above, we should patch the least
7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should come with a Unit Test and an E2E Test as well. You can find more details in the [test guide](../contribution/testing.md)
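The patch files written in these steps all follow the same monkey-patch pattern: replace an attribute on the target vLLM module at import time. A self-contained sketch with a stand-in module (not real vLLM code):

```python
import types

# Stand-in for a real vLLM module; in an actual patch you would import it,
# e.g. `import vllm.distributed.parallel_state as target_module` (hypothetical).
target_module = types.SimpleNamespace(get_device_name=lambda: "cuda")

def npu_get_device_name():
    # Replacement behavior for the Ascend backend.
    return "npu"

# Applying the patch is a plain attribute assignment, performed at import time
# by the patch modules under vllm_ascend/patch/.
target_module.get_device_name = npu_get_device_name
```

After the patch module is imported, every caller of `target_module.get_device_name()` sees the replacement.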
## Limitation

1. In V1 Engine, vLLM starts three kinds of processes: the main process, the EngineCore process, and the worker process. vLLM Ascend can currently only patch code in the main process and the worker process by default. If you want to patch code running in the EngineCore process, you should patch the EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core` and override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may change automatically. For example, if you run edited vLLM based on v0.9.n, the version may become v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish which version of vLLM you're using. You can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, and then the patch for v0.9.n should work.

@@ -53,7 +53,7 @@ To restrict the operators that are captured, configure the `list` block:

- `scope` (list[str]): In PyTorch pynative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:

```
```json
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
"scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.backward.0"]
"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]
@@ -62,9 +62,9 @@ To restrict the operators that are captured, configure the `list` block:
The `level` setting determines what can be provided—modules when `level=L0`, APIs when `level=L1`, and either modules or APIs when `level=mix`.

- `list` (list[str]): Custom operator list. Options include:
  - Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
  - When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
  - Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded.

Example configuration:
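Since the example itself is elided in this diff, the following is only a hypothetical illustration of how `scope` and `list` fit into a dump configuration; consult the msprobe documentation for the authoritative schema:

```json
{
    "task": "statistics",
    "dump_path": "./dump_output",
    "rank": [],
    "step": [0],
    "level": "mix",
    "statistics": {
        "scope": [],
        "list": ["relu"],
        "data_mode": ["all"]
    }
}
```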
@@ -188,7 +188,7 @@ Use `msprobe graph_visualize` to generate results that can be opened inside `tb_
Replace the paths with your dump directories before invoking `msprobe graph_visualize`. **If you only need to build a single graph**, omit `bench_path` to visualize one dump.
Multi-rank scenarios (single rank, multi-rank, or multi-step multi-rank) are also supported. `npu_path` or `bench_path` must contain folders named `rank+number`, and every rank folder must contain a non-empty `construct.json` together with `dump.json` and `stack.json`. If any `construct.json` is empty, verify that the dump level includes `L0` or `mix`. When comparing graphs, both `npu_path` and `bench_path` must contain the same set of rank folders so they can be paired one-to-one.

```
```shell
├── npu_path or bench_path
|   ├── rank0
|   |   ├── dump_tensor_data (only when the `tensor` option is enabled)
@@ -200,10 +200,12 @@ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

Purpose

- Forces all CPU cores to run under the `performance` governor
- Disables dynamic frequency scaling (e.g., `ondemand`, `powersave`)

Benefits

- Keeps CPU cores at maximum frequency
- Reduces latency jitter
- Improves predictability for inference workloads
@@ -224,6 +226,7 @@ Benefits
- Improves stability for large in-memory models

Notes

- For inference workloads, swap can introduce second-level latency
- Recommended values are `0` or `1`

@@ -244,6 +247,7 @@ Benefits
- Improves performance stability on NUMA systems

Recommended For

- Multi-socket servers
- Ascend / NPU deployments with explicit NUMA binding
- Systems with manually managed CPU and memory affinity
@@ -255,14 +259,17 @@ sysctl -w kernel.sched_migration_cost_ns=50000
```

Purpose

- Increases the cost for the scheduler to migrate tasks between CPU cores

Benefits

- Reduces frequent thread migration
- Improves CPU cache locality
- Lowers latency jitter for inference workloads

Parameter Details

- Unit: nanoseconds (ns)
- Typical recommended range: 50000–100000
- Higher values encourage threads to stay on the same CPU core
@@ -1,4 +1,5 @@
# Performance Benchmark

This document details the benchmark methodology for vllm-ascend, aimed at evaluating performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vllm project.

**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
@@ -38,10 +39,12 @@ pip install -r benchmarks/requirements-bench.txt
```

## 3. Run basic benchmarks

This section introduces how to run performance tests using the benchmark suite built into vLLM.

### 3.1 Dataset
VLLM supports a variety of (datasets)[https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py].

vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).

<style>
th {
@@ -5,19 +5,20 @@ The execution duration of each stage (including pre/post-processing, model forwa
**To reduce the performance overhead, we add this feature, using the NPU event timestamp mechanism to observe the device execution time asynchronously.**

## Usage

* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
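The calling pattern of the two APIs can be exercised as below; `ProfileExecuteDuration` here is a minimal stand-in mimicking the names above (it times on the host with `time.perf_counter`, whereas the real class records NPU events asynchronously):

```python
import time
from contextlib import contextmanager

class ProfileExecuteDuration:
    """Host-side stand-in for the real vllm-ascend class (illustrative only)."""
    _captured = []

    @contextmanager
    def capture_async(self, tag):
        start = time.perf_counter()
        yield
        # The real implementation would record NPU event timestamps instead.
        ProfileExecuteDuration._captured.append(
            (tag, (time.perf_counter() - start) * 1000.0))

    def pop_captured_sync(self):
        # Return everything observed so far and clear the buffer.
        captured, ProfileExecuteDuration._captured = ProfileExecuteDuration._captured, []
        return captured

with ProfileExecuteDuration().capture_async("forward"):
    sum(range(100_000))  # stand-in for a model forward pass
durations = ProfileExecuteDuration().pop_captured_sync()
```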
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**

```
```shell
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```

## Example Output

```
```shell
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
@@ -15,6 +15,7 @@ pip install msserviceprofiler==1.2.2
```

### 1 Preparation

Before starting the service, set the environment variable `SERVICE_PROF_CONFIG_PATH` to point to the profiling configuration file, and set the environment variable `PROFILING_SYMBOLS_PATH` to specify the YAML configuration file for the symbols to be imported. After that, start the vLLM service according to your deployment method.

```bash
@@ -32,6 +33,7 @@ The file `ms_service_profiler_config.json` is the profiling configuration. If it
`service_profiling_symbols.yaml` is the configuration file containing the profiling points to be imported. You can choose **not** to set the `PROFILING_SYMBOLS_PATH` environment variable, in which case the default configuration file will be used. If the file does not exist at the path you specified, likewise, the system will generate a configuration file at your specified path for future configuration. You can customize it according to the instructions in the `Symbols Configuration File` section below.

### 2 Enable Profiling

To enable performance data collection, change the `enable` field from `0` to `1` in the configuration file `ms_service_profiler_config.json`. This can be done with the following sed command:

```bash
@@ -39,6 +41,7 @@ sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
```

### 3 Send Requests

Choose a request-sending method that suits your actual profiling needs:

```bash
@@ -65,6 +68,7 @@ msserviceprofiler analyze --input-path=./ --output-path output
### 5 View Results

After analysis, the `output` directory will contain:

- `chrome_tracing.json`: Chrome tracing format data, which can be opened in [MindStudio Insight](https://www.hiascend.com/document/detail/zh/mindstudio/81RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html).
- `profiler.db`: Performance data in database format.
- `request.csv`: Request-related data.
@@ -77,7 +81,9 @@ After analysis, the `output` directory will contain:
---

## Appendix

(profiling-configuration-file)=

### 1 Profiling Configuration File

The profiling configuration file controls profiling parameters and behavior.
@@ -116,6 +122,7 @@ The configuration is in JSON format. Main parameters:
---

(symbols-configuration-file)=

### 2 Symbols Configuration File

The symbols configuration file defines which functions/methods to profile and supports flexible configuration with custom attribute collection.