[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
@@ -4,7 +4,7 @@

During LLM inference, each token requires nearly a thousand operator launches. When the host launches operators more slowly than the device executes them, inference becomes host-bound; in severe cases the device sits idle for more than half of the time. To solve this problem, we use graph mode in LLM inference.

```shell
eager mode:

host: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |
@@ -38,11 +38,12 @@ But in reality, graph mode is not that simple.

Since a graph can only replay the ops captured beforehand, without re-doing tiling or checking the graph input, we need to ensure the consistency of the graph input. But a model input's shape depends on the requests scheduled by the Scheduler, so this consistency cannot be guaranteed.

Obviously, we could solve this problem by capturing the biggest shape and padding every model input up to it. But that would bring a lot of redundant computing and make performance worse. Instead, we can capture multiple graphs with different shapes and pad each model input to the nearest captured graph, which greatly reduces redundant computing. However, when `max_num_batched_tokens` is very large, the number of graphs that need to be captured also becomes very large. Fortunately, when the input tensor's shape is large, the computing time dominates the launch time, so graph mode is not necessary in that case. So all we need to do is:

1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold.
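The three steps above can be sketched as a small dispatch function (a minimal sketch with hypothetical names and sizes; the actual capture sizes and dispatch logic in vLLM Ascend differ):

```python
import bisect

captured_sizes = [4, 8, 16, 32]   # graph shapes captured ahead of time, sorted
threshold = 32                    # largest captured shape

def dispatch(num_scheduled_tokens: int) -> str:
    """Pick eager mode or the nearest captured graph for a batch."""
    if num_scheduled_tokens > threshold:
        # Large shapes: compute time dominates launch time, eager is fine.
        return "eager"
    # Pad the batch up to the nearest captured graph size.
    idx = bisect.bisect_left(captured_sizes, num_scheduled_tokens)
    return f"graph_{captured_sizes[idx]}"

# dispatch(5) -> "graph_8"; dispatch(100) -> "eager"
```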
```shell
| graph1 |
| graph2 |
| graph3 |
@@ -21,6 +21,7 @@ vLLM Ascend Currently supports Mooncake Store for KV Cache Pool. To enable Moonc

For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).

## How it works?

The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture.

Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
@@ -28,6 +29,7 @@ Each connector implements a unified interface for storing, retrieving, and trans

When combined with vLLM’s Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.

### 1. Combining KV Cache Pool with HBM Prefix Caching

Prefix Caching with HBM is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with a Mooncake-backed KV Pool.
@@ -54,17 +56,22 @@ To Enable this feature, we need to setup both Mooncake Connector and Mooncake St

For details, please also refer to the Mooncake Connector Store Deployment Guide.

## How is MooncakestoreConnectorV1 Implemented?

**MooncakestoreConnectorV1** inherits the KV Connector V1 class in vLLM V1: by implementing the required methods defined in the KV Connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.

MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes, as well as other hashing-related designs. On top of this, we have added our own designs, including a `KVTransferThread` that allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimizations such as removing LMCache's `LocalBuffer` to avoid redundant data transfer.
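The `KVTransferThread` idea can be illustrated with a small sketch (hypothetical names and a toy in-memory store; the real implementation manages NPU buffers and Mooncake handles):

```python
import queue
import threading

class SimpleStore:
    """Toy in-memory backend standing in for a KV store (illustrative)."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

class KVTransferThread:
    """Sketch of an async put worker: tasks are queued and served by a
    background thread so the main loop never blocks on storage I/O."""
    def __init__(self, store):
        self.store = store
        self.tasks: queue.Queue = queue.Ueue() if False else queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit_put(self, key, value):
        # Returns immediately; the transfer happens asynchronously.
        self.tasks.put((key, value))

    def _run(self):
        while True:
            key, value = self.tasks.get()
            self.store.put(key, value)
            self.tasks.task_done()

store = SimpleStore()
worker = KVTransferThread(store)
worker.submit_put("block_0", b"kv-bytes")
worker.tasks.join()   # wait for the async put to complete
```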
The KV Connector methods that need to be implemented can be categorized into scheduler-side methods that are called in the V1 scheduler and worker-side methods that are called in the V1 worker, namely:

### KV Connector Scheduler-Side Methods

`get_num_new_matched_tokens`: Get the prefix cache hit count (in tokens) by looking up the KV pool.
`update_states_after_alloc`: Update KVConnector state after temporary buffer allocation.
`build_connector_meta`: Attach the connector metadata to the request object.
`request_finished`: Once a request is finished, determine whether its blocks should be freed now or will be sent asynchronously and freed later.
### Connector Worker-Side Methods

`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform the KV cache load operation that transfers KV cache from storage to the device.
`wait_for_layer_load`: Optional; wait for a layer load in the layerwise + async KV load scenario.

@@ -73,6 +80,7 @@ The KV Connector methods that need to be implemented can be categorized into sch

`get_finished`: Get requests that finished KV transfer: `done_sending` if `put` finished, `done_recving` if `get` finished.
## DFX

1. When looking up a key in the KV Pool, if the key cannot be found, there is no cache hit for that specific block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when putting a block into the KV Pool fails, we do not put further blocks (subject to change).
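The early-stop behavior in point 1 can be sketched as follows (an illustrative function, not the actual lookup client):

```python
def lookup_hit_blocks(block_keys, pool):
    """Return the number of leading blocks that hit in the pool.

    Stops at the first miss: later blocks are not looked up, because a
    prefix cache is only useful as a contiguous run from the start.
    """
    hits = 0
    for key in block_keys:
        if key not in pool:
            break  # no hit for this block; skip the remaining blocks
        hits += 1
    return hits

# lookup_hit_blocks(["a", "b", "c"], {"a": 1, "c": 3}) -> 1
```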
@@ -1,13 +1,15 @@

# Prepare inputs for model forwarding

## Purpose

Information required to perform a model forward pass:

- the inputs
- the corresponding attention metadata of the inputs
The following diagram shows what we should prepare for model inference.

```shell
          +---------------+
inputs -->|               |
          |     model     | --> output
@@ -20,8 +22,11 @@ Therefore, as long as we have these two pieces of information mentioned above, w

This document will explain **how we obtain the inputs and their corresponding attention metadata**.

## Overview

### 1. Obtain inputs

The workflow of obtaining inputs:
1. Get `token positions`: the relative position of each token within its request sequence.
2. Get `token indices`: the index of each scheduled token in the token table.

@@ -33,7 +38,9 @@ At last, these `Token IDs` are required to be fed into a model, and also, `posit

**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
### 2. Build inputs attention metadata

A model requires these attention metadata during the forward pass:

- `query start location`: start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: length of each request, including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: number of computed tokens for each request.

@@ -45,7 +52,9 @@ A model requires these attention metadata during the forward pass:

- `attention mask`: mask matrix applied to attention scores before softmax to control which tokens can attend to each other (usually a causal attention mask).
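The metadata above can be grouped into a simple container (a hedged sketch; the names are illustrative, not vLLM's actual classes):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AttnMetadataSketch:
    """Illustrative bundle of the attention metadata listed above."""
    query_start_loc: List[int]      # one entry per request, plus one extra
    seq_lens: List[int]             # computed + newly scheduled tokens
    num_computed_tokens: List[int]  # per request
    slot_mapping: List[int]         # per token: target KV cache slot
    # attention mask omitted here; it is usually a causal matrix
```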
## Before start

There are mainly three types of variables.

- token level: represents one attribute of each scheduled token, so the length of this variable is the number of scheduled tokens
- request level: represents one attribute of each scheduled request, so its length is usually the number of scheduled requests (`query start location` is a special case, which has one more element)
- system level:
@@ -55,10 +64,11 @@ There are mainly three types of variables.

**Note**: Both of these tables come from the `_update_states` method that runs before **preparing inputs**. You can take a look if you need more inspiration.

### Tips

Simply put, a `token ID` is an **integer** (usually `int32`) that represents a token.
Example of `Token ID`:
```shell
| Token ID     | Token         |
|--------------|---------------|
| 0            | [PAD]         |
@@ -76,19 +86,24 @@ Example of `Token ID`:
```

## Go through details

Assumptions:

- maximum number of tokens that can be scheduled at once: 10
- `block size`: 2
- 3 requests are scheduled in total; their prompt lengths are 3, 2, and 8 respectively
- `max model length`: 12 (the maximum token count a model can handle in one request sequence)

These assumptions are configured when starting vLLM. They are not fixed, so you can set them manually.
### Step 1: All requests in the prefill phase

#### Obtain inputs

As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.

##### 1. Get token positions

First, determine which request each token belongs to: tokens 0–2 are assigned to **request_0**, tokens 3–4 to **request_1**, and tokens 5–9 to **request_2**. To represent this mapping, we use `request indices`, for example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`.
For each request, add **the number of computed tokens** to **the relative position of each scheduled token** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1, ..., 0 + 4]`) and then concatenate the results (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).

@@ -97,13 +112,15 @@ Note: there is more efficient way (using `request indices`) to create positions

Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
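The computation above can be sketched with NumPy, using `request indices` as the note suggests (hypothetical variable names; the real implementation in vLLM differs):

```python
import numpy as np

# Per-request scheduled/computed token counts from the example above.
num_scheduled = np.array([3, 2, 5])
num_computed = np.array([0, 0, 0])

# Which request each scheduled token belongs to.
request_indices = np.repeat(np.arange(len(num_scheduled)), num_scheduled)

# Relative position of each token within this step, per request.
starts = np.cumsum(num_scheduled) - num_scheduled            # [0, 3, 5]
rel_positions = np.arange(num_scheduled.sum()) - starts[request_indices]

# token positions = computed tokens so far + relative position.
positions = num_computed[request_indices] + rel_positions
# positions -> [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]
```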
##### 2. Get token indices

The shape of the current **Token IDs table** is `(max num request, max model len)`.

Why are `T_3_5`, `T_3_6`, `T_3_7` in this table without being scheduled?

- We fill all Token IDs of one request sequence into this table at once, but we only retrieve the tokens scheduled this time. The remaining Token IDs are retrieved next time.
```shell
| T_0_0 | T_0_1 | T_0_2 | ?     | ?     | ?     | ?     | ?     | ? | ? | ? | ? |
| T_1_0 | T_1_1 | ?     | ?     | ?     | ?     | ?     | ?     | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
@@ -120,19 +137,22 @@ Let's say `M = max model len`. Then we can use `token positions` together with `

So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
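In code form, a sketch mirroring the formula above:

```python
import numpy as np

M = 12  # max model len
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])

# Each token's flat index into the (max num request, max model len) table.
token_indices = positions + request_indices * M
# token_indices -> [0, 1, 2, 12, 13, 24, 25, 26, 27, 28]
```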
##### 3. Retrieve the Token IDs

We use `token indices` to select the corresponding `Input IDs` from the token table. The pseudocode is as follows:

```shell
input_ids = token_table[token_indices]
```

As mentioned before, we refer to these `Token IDs` as `Input IDs`.

- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, T_3_3, T_3_4]`
#### Build inputs attention metadata

In the current **Block Table**, we use the first block (i.e. block_0) to mark unused entries. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.

```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 0 | 0 | 0 |
@@ -144,13 +164,14 @@ In the current **Block Table**, we use the first block (i.e. block_0) to mark th

The KV cache block in the device memory is like:

```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```

Let's say `K = max model len / block size = 6`; then we can get each token's `device block number`.

The workflow of obtaining the slot mapping:

1. Get `block table indices` using `K`, `positions` and `request indices`.

   Purpose: for each token, it can be used to select the `device block number` from the `block table`.
@@ -168,6 +189,7 @@ The workflow of achieving slot mapping:

   Purpose: we can use `slot mapping` to store Token IDs into token slots.

Details:

1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. It equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]` (using integer division). These indices select the `device block number` from the `block table`.
2. (**Token level**) Use `block table indices` to select the `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`.
3. (**Token level**) `block offsets` can be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
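The steps above can be reproduced with NumPy (a sketch with illustrative names; the final line assumes the common formula `slot = block number * block size + block offset`, which matches the step-2 slot mapping values later in this document):

```python
import numpy as np

K = 6            # max model len / block size
block_size = 2
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
block_table = np.array([
    [1, 2, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0],
    [4, 5, 6, 0, 0, 0],
])

# Step 1: flat index of each token's block within the block table.
block_table_indices = request_indices * K + positions // block_size

# Step 2: look up the device block number for each token.
block_numbers = block_table.flatten()[block_table_indices]

# Step 3: offset of each token within its block.
block_offsets = positions % block_size

# Assumed final step: combine block number and offset into a slot.
slot_mapping = block_numbers * block_size + block_offsets
# block_numbers -> [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]
```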
@@ -185,9 +207,11 @@ Details:

- `attention mask`: For all requests that initiate a prefill process, we create only one mask matrix for reuse across different requests. The shape of this mask matrix is `5 * 5`:

### Step 2: Chunked prefill

In Step 2, we no longer provide explanations or perform calculations; instead, we directly present the final result.

#### Obtain inputs

Scheduled tokens of each request: `{'0': 1, '1': 1, '2': 3}`

1. `request indices`: `[0, 1, 2, 2, 2]`
@@ -195,7 +219,7 @@ Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`

Current **Token IDs table**:

```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ?     | ?     | ?     | ?     | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ?     | ?     | ?     | ?     | ?     | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
@@ -211,11 +235,12 @@ Current **Token IDs table**:

4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`

#### Build inputs attention metadata

We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, as they need more space on the device to store the KV cache produced by token generation or chunked prefill.

Current **Block Table**:

```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 7 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 8 | 0 | 0 |
@@ -227,7 +252,7 @@ Current **Block Table**:

KV cache block in the device memory:

```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```

@@ -237,6 +262,7 @@ KV cache block in the device memory:

4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`
Scheduled token count: `[1, 1, 3]`

- `query start location`: `[0, 1, 2, 5]`
- `sequence length`: `[4, 3, 8]`

@@ -254,6 +280,7 @@ Scheduled token count:`[1, 1, 3]`

Each token has a `1 * 8` vector, and there are 5 scheduled tokens.
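These step-2 values can be re-derived from the scheduled and computed token counts (a sketch with illustrative names):

```python
import numpy as np

num_scheduled = np.array([1, 1, 3])   # tokens scheduled in step 2
num_computed = np.array([3, 2, 5])    # tokens already computed in step 1

# Cumulative start/end of each request within the scheduled batch.
query_start_loc = np.concatenate(([0], np.cumsum(num_scheduled)))

# Total sequence length = computed + newly scheduled tokens.
seq_lens = num_computed + num_scheduled
# query_start_loc -> [0, 1, 2, 5]; seq_lens -> [4, 3, 8]
```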
## At last

If you understand step 1 and step 2, you will understand all the following steps.

Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, contributions are welcome.
@@ -20,6 +20,7 @@ Its main objective is to eliminate duplicated storage of the KV cache by shardin

DCP primarily influences the Decode logic, as well as the logic for chunked prefill and cached prefill.

## How to Use CP?

Please refer to the [context parallel user guide](../../user_guide/feature_guide/context_parallel.md) for detailed information.

## How It Works?
@@ -15,6 +15,7 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**

## Usage

vLLM Ascend currently supports two types of connectors for handling KV cache management:

- **MooncakeConnector**: D nodes pull KV cache from P nodes.
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
@@ -35,7 +36,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec

![mooncake_connector](./images/mooncake_connector.png)
![mooncake_layerwise_connector](./images/mooncake_layerwise_connector.png)

#### Mooncake Connector

1. The request is sent to the Proxy’s `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
@@ -43,7 +44,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec

4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.

#### Mooncake Layerwise Connector

1. The request is sent to the Proxy’s `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
@@ -55,6 +56,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec

### 3. Interface Design

Taking MooncakeConnector as an example, the system is organized into three primary classes:

- **MooncakeConnector**: Base class that provides core interfaces.
- **MooncakeConnectorScheduler**: Interface for scheduling the connectors within the engine core, responsible for managing KV cache transfer requirements and completion.
- **MooncakeConnectorWorker**: Interface for managing KV cache registration and transfer in worker processes.
@@ -1,18 +1,22 @@
|
||||
# Expert Parallelism Load Balancer (EPLB)
|
||||
|
||||
## Why We Need EPLB?
|
||||
|
||||
When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.
|
||||
|
||||
To facilitate reproduction and deployment, Vllm Ascend supported deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.
|
||||
|
||||

|
||||
|
||||
## How to Use EPLB?
|
||||
|
||||
Please refer to the EPLB section of the user guide for detailed information: [How to Use EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)
## How It Works?

**EPLB Module Architecture**

```shell
vllm_ascend
├── eplb
│   ├── adaptor
@@ -35,6 +39,7 @@ vllm_ascend

**1. Adaptor Module**
*Handles registration and adaptation for different MoE model types*

- `abstract_adaptor.py`
  Abstract base class defining unified registration interfaces for EPLB adapters
- `vllm_adaptor.py`
@@ -42,17 +47,18 @@ vllm_ascend

**2. Core Module**
*Implements core algorithms, updates, and asynchronous processing*

- **Policy Submodule**
  *Load balancing algorithms with factory pattern instantiation*
  - `policy_abstract.py`
    Abstract class for load balancing strategy interfaces
  - `policy_dynamic_ep.py`
    Default implementation of the open-source EPLB paper algorithm
  - `policy_dynamic_ep_v2.py`
    Enhanced version optimizing expert swaps for low-bandwidth devices (e.g., A2)
  - `policy_flashlb.py`
    Threshold-based adjustment reducing operational costs through layer-wise fluctuation detection
  - `policy_factory.py`
    Strategy factory for automatic algorithm instantiation

- `eplb_device_transfer_loader.py`
@@ -63,12 +69,14 @@ vllm_ascend

  Asynchronous algorithm orchestration and result processing

**3. System Components**

- `eplb_updator.py`
  Central coordinator for load balancing during inference workflows
- `utils.py`
  General utilities for EPLB interface registration
@@ -76,14 +84,19 @@ vllm_ascend
### Default Algorithm

#### Hierarchical Load Balancing

When the number of server nodes evenly divides the number of expert groups, we use the hierarchical load balancing policy to leverage group-limited expert routing. We first pack the expert groups onto nodes evenly, ensuring balanced loads across different nodes. Then, we replicate the experts within each node. Finally, we pack the replicated experts onto individual NPUs to ensure load balancing across them. The hierarchical load balancing policy can be used in the prefilling stage with a smaller expert-parallel size.

#### Global Load Balancing

In other cases, we use the global load balancing policy, which replicates experts globally regardless of expert groups, and packs the replicated experts onto individual NPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.
### Add a New EPLB Policy

If you want to add a new EPLB policy to vllm_ascend, follow these steps:

1. Inherit the `EplbPolicy` abstract class of `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters (`current_expert_table`, `expert_workload`) and the return value (`newplacement`) consistent.
   For example:

@@ -113,6 +126,7 @@ class RandomLoadBalance(EplbPolicy):

2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
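A minimal sketch of step 2 (illustrative class names, method signature, and policy id; see `policy_factory.py` for the real factory):

```python
class EplbPolicy:
    """Illustrative stand-in for the abstract class in policy_abstract.py."""
    def rebalance_experts(self, current_expert_table, expert_workload):
        raise NotImplementedError

class MyNewPolicy(EplbPolicy):
    """Hypothetical new policy: keeps the current placement unchanged."""
    def rebalance_experts(self, current_expert_table, expert_workload):
        return current_expert_table

class PolicyFactory:
    # Map each policy type id to its implementation class.
    _policies = {
        99: MyNewPolicy,  # hypothetical id for the new policy
    }

    @classmethod
    def generate_policy(cls, policy_type: int) -> EplbPolicy:
        return cls._policies[policy_type]()
```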
### Add a New MoE Model

**Implementation Guide for Model Integration**

1. **Adapter File Modification**

@@ -154,12 +168,17 @@ class RandomLoadBalance(EplbPolicy):

- Benchmark against baseline implementations (e.g., Qwen3-MoE)
*Key Implementation Notes:*

- Preserve existing interface contracts in abstract classes
- Use decorators for non-intrusive patch integration
- Leverage `eplb_utils.py` for shared expert mapping operations

## DFX

### Parameter Validation

#### Integer Parameters

All integer input parameters must explicitly specify their maximum and minimum values and be subject to valid-value validation. For example, `num_iterations_eplb_update` must be greater than 0:
```python
@@ -176,6 +195,7 @@ All integer input parameters must explicitly specify their maximum and minimum v
```

#### File Path

The EPLB file path must be validated, e.g., whether the path is valid and whether it has appropriate read and write permissions. For example:

```python
@@ -203,20 +223,27 @@ The file path for EPLB must be checked for legality, such as whether the file pa
```
### Function Specifications

#### Initialization Function

All EPLB parameters must be initialized by default during initialization, with specified parameter types and default values for proper handling.

#### General Functions

All method arguments must specify parameter types and default values, and functions must include default return value handling for default arguments. It is recommended to wrap the function body in `try-except` blocks, specifying the type of exception captured and the failure handling (e.g., logging exceptions or returning a failure status).

### Consistency

#### Expert Map

The expert map must be globally unique during initialization and update. In a multi-node scenario, distributed communication should be used during initialization to verify the consistency of expert maps across ranks. If they are inconsistent, the user should be notified which ranks have inconsistent maps.
During the update process, if only a few layers or the expert table of a certain rank has changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency.
#### Expert Weight

When updating expert weights, ensure that the memory allocated for the old expert weights has been released, or that the old expert is no longer in use.

## Limitation

Before using EPLB, add `export DYNAMIC_EPLB="true"` to the startup script.
Before collecting load data (or performance data), add `export EXPERT_MAP_RECORD="true"` to the startup script.
@@ -16,7 +16,7 @@ We should keep in mind that Patch is not the best way to make vLLM Ascend compat

In `vllm_ascend/patch`, you can see the code structure as follows:

```shell
vllm_ascend
├── patch
│   ├── platform
@@ -27,10 +27,10 @@ vllm_ascend
```

- **platform**: The patch code in this directory is for patching the code in the vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
  - For online mode, the vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the CLI args.
  - For offline mode, the vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in the vLLM worker process. It's called by `vllm_ascend/worker/worker::NPUWorker::__init__` when the vLLM worker process is initialized.
  - For both online and offline mode, the vLLM engine core process calls the worker patch in `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
## How to write a patch

@@ -54,7 +54,7 @@ Before writing a patch, following the principle above, we should patch the least

5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_distributed` into `vllm_ascend/patch/platform/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:

```python
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `<The target patch module in vLLM>`
@@ -71,5 +71,6 @@ Before writing a patch, following the principle above, we should patch the least

7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in the [test guide](../contribution/testing.md).

## Limitation

1. In the V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process, and the Worker process. Currently, vLLM Ascend can only patch the code in the Main process and Worker process by default. If you want to patch the code running in the EngineCore process, you should patch the EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`, and override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. You can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, and then the patch for that version should work.