[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
This commit is contained in:
herizhen
2026-04-09 15:37:57 +08:00
committed by GitHub
parent c40a387f63
commit 0d1424d81a
71 changed files with 1295 additions and 1296 deletions


@@ -0,0 +1,103 @@
# ACL Graph
## Why do we need ACL Graph?
In LLM inference, each token requires nearly a thousand operator executions. When the host launches operators more slowly than the device executes them, the workload becomes host-bound. In severe cases, the device sits idle for more than half of the time. To solve this problem, we use graph mode in LLM inference.
```shell
eager mode:
host: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |
device: | run op1 |free| run op2 |free| run op3 |free| run op4 |free| run op5 |
| <----- total time -----> |
graph mode:
host: | launch graph |
device: | run op1 | run op2 | run op3 | run op4 | run op5 |
| <----- total time -----> |
```
## How to use ACL Graph?
ACL Graph is enabled by default in the V1 Engine; you just need to make sure that `enforce_eager` is not set to `True`. For more details, see the [Graph Mode Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html).
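For example, with the offline `LLM` API, graph mode stays on as long as `enforce_eager` is left at its default (the model name below is only a placeholder):
```python
from vllm import LLM, SamplingParams

# ACL Graph (graph mode) is used by default in the V1 engine;
# passing enforce_eager=True would fall back to eager mode.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```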
## How it works?
In short, graph mode works in two steps: **capture and replay**. When the engine starts, we capture all of the ops in the model forward pass and save them as a graph. When a request comes in, we just replay the graph on the device and wait for the result.
But in reality, graph mode is not that simple.
### Padding and Bucketing
Because a graph can only replay the ops captured before, without re-tiling or re-checking the graph input, we need to keep the graph input consistent. However, the shape of the model input depends on the requests scheduled by the Scheduler, so consistency is not guaranteed.
Obviously, we could solve this problem by capturing the biggest shape and padding all of the model inputs to it, but this would introduce a lot of redundant computation and hurt performance. Instead, we capture multiple graphs with different shapes and pad each model input to the nearest graph, which greatly reduces the redundant computation. However, when `max_num_batched_tokens` is very large, the number of graphs that need to be captured also becomes very large. We know that when the input tensor's shape is large, the computation time is long anyway, and graph mode is not necessary in that case. So what we need to do is:
1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold;
```shell
| graph1 |
| graph2 |
| graph3 |
| graph4 | # the threshold
| input1 | pad | # use graph1
| input2 | # don't need pad
| input3 | pad | # use graph4
| input4 | # use eager mode
```
### Piecewise and Full graph
Due to the increasing complexity of the attention layer in current LLMs, we can't guarantee that every type of attention can run in a graph. In MLA, prefill tokens and decode tokens use different calculation methods, so when a batch contains both prefills and decodes, it is difficult for graph mode to handle the situation.
vLLM solves this problem with piecewise graph mode: we use eager mode to launch the attention ops and use a graph for everything else. But this also brings a problem: the cost of launching ops grows again. Although it is much smaller than in eager mode, it can still lead to a host-bound workload when the CPU is weak or `num_tokens` is small.
Altogether, we need to support both piecewise and full graph mode.
1. When attention can run in graph, we tend to choose full graph mode to achieve optimal performance;
2. When full graph does not work, use piecewise graph as a substitute;
3. When the piecewise graph's performance is not good and full graph mode is blocked, separate prefills and decodes, and use full graph mode in **decode_only** situations, because when a batch includes prefill requests, `num_tokens` is usually large enough that the workload does not become host-bound.
> Currently, due to the stream resource constraint, we can only support a few buckets in piecewise graph mode, which causes redundant computation and may lead to performance degradation compared with eager mode.
## How is it implemented?
vLLM has already implemented most of the modules in graph mode. You can see more details at: [CUDA Graphs](https://docs.vllm.ai/en/latest/design/cuda_graphs.html)
When in graph mode, vLLM calls `current_platform.get_static_graph_wrapper_cls` to get the current device's graph mode wrapper, so what we need to do is implement the graph mode wrapper for Ascend: `ACLGraphWrapper`.
vLLM has added the `support_torch_compile` decorator to all models. This decorator replaces the `__init__` and `forward` interfaces of the model class. When `forward` is called, the code inside `ACLGraphWrapper` is executed, and it performs capture or replay as described above.
When using the piecewise graph, we just need to follow the process described above. But in full graph mode, due to the complexity of attention, sometimes we need to update the attention op's params before execution, so we implement the `update_attn_params` and `update_mla_attn_params` functions for full graph mode. During the forward pass, memory is reused between different ops, so we can't update the attention op's params before the forward pass starts. In ACL Graph, we use `torch.npu.graph_task_update_begin` and `torch.npu.graph_task_update_end` to perform the update, and use `torch.npu.ExternalEvent` to enforce ordering between param updates and op executions.
## DFX
### Stream resource constraint
Currently, we can capture at most 1800 graphs, due to the limitation that each ACL graph requires at least one dedicated stream. This number is bounded by the total number of streams, which is 2048; we reserve 248 streams as a buffer. Besides, several factors affect the number of buckets:
+ The piecewise graph divides the model into `num_hidden_layers + 1` sub-modules, split at the attention layers. Each sub-module is a separate graph that consumes a stream, so the number of buckets in piecewise graph mode is very tight compared with full graph mode.
+ The number of streams required by a graph is related to the number of communication domains: each communication domain adds one stream to a graph's consumption.
+ When multi-stream is explicitly used inside a sub-module, it consumes an additional stream.
There are some other rules about ACL Graph and streams. Currently, we use the `update_aclgraph_sizes` function to calculate the maximum number of buckets and update `graph_batch_sizes` to ensure the stream resource is sufficient.
We will relax the stream resource limitation in the future.
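As a rough, illustrative model of this constraint (the exact accounting lives in `update_aclgraph_sizes` and may differ):
```python
TOTAL_STREAMS = 2048
RESERVED_STREAMS = 248                            # kept as a buffer
MAX_GRAPHS = TOTAL_STREAMS - RESERVED_STREAMS     # 1800

def max_buckets(num_hidden_layers: int, streams_per_graph: int, piecewise: bool) -> int:
    # Piecewise mode splits the model into (num_hidden_layers + 1) sub-module graphs,
    # so every bucket costs that many graphs instead of one; streams_per_graph
    # accounts for extra streams from comm domains and explicit multi-stream use.
    graphs_per_bucket = (num_hidden_layers + 1) if piecewise else 1
    return MAX_GRAPHS // (graphs_per_bucket * streams_per_graph)

print(max_buckets(num_hidden_layers=32, streams_per_graph=1, piecewise=True))   # 54
print(max_buckets(num_hidden_layers=32, streams_per_graph=1, piecewise=False))  # 1800
```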
## Limitations
1. `FULL` and `FULL_AND_PIECEWISE` are not supported yet;
2. When using ACL Graph with MTP and `num_speculative_tokens > 1`, vLLM does not support this case in v0.11.0, so we need to set `cudagraph_capture_sizes` explicitly;
3. `use_inductor` is not supported yet.


@@ -0,0 +1,91 @@
# KV Cache Pool
## Why KV Cache Pool?
Prefix caching is an important feature in LLM inference that can reduce prefill computation time drastically.
However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses HBM for KV cache storage.
Hence, the KV Cache Pool is proposed to utilize various types of storage, including HBM, DRAM, and SSD, as a pool for KV cache storage, while making request prefixes visible across all nodes and thus increasing the cache hit rate for all requests.
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines.
While one can utilize Mooncake Store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports Mooncake Store and can utilize the data transfer strategy that best fits Huawei NPU hardware.
Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
## Usage
vLLM Ascend currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
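A minimal sketch of what that configuration might look like with the offline API, assuming the `KVTransferConfig` style used elsewhere in vLLM; the values are illustrative, and the linked guide has the authoritative configuration, including the Mooncake Store endpoint settings:
```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Illustrative only: select MooncakeStoreConnector so that KV blocks can be
# put into / fetched from the Mooncake-backed KV Cache Pool.
ktc = KVTransferConfig(
    kv_connector="MooncakeStoreConnector",
    kv_role="kv_both",  # this instance both saves and loads KV cache
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
          kv_transfer_config=ktc)
```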
## How it works?
The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture.
Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
### 1. Combining KV Cache Pool with HBM Prefix Caching
Prefix Caching with HBM is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
The user can enable both features simply by leaving Prefix Caching enabled (it is on by default in vLLM V1 unless the `--no_enable_prefix_caching` flag is set) and configuring the KV Connector for the KV Pool (e.g., the MooncakeStoreConnector).
**Workflow**:
1. The engine first checks for prefix hits in the HBM cache.
2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from HBM to minimize the data transfer latency.
3. After the KV Caches in the KV Pool are loaded into HBM, the remaining process is the same as Prefix Caching in HBM.
### 2. Combining KV Cache Pool with Mooncake PD Disaggregation
When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cache Pool can further decouple prefill and decode stages across devices or nodes.
Currently, we only perform KV Pool put and get operations on **Prefill Nodes**; Decode Nodes get their KV Cache from the Mooncake P2P KV Connector, i.e., MooncakeConnector.
The key benefit is that Prefill Nodes keep the performance gain of computing less thanks to Prefix Caching from HBM and the KV Pool, while the data transfer efficiency between Prefill and Decode nodes is not sacrificed, since the P2P KV Connector transfers KV Caches between NPU devices directly.
To enable this feature, we need to set up both Mooncake Connector and Mooncake Store Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
For details, please also refer to the Mooncake Connector Store Deployment Guide.
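A rough sketch of the prefill-node configuration under this combination; the exact extra-config schema is described in the deployment guide, so treat the snippet as illustrative:
```python
from vllm.config import KVTransferConfig

# Illustrative prefill-node setup: MultiConnector dispatches in order to the P2P
# MooncakeConnector (PD transfer) and MooncakeStoreConnector (KV Cache Pool).
ktc = KVTransferConfig(
    kv_connector="MultiConnector",
    kv_role="kv_producer",
    kv_connector_extra_config={
        "connectors": [
            {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},
            {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"},
        ]
    },
)
```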
## How is MooncakeStoreConnectorV1 Implemented?
**MooncakeStoreConnectorV1** inherits from the KV Connector V1 base class in vLLM V1: by implementing the required methods defined in the KV Connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.
MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes as well as other hashing related designs. On top of this, we have also added our own design including `KVTransferThread` that allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimization such as removing the `LocalBuffer` in LMCache to remove redundant data transfer.
The KV Connector methods that need to be implemented can be categorized into scheduler-side methods that are called in V1 scheduler and worker-side methods that are called in V1 worker, namely:
### KV Connector Scheduler-Side Methods
- `get_num_new_matched_tokens`: Get the prefix cache hit count (in tokens) by looking up the KV pool.
- `update_states_after_alloc`: Update the KV Connector state after the temporary buffer allocation.
- `build_connector_meta`: Attach the connector metadata to the request object.
- `request_finished`: Once a request is finished, determine whether its blocks should be freed now or will be sent asynchronously and freed later.
### Connector Worker-Side Methods
- `register_kv_caches`: Register the KV cache buffers needed for KV cache transfer.
- `start_load_kv`: Perform the KV cache load operation that transfers KV cache from storage to the device.
- `wait_for_layer_load`: Optional; wait for a layer load in the layerwise + async KV load scenario.
- `save_kv_layer`: Optional; perform a layerwise KV cache put into the KV Pool.
- `wait_for_save`: Wait for the KV save to finish when KV cache save/put is asynchronous.
- `get_finished`: Get the requests that finished KV transfer: `done_sending` if a `put` finished, `done_receiving` if a `get` finished.
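Putting the two sides together, a minimal skeleton of such a connector might look like the following. The method names follow the lists above; the base-class import path and the signatures are simplified for illustration and are not the exact vLLM interface:
```python
# Simplified skeleton for illustration only: the real connector inherits from the
# vLLM KV Connector V1 base class, and the actual signatures carry scheduler/worker
# specific arguments (requests, block ids, forward context, KV cache tensors, ...).
class MyKVPoolConnector:
    # ---- scheduler-side ----
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        """Look up the KV pool and report how many extra tokens hit the cache."""
        return 0, False

    def update_states_after_alloc(self, request, blocks, num_external_tokens):
        """Record which blocks were allocated for the externally cached tokens."""

    def build_connector_meta(self, scheduler_output):
        """Attach connector metadata (what to load/save) to the request batch."""

    def request_finished(self, request, block_ids):
        """Decide whether blocks can be freed now or only after an async save."""
        return False, None

    # ---- worker-side ----
    def register_kv_caches(self, kv_caches):
        """Register the device KV cache buffers used for transfers."""

    def start_load_kv(self, forward_context):
        """Start transferring KV cache from the pool into device memory."""

    def wait_for_layer_load(self, layer_name):
        """Optional: wait for one layer in the layerwise + async load scenario."""

    def save_kv_layer(self, layer_name, kv_layer):
        """Optional: put one layer's KV cache into the KV pool."""

    def wait_for_save(self):
        """Block until asynchronous puts into the pool have completed."""

    def get_finished(self, finished_req_ids):
        """Report requests whose async sends (puts) and receives (gets) finished."""
        return set(), set()
```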
## DFX
1. When looking up a key in KV Pool, if we cannot find the key, there is no Cache Hit for this specific block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when we are trying to put a block into KV Pool and it fails, we do not put further blocks (subject to change).
## Limitations
1. Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
2. For now, if we successfully looked up a key and found that it exists, but failed to get it when calling the KV Pool's get function, we just output a log indicating that the get operation failed and keep going; hence, the accuracy of that specific request may be affected. We will handle this situation by falling back the request and re-computing everything assuming there is no prefix cache hit (or even better, reverting only one block and keeping the Prefix Caches before it).


@@ -0,0 +1,286 @@
# Prepare inputs for model forwarding
## Purpose
Two pieces of information are required to perform a model forward pass:
- the inputs
- the corresponding attention metadata of the inputs
The following diagram shows what we should prepare for model inference.
```shell
+---------------+
inputs --> | |
| model | --> output
attn_meta --> | |
+---------------+
```
Therefore, as long as we have these two pieces of information mentioned above, we can perform the model's forward propagation.
This document will explain **how we obtain the inputs and their corresponding attention metadata**.
## Overview
### 1. Obtain inputs
The workflow of obtaining inputs:
1. Get `token positions`: relative position of each token within its request sequence.
2. Get `token indices`: index of each scheduled token in the token table.
3. Get `Token IDs`: use the token indices to retrieve the Token IDs from the **token ID table**.
Finally, these `Token IDs` are fed into the model, and the `positions` are also sent into the model to create the `RoPE` (rotary positional embedding). Both of them are the inputs of the model.
**Note**: The `Token IDs` are the inputs of the model, so we also call them `Input IDs`.
### 2. Build inputs attention metadata
A model requires these attention metadata during the forward pass:
- `query start location`: start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: length of each request including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: number of computed tokens for each request.
- `number of requests`: number of requests in this batch.
- `number of tokens`: total number of scheduled tokens in this batch.
- **`block table`**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory.
- `max query len`: the largest number of scheduled tokens for any request in this batch.
- `slot mapping`: for each scheduled token, the index of the KV cache slot it will be stored into.
- `attention mask`: mask matrix applied to attention scores before softmax to control which tokens can attend to each other (usually a causal attention).
## Before we start
There are mainly three types of variables.
- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens.
- request level: represents one attribute of each scheduled request, whose length is usually the number of scheduled requests. (`query start location` is a special case, which has one more element.)
- system level:
1. **Token IDs table**: stores the token IDs (i.e. the inputs of a model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is the maximum number of concurrent requests allowed in a forward batch, and `max model len` is the maximum number of tokens that one request sequence can hold in this model.
2. **Block table**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory. The shape of this table is `(max num request, max model len / block size)`
**Note**: Both of these tables come from the `_update_states` method, which runs before **preparing inputs**. You can take a look at it if you need more details.
### Tips
Simply put, a `token ID` is an **integer** (usually `int32`), which represents a token.
Example of `Token ID`:
```shell
| Token ID | Token |
|--------------|---------------|
| 0 | [PAD] |
| 1 | <|endoftext|> |
| 2 | <|start|> |
| 3 | [SEP] |
| 4 | I |
| 5 | the |
| 6 | be |
| 7 | of |
| 8 | and |
| ... | ... |
| ... | ... |
| vocab_size-1 | <|im_end|> |
```
## Go through details
Assumptions:
- maximum number of tokens that can be scheduled at once: 10
- `block size`: 2
- In total, 3 requests are scheduled. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the maximum number of tokens that one request sequence can hold in the model).
These assumptions are configured when starting vLLM. They are not fixed, so you can set them manually.
### Step 1: All requests in the prefill phase
#### Obtain inputs
As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.
##### 1. Get token positions
First, determine which request each token belongs to: tokens 0-2 are assigned to **request_0**, tokens 3-4 to **request_1**, and tokens 5-9 to **request_2**. To represent this mapping, we use `request indices`, for example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`.
For each request, use **the number of computed tokens** + **the relative position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`) and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).
Note: there is a more efficient way (using `request indices`) to create positions in actual code.
Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
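One way to build `token positions` from `request indices`, as hinted above (a sketch with NumPy; variable names are illustrative):
```python
import numpy as np

num_scheduled_tokens = np.array([3, 2, 5])   # tokens scheduled per request
num_computed_tokens = np.array([0, 0, 0])    # nothing computed yet in Step 1

# request index of every scheduled token: [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
request_indices = np.repeat(np.arange(len(num_scheduled_tokens)), num_scheduled_tokens)

# relative offset of each token inside its request: [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]
offsets = np.arange(num_scheduled_tokens.sum()) - np.repeat(
    np.cumsum(num_scheduled_tokens) - num_scheduled_tokens, num_scheduled_tokens)

positions = num_computed_tokens[request_indices] + offsets
print(positions)  # [0 1 2 0 1 0 1 2 3 4]
```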
##### 2. Get token indices
The shape of the current **Token IDs table** is `(max num request, max model len)`.
Why are `T_2_5`, `T_2_6`, `T_2_7` in this table even though they were not scheduled?
- We fill all Token IDs of a request sequence into this table at once, but we only retrieve the tokens scheduled this time. The remaining Token IDs are retrieved later.
```shell
| T_0_0 | T_0_1 | T_0_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_2_2 | T_2_3 | T_2_4 | T_2_5 | T_2_6 | T_2_7 | ? | ? | ? | ? |
| ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
......
......
......
```
Note that `T_x_x` is an `int32`.
Let's say `M = max model len`. Then we can use `token positions` together with `request indices` of each token to construct `token indices`.
So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
##### 3. Retrieve the Token IDs
We use `token indices` to select out the corresponding `Input IDs` from the token table. The pseudocode is as follows:
```python
input_ids = token_table[token_indices]
```
As mentioned before, we refer to these `Token IDs` as `Input IDs`.
- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_2_2, T_2_3, T_2_4]`
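Continuing the sketch above, the token indices and the gather from the Token IDs table could look like this (a toy table is used so the snippet stays self-contained):
```python
import numpy as np

M = 12  # max model len
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])

token_indices = positions + request_indices * M
print(token_indices)  # [ 0  1  2 12 13 24 25 26 27 28]

# the token table has shape (max num request, max model len); flatten and gather
token_table = np.zeros((4, M), dtype=np.int32)      # toy table, real token IDs omitted
input_ids = token_table.reshape(-1)[token_indices]  # the model's Input IDs
```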
#### Build inputs attention metadata
In the current **Block Table**, we use the first block (i.e. block_0) to mark unused entries. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.
```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
......
......
......
```
The KV cache block in the device memory is like:
```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
Let's say `K = max model len / block size = 6`; then we can compute each token's `device block number`.
The workflow for obtaining the slot mapping:
1. Get `block table indices` using `K`, `positions` and `request indices`.
Purpose: for each token, this index is used to select its `device block number` from the `block table`.
2. Get `device block number` using `block table indices`.
Purpose: `device block number` indicates which device block each token belongs to.
3. Get `block offsets` using `positions` and `block size`.
Purpose: `block offsets` indicates the offsets of each token within a block.
4. Construct `slot mapping` using `device block number` and `block offsets`.
Purpose: we can use `slot mapping` to store each token's KV cache into its slot.
Details:
1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. So it equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to select `device block number` from `block table`.
2. (**Token level**) Use `block table indices` to select out `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number=[1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`
3. (**Token level**) `block offsets` could be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
4. Finally, use `block offsets` and `device block number` to create `slot mapping`: `device block number * block size + block_offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`
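These four steps, written out as a short NumPy sketch over the example values (K = 6, block size = 2):
```python
import numpy as np

block_size, K = 2, 6                                  # K = max model len / block size
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
block_table = np.array([[1, 2, 0, 0, 0, 0],
                        [3, 0, 0, 0, 0, 0],
                        [4, 5, 6, 0, 0, 0],
                        [0, 0, 0, 0, 0, 0]])

block_table_indices = request_indices * K + positions // block_size
block_numbers = block_table.reshape(-1)[block_table_indices]  # device block numbers
block_offsets = positions % block_size
slot_mapping = block_numbers * block_size + block_offsets
print(slot_mapping)  # [ 2  3  4  6  7  8  9 10 11 12]
```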
As noted, the scheduled token count per request is `[3, 2, 5]`:
- (**Request level**) Use prefix sum to calculate `query start location`: `[0, 3, 5, 10]`.
- (**Request level**) All tokens in step 1 are in the prefill stage, and the computed tokens count is 0; then `sequence length` = `[3, 2, 5]`.
- (**Request level**) As mentioned above, `number of computed tokens` are all 0s: `[0, 0, 0]`.
- `number of requests`: `3`
- `number of tokens`: `10`
- `max query len`: `5`
- (**Token level**) `slot mapping`: `[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`
- `attention mask`: For all requests that initiate a prefill process, we simply create only one mask matrix for reuse across different requests. The shape of this mask matrix is `5 * 5`:
### Step 2: Chunked prefill
In Step 2, we skip the detailed explanations and calculations and directly present the final results.
#### Obtain inputs
Scheduled tokens of each request: `{'0': 1, '1': 1, '2': 3}`
1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`
Current **Token IDs table**:
```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_2_2 | T_2_3 | T_2_4 | T_2_5 | T_2_6 | T_2_7 | ? | ? | ? | ? |
| ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
......
......
......
```
**Note**: **T_0_3** and **T_1_2** are the new Token IDs of **request_0** and **request_1** respectively. They are sampled from the model's output.
3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_2_5, T_2_6, T_2_7]`
#### Build inputs attention metadata
We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, as they need more space on the device to store the KV cache produced by token generation or chunked prefill.
Current **Block Table**:
```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 7 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 8 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
......
......
......
```
KV cache block in the device memory:
```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
1. (**Token level**) `block table indices`: `[1, 7, 14, 15, 15]`
2. (**Token level**) `device block number`: `[2, 7, 6, 8, 8]`
3. (**Token level**) `block offsets`: `[1, 0, 1, 0, 1]`
4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`
Scheduled token count: `[1, 1, 3]`
- `query start location`: `[0, 1, 2, 5]`
- `sequence length`: `[4, 3, 8]`
- `number of computed tokens`: `[3, 2, 5]`
- `number of requests`: `3`
- `max query len`: `3`
- `slot mapping`: `[5, 14, 13, 16, 17]`
- `attention mask`: `5 * 8`
Each token has a `1 * 8` vector, and there are 5 scheduled tokens.
## Wrapping up
If you understand Step 1 and Step 2, you will understand all of the following steps.
We hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute.


@@ -0,0 +1,25 @@
# Adding a custom aclnn operation
This document describes how to add a custom aclnn operation to vllm-ascend.
## How do custom aclnn operations work in vllm-ascend?
Custom aclnn operations are built and installed into the `vllm_ascend/cann_ops_custom` directory during the vllm-ascend build process. The aclnn operators are then bound to the `torch.ops._C_ascend` module, enabling users to invoke them from vllm-ascend Python code.
To enable custom operations, use the following code:
```python
from vllm_ascend.utils import enable_custom_op
enable_custom_op()
```
## How to add a custom aclnn operation?
- Create a new operation folder under the `csrc` directory.
- Create `op_host` and `op_kernel` directories for the host and kernel source code.
- Add build options in `csrc/build_aclnn.sh` for the supported SoCs. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`.
- Bind the aclnn operators to the `torch.ops._C_ascend` module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` so the op can be captured into the ACL graph.
After a successful build of vllm-ascend, the custom aclnn operation can be invoked from Python code.
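For example, invoking a custom op could look like the following; the op name `my_custom_op` is a hypothetical placeholder, so substitute the name you bound in `csrc/torch_binding.cpp`:
```python
import torch
from vllm_ascend.utils import enable_custom_op

enable_custom_op()  # loads the custom op library and registers torch.ops._C_ascend

x = torch.randn(4, 8)
# Hypothetical op name for illustration; use the schema you registered in C++.
out = torch.ops._C_ascend.my_custom_op(x)
```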


@@ -0,0 +1,129 @@
# Context Parallel (CP)
TL;DR: PCP accelerates prefill via sequence splitting; DCP eliminates KV cache redundancy.
![ContextParallel](../../assets/cp/overview.png)
For the main discussions during the development process, please refer to the [RFC](https://github.com/vllm-project/vllm/issues/25749) and the relevant links referenced by or referencing this RFC.
## What is CP?
**Context Parallel (CP)** is a strategy for parallelizing computation along the sequence dimension across multiple devices.
**Prefill Context Parallel (PCP)** expands the world size of devices and uses dedicated communication domains.
Its primary goal is to partition the sequence dimension during the prefill phase, enabling different devices to compute distinct chunks of the sequence simultaneously.
The KV cache is sharded along the sequence dimension across devices.
This approach impacts the computational logic of both the Prefill and Decode stages to varying degrees.
**Decode Context Parallel (DCP)** reuses the communication domain of Tensor Parallelism (TP) and does not require additional devices.
Its main objective is to eliminate duplicated storage of the KV cache by sharding it along the sequence dimension across devices within the TP domain that would otherwise hold redundant copies.
DCP primarily influences the Decode logic, as well as the logic for chunked prefill and cached prefill.
## How to Use CP?
Please refer to the [context parallel user guide](../../user_guide/feature_guide/context_parallel.md) for detailed information.
## How It Works?
### Device Distribution
We introduce new communication domains for PCP and reuse TP for DCP, and this is the new layout of devices for PCP2, DCP2, and TP4.
![device_world](../../assets/cp/device_world.png)
### Block Table
CP performs sequence sharding on the KV cache storage. To facilitate efficient storage and access, tokens are stored in an interleaved manner across devices, with the interleaving granularity determined by `cp_kv_cache_interleave_size`, whose default value is `1` (a.k.a. 'token interleaving').
Given that PCP and DCP behave similarly for KV cache sharding, we refer to them collectively as CP. Specifically, `cp_size = pcp_size * dcp_size`, and `cp_rank = pcp_rank * dcp_size + dcp_rank`.
As illustrated, a virtual block is defined in the block table, where blocks within the same CP device group form a virtual block. The virtual block size is `virtual_block_size = block_size * cp_size`.
For any token `x`, referencing the following figure, its (virtual) block index is `x // virtual_block_size`, and the offset within the virtual block is `offset_within_virtual_block = x % virtual_block_size`.
The local block index is `local_block_index = offset_within_virtual_block // cp_kv_cache_interleave_size`, and the device number is `target_rank = local_block_index % cp_size`.
The offset within the local block is `(local_block_index // cp_size) * cp_kv_cache_interleave_size + offset_within_virtual_block % cp_kv_cache_interleave_size`.
![BlockTable](../../assets/cp/blocktable.png)
Based on the logic above, the `slot_mapping` calculation process is adjusted, and the `slot_mapping` values on each device are modified to ensure the KV cache is sharded along the sequence dimension and stored across different devices as expected.
The current implementation requires that `block_size % cp_kv_cache_interleave_size == 0`.
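A direct translation of the formulas above into a small helper (a sketch; variable names are illustrative):
```python
def cp_slot(x: int, block_size: int, cp_size: int, interleave: int):
    """Map global token index x to (virtual block index, target rank, offset in local block)."""
    virtual_block_size = block_size * cp_size
    virtual_block_index = x // virtual_block_size
    offset_within_virtual_block = x % virtual_block_size
    local_block_index = offset_within_virtual_block // interleave
    target_rank = local_block_index % cp_size
    offset_in_local_block = (local_block_index // cp_size) * interleave \
        + offset_within_virtual_block % interleave
    return virtual_block_index, target_rank, offset_in_local_block

# With block_size=4, cp_size=2 and token interleaving (interleave=1),
# consecutive tokens alternate between the two CP ranks.
for x in range(8):
    print(x, cp_slot(x, block_size=4, cp_size=2, interleave=1))
```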
### Decode Context Parallel (DCP)
As mentioned above, the primary function of DCP is to shard the KV cache along the sequence dimension for storage. Its impact lies in the logic of the decode and chunked prefill phases.
**Prefill Phase:**
As illustrated, during the Chunked Prefill computation, two distinct logic implementations are employed for MLA and GQA backends.
- In the **MLA backend**, a Context KV Cache `all_gather` operation is performed to aggregate the full KV values.
These are then used for attention computation with the Q values of the current chunk.
Note that in multi-request scenarios, the directly gathered KV results are interleaved across requests.
The `reorg_kvcache` function is used to reorganize the KV cache, ensuring that the KV cache of the same request is stored contiguously.
- In the **GQA backend**, an `all_gather` is performed along the head dimension for Q.
This is because DCP overlaps with the TP communication domain, and the Q heads within a DCP group differ.
However, they need to exchange results with the locally computed KV cache for online Softmax updates.
To ensure correctness during result updates, the Q values are synchronized across the DCP group via head-dimension `all_gather`.
During the result update process, `cp_lse_ag_out_rs` is invoked to aggregate `attn_output` and `attn_lse`, update the results, and perform a reduce-scatter operation on the outputs.
Alternatively, we can use an all-to-all communication to exchange the output and LSE results, followed by direct local updates. This approach aligns with the logic adapted for PCP compatibility.
![DCP-Prefill](../../assets/cp/dcp-prefill.png)
**Decode Phase:**
The logic during the decode phase is consistent with that of GQA's chunked prefill: an all-gather operation is first performed along the Q head dimension to ensure consistency within the DCP group.
After computing the results with the local KV cache, the results are updated via the `cp_lse_ag_out_rs` function.
![DCP-Decode](../../assets/cp/dcp-decode.png)
### Prefill Context Parallel (PCP)
**Tokens Partition in Head-Tail Style**
PCP requires splitting the input sequence and ensuring balanced computational load across devices during the prefill phase.
We employ a head-tail style for splitting and concatenation: specifically, the sequence is first padded to a multiple of `2*pcp_size`, then divided into `2*pcp_size` equal parts.
The first part is merged with the last part, the second part with the second last part, and so on, thereby assigning computationally balanced chunks to each device.
Additionally, since allgather aggregation of KV or Q results in interleaved chunks from different requests, we compute `pcp_allgather_restore_idx` to quickly restore the original order.
These logics are implemented in the function `_update_tokens_for_pcp`.
![PCP-Partition](../../assets/cp/head-tail-style.png)
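A minimal sketch of the head-tail partition (padding details omitted; the real logic lives in `_update_tokens_for_pcp`):
```python
def head_tail_split(tokens: list, pcp_size: int) -> list[list]:
    """Split a (pre-padded) sequence into pcp_size balanced chunks."""
    assert len(tokens) % (2 * pcp_size) == 0, "pad the sequence to a multiple of 2 * pcp_size first"
    part = len(tokens) // (2 * pcp_size)
    parts = [tokens[i * part:(i + 1) * part] for i in range(2 * pcp_size)]
    # rank r gets the r-th part from the head and the r-th part from the tail
    return [parts[r] + parts[2 * pcp_size - 1 - r] for r in range(pcp_size)]

print(head_tail_split(list(range(8)), pcp_size=2))
# [[0, 1, 6, 7], [2, 3, 4, 5]]
```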
**Prefill Phase:**
During the Prefill phase (excluding chunked prefill), we employ an all-gather KV approach to address the issue of incomplete sequences on individual NPUs.
It is important to note that we only aggregate the KV values for the current layer at a time, and these are discarded immediately after use, avoiding excessive peak memory usage.
This method can also be directly applied to KV cache storage (since the KV cache partitioning method differs from PCP sequence partitioning, it is inevitable that each NPU requires a complete copy of the KV values).
All attention backends maintain consistency in this logic.
Note: While a Ring Attention approach could also facilitate information exchange with lower peak memory and enable computation-communication overlap, we prioritized the all-gather KV implementation after evaluating that the development complexity was high and the benefits of overlap were limited.
![PCP-Prefill](../../assets/cp/pcp-prefill.png)
**Decode Phase:**
During the decode phase, we only need to add an allgather within the PCP group after the DCP all-to-all communication exchanges the output and LSE, before proceeding with the output update.
![PCP-Decode](../../assets/cp/pcp-decode.png)
**Chunked Prefill:**
Currently, there are three viable approaches for Chunked Prefill compatibility: **AllGatherQ**, **AllGatherKV**, and **Ring-Attn**.
Since PCP performs sequence sharding on both the query sequence and the KV cache, we need to ensure that one side has complete information or employ a method like Ring-Attn to perform computations sequentially.
The advantages and disadvantages of Ring-Attn will not be elaborated here.
We have implemented the **AllGatherQ** approach in the GQA attention backend and the **AllGatherKV** approach in the MLA attention backend.
The workflow after **AllGatherQ** is identical to the decode phase, while the workflow after **AllGatherKV** is the same as the standard prefill phase.
For details, please refer to the diagram below; specific steps will not be repeated.
One important note: **AllGatherKV** may lead to significant peak memory usage when the context length becomes excessively long.
To mitigate this, we adopt a segmented processing strategy.
By predefining the maximum amount of KV cache processed per round, we sequentially complete the attention computation and online softmax updates for each segment.
![PCP-ChunkedPrefill](../../assets/cp/chunkedprefill.png)
### Related Files
- slot_mapping computation: `vllm_ascend/worker/block_table.py`
- sequence splitting and metadata preparation: `vllm_ascend/worker/model_runner_v1.py`
- GQA backend: `vllm_ascend/attention/attention_cp.py`
- MLA backend: `vllm_ascend/attention/mla_cp.py`


@@ -0,0 +1,228 @@
# CPU Binding
## Overview
CPU binding pins vLLM Ascend worker processes and key threads to specific CPU cores to reduce CPU-NPU cross-NUMA traffic and stabilize latency under multi-process workloads. It is designed for ARM servers running Ascend NPUs and is automatically executed during worker initialization when enabled.
## Background
On multi-socket ARM systems, the OS scheduler may place vLLM threads on CPUs far from the local NPU, causing NUMA cross-traffic and jitter. CPU binding enforces a deterministic CPU placement strategy and optionally binds NPU IRQs to the same CPU pool. This is distinct from other performance features (e.g., graph mode or dynamic batch) because it is purely a host-side affinity policy and does not change model execution logic.
## Design & How it works
### Key concepts
- **Allowed CPU list**: The cpuset from `/proc/self/status` (`Cpus_allowed_list`). All allocations are constrained to this list.
- **Running NPU list**: Logical NPU IDs extracted from the `npu-smi` process listing, optionally filtered by `ASCEND_RT_VISIBLE_DEVICES`.
- **CPU pool per NPU**: The CPU list assigned to each logical NPU ID based on the binding mode.
- **Binding modes & Device behavior**:
| Device type | Default mode | Description |
| ----------- | ------------ | ------------ |
| A3 (No Affinity) | `global_slice` | Splits the allowed CPU list evenly based on the **total number of global logical NPUs**, ensuring each NPU is assigned a contiguous segment of CPU cores. This prevents CPU core overlap across multiple process groups. |
| A2 / 310P / Others | `topo_affinity` | Allocates CPUs based on NPU topology affinity (`npu-smi info -t topo`). If multiple NPUs are assigned to a single NUMA node (which may cause bandwidth contention), the CPU allocation extends to adjacent NUMA nodes. |
- **Default**: enabled (`enable_cpu_binding = true`).
- **Fallback**: If NPU topo affinity is unavailable, `global_slice` is used.
- **Failure handling**: Any exception in binding is logged as a warning and **binding is skipped for that rank**.
### Execution flow (simplified)
1. **Feature entry**: worker initialization calls `bind_cpus(local_rank)` when `enable_cpu_binding` is true.
2. **CPU architecture gate**: If the CPU is not ARM, binding is skipped with a log.
3. **Collect device info**:
- Map logical NPU IDs from `npu-smi info -m`.
- Detect running NPU IDs from the `npu-smi info` process table.
- Read the cpuset from `/proc/self/status`.
- Read topo affinity from `npu-smi info -t topo`.
4. **Build CPU pools**:
- Use **global_slice** for A3 devices; **topo_affinity** for A2 and 310P.
- If topo affinity is missing, fall back to global_slice.
- Ensure each NPU has at least 5 CPUs.
5. **Allocate per-role CPUs**:
- Reserve the first two CPUs for IRQ binding.
- `main`: pool[2:-2]
- `acl`: pool[-2]
- `release`: pool[-1]
6. **Bind threads**:
- Main process is pinned to `main` CPUs.
- ACL threads (named with acl_thread) are pinned to `acl` CPU.
- Release threads (named with release_thread) are pinned to `release` CPU.
7. **Bind NPU IRQs (optional)**:
- If /proc/irq is writable, bind SQ/CQ IRQs to the first two CPUs in the pool.
- irqbalance may be stopped to prevent overrides.
8. **Memory binding (optional)**:
- If `migratepages` is available, memory for ACL threads is migrated to the NPU's NUMA node.
## Allocation plan examples
The allocation plan is derived directly from the CPU pool per NPU and then split into roles:
- IRQ CPUs: pool[0], pool[1]
- `main`: pool[2:-2]
- `acl`: pool[-2]
- `release`: pool[-1]
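A small sketch of the slicing and role split described above (a simplified model of `global_slice`; the real implementation lives in `cpu_binding.py`):
```python
def global_slice(allowed_cpus: list[int], total_npus: int) -> list[list[int]]:
    """Split the allowed cpuset evenly into one contiguous pool per logical NPU."""
    base, extra = divmod(len(allowed_cpus), total_npus)
    pools, start = [], 0
    for npu in range(total_npus):
        size = base + (1 if npu < extra else 0)   # the remainder goes to the first NPUs
        pools.append(allowed_cpus[start:start + size])
        start += size
    return pools

def split_roles(pool: list[int]) -> dict:
    assert len(pool) >= 5, "each NPU needs at least 5 CPUs"
    return {"irq": pool[:2], "main": pool[2:-2], "acl": pool[-2], "release": pool[-1]}

pools = global_slice(list(range(17)), total_npus=3)   # matches Example 3 below
print([split_roles(p) for p in pools])
# NPU2 gets CPUs 12-16: IRQ 12-13, Main 14, ACL 15, Release 16
```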
Below are concrete examples that reflect the actual code paths.
### Example 1: A3 inference server with 640 CPUs and 16 NPUs
- allowed_cpus = [0..639] (640 CPUs)
- NUMA nodes = 0..7 (8 NUMA nodes, symmetric layout)
- total_npus = 16
- running_npu_list = [0..15]
- base = 640 // 16 = 40, extra = 0
- Each NPU gets a 40-CPU pool.
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-39|`IRQ`: 0-1, `Main`: 2-37, `ACL`: 38, `Release`: 39|
|1|40-79|`IRQ`: 40-41, `Main`: 42-77, `ACL`: 78, `Release`: 79|
|...|...|...|
|15|600-639|`IRQ`: 600-601, `Main`: 602-637, `ACL`: 638, `Release`: 639|
This layout remains deterministic even when multiple processes share the same cpuset, because slicing is based on the global logical NPU ID.
### Example 2: A3 global_slice, even split
**Inputs**:
- allowed_cpus = [0..23] (24 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..11, NUMA1 = 12..23)
- total_npus = 4 (from npu-smi info -m)
- running_npu_list = [0, 1, 2, 3]
**Global slice**:
- base = 24 // 4 = 6, extra = 0
- Each NPU gets a 6-CPU pool.
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-5|`IRQ`: 0-1, `Main`: 2-3, `ACL`: 4, `Release`: 5|
|1|6-11|`IRQ`: 6-7, `Main`: 8-9, `ACL`: 10, `Release`: 11|
|2|12-17|`IRQ`: 12-13, `Main`: 14-15, `ACL`: 16, `Release`: 17|
|3|18-23|`IRQ`: 18-19, `Main`: 20-21, `ACL`: 22, `Release`: 23|
### Example 3: A3 global_slice, remainder distribution
**Inputs**:
- allowed_cpus = [0..16] (17 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..7, NUMA1 = 8..16)
- total_npus = 3
- running_npu_list = [0, 1, 2]
**Global slice**:
- base = 17 // 3 = 5, extra = 2
- NPU0 pool size = 6 (base+1)
- NPU1 pool size = 6 (base+1)
- NPU2 pool size = 5 (base)
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-5|`IRQ`: 0-1, `Main`: 2-3, `ACL`: 4, `Release`: 5|
|1|6-11|`IRQ`: 6-7, `Main`: 8-9, `ACL`: 10, `Release`: 11|
|2|12-16|`IRQ`: 12-13, `Main`: 14, `ACL`: 15, `Release`: 16|
Note: When a pool size is exactly 5, `main` has a single CPU (pool[2]). If any pool is <5, binding raises an error.
**NUMA analysis**:
- With the symmetric NUMA layout above (NUMA0 = 0..7, NUMA1 = 8..16), NPU0 stays within NUMA0, NPU2 stays within NUMA1, but NPU1 spans both NUMA0 (6,7) and NUMA1 (8..11). This is a direct consequence of global slicing over the ordered cpuset; the remainder distribution does not enforce NUMA boundaries.
- If the cpuset numbering is interleaved across NUMA nodes (non-symmetric layout), cross-NUMA pools can happen even earlier. This is why a symmetric NUMA layout is recommended for best locality.
### Known limitations and future improvements
With the current `global_slice` strategy, some CPU/NPU layouts cannot avoid cross-NUMA pools. A future enhancement should incorporate NUMA node boundaries into the slicing logic so that pools remain within a single NUMA node whenever possible.
### Example 4: global_slice with visible subset of NPUs
**Inputs**:
- total_npus = 8 (from npu-smi info -m)
- running_npu_list = [2, 3] (filtered by ASCEND_RT_VISIBLE_DEVICES)
- allowed_cpus = [0..39] (40 CPUs)
- NUMA nodes = 0..3 (4 NUMA nodes, symmetric layout; 0..9, 10..19, 20..29, 30..39)
**Global slice**:
- base = 40 // 8 = 5, extra = 0
- Only the visible logical NPUs get pools, but slicing uses the global NPU ID so different processes do not overlap.
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|2|10-14|`IRQ`: 10-11, `Main`: 12, `ACL`: 13, `Release`: 14|
|3|15-19|`IRQ`: 15-16, `Main`: 17, `ACL`: 18, `Release`: 19|
### Example 5: A2/310P topo_affinity with NUMA extension
**Inputs**:
- npu_affinity = {0: [0..7], 1: [0..7]} (from `npu-smi info -t topo`)
- allowed_cpus = [0..15] (16 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes; NUMA0 = 0..7, NUMA1 = 8..15)
**NUMA extension**:
- Both NPUs are on NUMA0, so each pool extends to the nearest NUMA node to reduce contention.
- NPU0 extends to NUMA1 -> [0..15]
- NPU1 extends to NUMA1 -> [0..15]
Because both pools are identical, the allocator applies average distribution across NPUs to avoid overlap. With a pool [0..15] and 2 NPUs, the final pools become:
|NPU ID|Assigned CPU Cores (topo_affinity)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-7|`IRQ`: 0-1, `Main`: 2-5, `ACL`: 6, `Release`: 7|
|1|8-15|`IRQ`: 8-9, `Main`: 10-13, `ACL`: 14, `Release`: 15|
### Example 6: Minimum CPUs per NPU
**Inputs**:
- total_npus = 2
- allowed_cpus = [0..7] (8 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..3, NUMA1 = 4..7)
**Result**:
- base = 4, which is < 5, so binding fails with: "Insufficient CPUs for binding with IRQ/ACL/REL reservations..."
|NPU ID|Assigned CPU Cores|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|N/A|Binding error (insufficient CPUs per NPU)|
|1|N/A|Binding error (insufficient CPUs per NPU)|
To resolve, either reduce total_npus or enlarge the cpuset so that each NPU has at least 5 CPUs.
### Logging and verification
- Logs show the selected binding mode and the allocation plan, for example:
- `[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`
- `The CPU allocation plan is as follows: ...`
- You can verify affinity via taskset or `/proc/<pid>/status` after startup.
## Limitations & Notes
- **ARM-only**: Binding is skipped on non-ARM CPUs.
- **Minimum CPU requirement**: Each logical NPU requires at least 5 CPUs. If the cpuset is smaller, binding fails with an error.
- **NUMA symmetry assumption**: For best locality, the current strategies assume the cpuset is evenly distributed across NUMA nodes and CPU numbering aligns with NUMA layout; otherwise NUMA locality may be suboptimal.
- Example (symmetric layout): 2 NUMA nodes, 64 CPUs total. NUMA0 = CPUs 0-31, NUMA1 = CPUs 32-63, and the cpuset is 0-63. With 4 logical NPUs, global slicing yields 16 CPUs per NPU (0-15, 16-31, 32-47, 48-63), so each NPU's pool stays within a single NUMA node.
- **Runtime dependencies**:
- Requires the `npu-smi` and `lscpu` commands.
- IRQ binding requires write access to /proc/irq.
- Memory binding requires migratepages; otherwise it is skipped.
- **IRQ side effects**: irqbalance may be stopped to avoid overriding bindings.
- **Per-process behavior**: Only the current rank's NPU is used for IRQ binding to avoid cross-process overwrite.
### Debug logging
Use the standard vLLM logging configuration to enable debug logs. The binding process emits debug messages (e.g., `[cpu_global_slice] ...`) when debug level is enabled.
## References
- CPU binding implementation: vllm_ascend/cpu_binding.py (`DeviceInfo`, `CpuAlloc`, `bind_cpus`)
- Worker integration: vllm_ascend/worker/worker.py (`NPUWorker._init_device`)
- Additional config option: docs/source/user_guide/configuration/additional_config.md (`enable_cpu_binding`)
- Tests: tests/ut/device_allocator/test_cpu_binding.py


@@ -0,0 +1,105 @@
# Disaggregated-prefill
## Why disaggregated-prefill?
This feature addresses the need to optimize the **Time Per Output Token (TPOT)** and **Time To First Token (TTFT)** in large-scale inference tasks. The motivation is two-fold:
1. **Adjusting Parallel Strategy and Instance Count for P and D Nodes**
Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**.
2. **Optimizing TPOT**
Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated prefill solves this by allowing better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken to generate output tokens.
---
## Usage
vLLM Ascend currently supports two types of connectors for handling KV cache management:
- **MooncakeConnector**: D nodes pull KV cache from P nodes.
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
For step-by-step deployment and configuration, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
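As a rough illustration of the split (values are illustrative; the linked tutorial covers the full multi-node setup, including the proxy and Mooncake configuration):
```python
from vllm.config import KVTransferConfig

# Illustrative only: the P (prefill) instance produces KV cache, the D (decode)
# instance consumes it; with MooncakeConnector, D nodes pull KV cache from P nodes.
prefiller_cfg = KVTransferConfig(kv_connector="MooncakeConnector", kv_role="kv_producer")
decoder_cfg = KVTransferConfig(kv_connector="MooncakeConnector", kv_role="kv_consumer")
```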
---
## How It Works
### 1. Design Approach
Under disaggregated prefill, a global proxy receives external requests, forwarding prefill to P nodes and decode to D nodes; the KV cache (key-value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication.
### 2. Implementation Design
Our design diagram is shown below, illustrating the pull and push schemes respectively.
![alt text](../../assets/disaggregated_prefill_pull.png)
![alt text](../../assets/disaggregated_prefill_push.png)
#### Mooncake Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
3. After the P node's scheduler finishes prefill, `update_from_output` invokes the scheduler connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns the result to the Proxy.
4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
#### Mooncake Layerwise Connector
1. The request is sent to the Proxys `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
5. During processing, the P node's scheduler pushes the KV cache layer by layer; once all layers have been pushed, it releases the request and notifies the D node to begin decoding.
6. The D node performs decoding and returns the result.
### 3. Interface Design
Taking MooncakeConnector as an example, the system is organized into three primary classes:
- **MooncakeConnector**: Base class that provides core interfaces.
- **MooncakeConnectorScheduler**: Interface for scheduling the connectors within the engine core, responsible for managing KV cache transfer requirements and completion.
- **MooncakeConnectorWorker**: Interface for managing KV cache registration and transfer in worker processes.
### 4. Specifications Design
This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving both equal and unequal TP setups across multiple P and D nodes.
| Feature | Status |
|-------------------------------|----------------|
| A2 | 🟢 Functional |
| A3 | 🟢 Functional |
| equal TP configuration | 🟢 Functional |
| unequal TP configuration | 🟢 Functional |
| MLA | 🟢 Functional |
| GQA | 🟢 Functional |
- 🟢 Functional: Fully operational, with ongoing optimizations.
- 🔵 Experimental: Experimental support, interfaces and functions may change.
- 🚧 WIP: Under active development, will be supported soon.
- 🟡 Planned: Scheduled for future implementation (some may have open PRs/RFCs).
- 🔴 NO plan/Deprecated: No plan or deprecated by vLLM.
---
## DFX Analysis
### 1. Config Parameter Validation
Validate KV transfer config by checking whether the kv_connector type is supported and whether kv_connector_module_path exists and is loadable. On transfer failures, emit clear error logs for diagnostics.
### 2. Port Conflict Detection
Before startup, perform a port-usage check on configured ports (e.g., rpc_port, metrics_port, http_port/metaserver) by attempting to bind. If a port is already in use, fail fast and log an error.
### 3. PD Ratio Validation
Under non-symmetric PD scenarios, validate the P-to-D tp ratio against expected and scheduling constraints to ensure correct and reliable operation.
---
## Limitations
- Heterogeneous P and D nodes are not supported; for example, running P nodes on A2 and D nodes on A3.
- In non-symmetric TP configurations, only cases where the P nodes have a higher TP degree than the D nodes and the P TP count is an integer multiple of the D TP count are supported (i.e., P_tp > D_tp and P_tp % D_tp = 0).


@@ -0,0 +1,249 @@
# Expert Parallelism Load Balancer (EPLB)
## Why We Need EPLB?
When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.
To facilitate reproduction and deployment, vLLM Ascend supports the deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.
![eplb](../../assets/eplb.png)
## How to Use EPLB?
Please refer to the EPLB section of the user guide for detailed information: [How to Use EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)
## How It Works?
**EPLB Module Architecture**
```shell
vllm_ascend
├── eplb
│ ├── adaptor
│ │ ├── abstract_adaptor.py
│ │ ├── vllm_adaptor.py
│ ├── core
│ │ ├── policy
│ │ │ ├── policy_abstract.py
│ │ │ ├── policy_default_eplb.py
│ │ │ ├── policy_swift_balancer.py
│ │ │ ├── policy_factory.py
│ │ │ ├── policy_flashlb.py
│ │ ├── eplb_device_transfer_loader.py
│ │ ├── eplb_utils.py
│ │ ├── eplb_worker.py
│ ├── eplb_updator.py
│ ├── utils.py
└───────────
```
**1. Adaptor Module**
*Handles registration and adaptation for different MoE model types*
- `abstract_adaptor.py`
Abstract base class defining unified registration interfaces for EPLB adapters
- `vllm_adaptor.py`
Implementation supporting Qwen3-MoE and DeepSeek models, standardizing parameter handling for policy algorithms
**2. Core Module**
*Implements core algorithms, updates, and asynchronous processing*
- **Policy Submodule**
*Load balancing algorithms with factory pattern instantiation*
- `policy_abstract.py`
Abstract class for load balancing strategy interfaces
- `policy_default_eplb.py`
Default implementation of open-source EPLB paper algorithm
- `policy_swift_balancer.py`
Enhanced version optimizing expert swaps for low-bandwidth devices (e.g., A2)
- `policy_flashlb.py`
Threshold-based adjustment reducing operational costs through layer-wise fluctuation detection
- `policy_factory.py`
Strategy factory for automatic algorithm instantiation
- `eplb_device_transfer_loader.py`
Manages expert table/weight transmission and updates
- `eplb_utils.py`
Utilities for expert table initialization and mapping
- `eplb_worker.py`
Asynchronous algorithm orchestration and result processing
**3. System Components**
- `eplb_updator.py`
Central coordinator for load balancing during inference workflows
- `utils.py`
General utilities for EPLB interface registration
### Default Algorithm
#### Hierarchical Load Balancing
When the number of server nodes evenly divides the number of expert groups, we use the hierarchical load balancing policy to leverage group-limited expert routing. We first pack the expert groups onto nodes evenly, ensuring balanced loads across different nodes. Then, we replicate the experts within each node. Finally, we pack the replicated experts onto individual NPUs to ensure load balancing across them. The hierarchical load balancing policy can be used in the prefilling stage with a smaller expert-parallel size.
#### Global Load Balancing
In other cases, we use the global load balancing policy, which replicates experts globally regardless of expert groups, and packs the replicated experts onto individual NPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.
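As a toy illustration of this replicate-then-pack idea (not the actual implementation in `policy_default_eplb.py`), the sketch below duplicates the hottest experts and then greedily places each replica on the least-loaded NPU:
```python
def toy_rebalance(expert_loads, num_npus, num_redundant):
    """Return a per-NPU list of (expert_id, load_share) placements."""
    # 1. Replicate the heaviest experts; replicas share the expert's load evenly.
    ranked = sorted(range(len(expert_loads)), key=lambda e: expert_loads[e], reverse=True)
    copies = {e: 1 for e in range(len(expert_loads))}
    for e in ranked[:num_redundant]:
        copies[e] += 1
    replicas = [(expert_loads[e] / copies[e], e) for e in copies for _ in range(copies[e])]
    # 2. Greedy packing: place the heaviest remaining replica on the lightest NPU.
    placements = [[] for _ in range(num_npus)]
    npu_load = [0.0] * num_npus
    for load, e in sorted(replicas, reverse=True):
        target = npu_load.index(min(npu_load))
        placements[target].append((e, load))
        npu_load[target] += load
    return placements

# Example: 4 experts with skewed loads, 2 NPUs, 1 redundant expert.
print(toy_rebalance([10.0, 4.0, 3.0, 1.0], num_npus=2, num_redundant=1))
```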
### Add a New EPLB Policy
If you want to add a new EPLB policy to vllm_ascend, follow these steps:
1. Inherit the `EplbPolicy` abstract class in `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters (`current_expert_table`, `expert_workload`) and the return type (`newplacement`) consistent.
For example:
```python
import copy

class RandomLoadBalance(EplbPolicy):
    def __init__(self, config: DynamicConfig):
        super().__init__(config)

    def rebalance_experts(self, current_expert_table, expert_workload):
        new_table = copy.deepcopy(current_expert_table)
        num_layers = len(current_expert_table)
        for i in range(num_layers):
            # Randomly choose two cards, e.g.:
            # indices = random.sample(range(num_card), 2)
            indices = [3, 1]
            # Swap their redundant experts.
            expert_id_to_exchange = new_table[i][indices[0]][-1].clone()
            new_table[i][indices[0]][-1] = new_table[i][indices[1]][-1]
            new_table[i][indices[1]][-1] = expert_id_to_exchange
        return 1, [-i for i in range(num_layers)], new_table
```
2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
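A minimal sketch of what that registration might look like, reusing the `RandomLoadBalance` class from the example above; the real `PolicyFactory` in `policy_factory.py` may use different field and method names:
```python
class PolicyFactory:
    # Hypothetical registry; the actual mapping lives in policy_factory.py.
    _policies: dict = {}

    @classmethod
    def register(cls, policy_type: int, policy_cls) -> None:
        cls._policies[policy_type] = policy_cls

    @classmethod
    def generate_policy(cls, policy_type: int, config):
        if policy_type not in cls._policies:
            raise ValueError(f"Unknown EPLB policy type: {policy_type}")
        return cls._policies[policy_type](config)

# Register the new policy from the example above under an unused policy type id.
PolicyFactory.register(99, RandomLoadBalance)
```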
### Add a New MoE Model
**Implementation Guide for Model Integration**
1. **Adapter File Modification**
- Inherit or modify `vllm_ascend/eplb/adaptor/vllm_adaptor.py`
- Add processing logic for key parameters:
- `num_dense_layers`
- `global_expert_num`
- `num_moe_layers`
- Ensure parameter synchronization in the `model_register` function.
For example:
Modify `__init__` in `vllm_adaptor.py` to add EPLB parameters for the new MoE model:
```python
if self.model.config.model_type == "qwen3_moe":
    self.num_dense_layers = 0
    self.global_expert_num = self.model.config.num_experts
```
Modify `model_register` in `vllm_adaptor.py` to register EPLB parameters for the new MoE model:
```python
if config.model_type == "qwen3_moe":
    model.num_moe_layers = config.num_hidden_layers
```
2. **MoE Feature Integration**
- Extend `vllm_ascend/eplb/utils.py` with MoE-specific methods
- Implement required functionality for expert routing or weight management
3. **Registration Logic Update**
- Add patch logic within the `model_register` function
- Maintain backward compatibility with existing model types
4. **Validation & Testing**
- Verify parameter consistency across layers
- Test cross-device communication for expert tables
- Benchmark against baseline implementations (e.g., Qwen3-MoE)
*Key Implementation Notes:*
- Preserve existing interface contracts in abstract classes
- Use decorators for non-intrusive patch integration
- Leverage `eplb_utils.py` for shared expert mapping operations
## DFX
### Parameter Validation
#### Integer Parameters
All integer input parameters must explicitly specify their maximum and minimum values and be subject to valid value validation. For example, `expert_heat_collection_interval` must be greater than 0:
```python
@staticmethod
def check_iterations(iterations):
    # `sys` is imported at module level.
    if not isinstance(iterations, int):
        raise TypeError(f"iterations must be an int, got {type(iterations).__name__}.")
    if iterations <= 0:
        raise ValueError(f"iterations must be greater than 0, got {iterations}.")
    if iterations > sys.maxsize:
        raise ValueError(f"iterations cannot be larger than {sys.maxsize}, got {iterations}.")
```
#### File Path
The file path for EPLB must be checked for legality, such as whether the file path is valid and whether it has appropriate read and write permissions. For example:
```python
@staticmethod
def check_expert_map_path(expert_map):
    # `os` is imported at module level.
    if expert_map is None:
        return
    if not isinstance(expert_map, str):
        raise TypeError("The expert_map path is not a str.")
    if not expert_map.strip():
        raise ValueError("The expert_map path is empty.")
    _, ext = os.path.splitext(expert_map)
    if ext.lower() != ".json":
        raise TypeError("The expert_map file is not a json file.")
    if not os.path.exists(expert_map):
        raise ValueError("The expert_map path does not exist.")
    try:
        with open(expert_map, "r", encoding="utf-8") as f:
            f.read()
    except Exception as e:
        raise IOError(
            f"Failed to read expert info from {expert_map}, "
            f"please check the read permission of {expert_map}: {e}"
        )
```
### Function Specifications
#### Initialization Function
All EPLB parameters must be given default values at initialization, with their parameter types and defaults explicitly specified so they can be handled properly.
#### General Functions
All method arguments must specify parameter types and default values, and every function should define a default return value. It is recommended to wrap the function body in a `try-except` block, specifying the exception types to capture and the failure handling (e.g., logging the exception or returning a failure status).
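A minimal sketch of this pattern (typed arguments with defaults, a default return value, and explicit exception handling); the function name and file format are illustrative:
```python
import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def load_expert_workload(path: str = "", default: Optional[dict] = None) -> dict:
    """Load expert workload statistics, falling back to a default value on failure."""
    result = default if default is not None else {}
    if not path:
        return result
    try:
        with open(path, "r", encoding="utf-8") as f:
            result = json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        # Log the captured exception and return the default instead of crashing the worker.
        logger.error("Failed to load expert workload from %s: %s", path, e)
    return result
```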
### Consistency
#### Expert Map
The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified which ranks have inconsistent maps.
During the update process, if only a few layers or the expert table of a certain rank has been changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency.
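A minimal sketch of such a consistency check using `torch.distributed` (it assumes the process group is already initialized; the function name is illustrative):
```python
import torch
import torch.distributed as dist

def check_expert_map_consistency(expert_map: torch.Tensor) -> None:
    """Gather every rank's expert map and report the ranks that disagree with rank 0."""
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(expert_map) for _ in range(world_size)]
    dist.all_gather(gathered, expert_map)
    bad_ranks = [r for r, m in enumerate(gathered) if not torch.equal(m, gathered[0])]
    if bad_ranks:
        raise RuntimeError(
            f"Expert map mismatch on ranks {bad_ranks} (compared with rank 0)")
```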
#### Expert Weight
When updating expert weights, ensure that the memory allocated for the old expert weights has been released, or that the old expert weights are no longer in use.
## Limitations
Before using EPLB, set `export DYNAMIC_EPLB="true"` before launching the startup script.
Before collecting load data (or performance data), set `export EXPERT_MAP_RECORD="true"` before launching the startup script.

View File

@@ -0,0 +1,19 @@
# Design Documents
This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
:::{toctree}
:caption: Design Documents
:maxdepth: 1
patch
cpu_binding
ModelRunner_prepare_inputs
disaggregated_prefill
eplb_swift_balancer
ACL_Graph
KV_Cache_Pool_Guide
add_custom_aclnn_op
context_parallel
quantization
npugraph_ex
:::

View File

@@ -0,0 +1,101 @@
# Npugraph_ex
## How Does It Work?
This is an optimization based on FX graphs and can be considered an acceleration solution for the ACL Graph mode.
Its source code is available at [torchair](https://gitcode.com/Ascend/torchair).
## Default FX Graph Optimization
### FX Graph pass
- For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance.
- For the original input parameters of the model, if they include in-place operators, Dynamo's Functionalize process will replace the in-place operators with a form of non-in-place operators + copy operators. npugraph_ex will reverse this process, restoring the in-place operators and reducing memory movement.
### FX fusion pass
npugraph_ex now provides three default operator fusion passes, and more will be added in the future.
Operator combinations that meet the replacement rules can be replaced with the corresponding fused operators.
See the default [fusion pass list](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00017.html).
## Custom fusion pass
Users can register a custom graph fusion pass in TorchAir to modify PyTorch FX graphs. The registration relies on the register_replacement API.
Below is the declaration of this API and a demo of its usage.
```python
register_replacement(search_fn, replace_fn, example_inputs, trace_fn=fwd_only, extra_check=_return_true, search_fn_pattern=None)
```
| Parameter Name | Input/Output | Explanation | Required |
|--|--|--|--|
| search_fn | Input | The operator combination or computation logic that you want to recognize in the FX graph, such as the operator combination to be fused. | Yes |
| replace_fn | Input | When the combination corresponding to search_fn is found in the target graph, this function's computation logic replaces the original subgraph to achieve operator fusion or optimization. | Yes |
| example_inputs | Input | Example input tensors used to trace search_fn and replace_fn. The shape and dtype of the inputs should match the actual scenario. | Yes |
| trace_fn | Input | By default, only the forward computation graph is traced, which is suitable for optimization during the inference phase; if training scenarios need to be supported, a function that supports backward tracing can be provided. | No |
| extra_check | Input | An extra verification function applied after a subgraph match is found. Its input parameter must be a Match object from torch._inductor.pattern_matcher, and it is used for further custom checks on the matching result, such as checking whether the fused operators are on the same stream, checking the device type, checking the input shapes, and so on. | No |
| search_fn_pattern | Input | A custom pattern object, which is generally unnecessary. Its definition follows the rules of the native PyTorch MultiOutputPattern object. When this parameter is passed, search_fn is no longer used to match operator combinations; this parameter is used directly as the matching rule. | No |
Usage Example
```python
import functools
import torch, torch_npu, torchair
from torch._inductor.pattern_matcher import Match
from torch._subclasses.fake_tensor import FakeTensorMode
from torchair.core.utils import logger

# Assume we fuse the add operator and the npu_rms_norm operator into the npu_add_rms_norm operator.

# Define a search_fn that describes the operator combination in the original FX graph before fusion.
def search_fn(x1, x2, gamma):
    xOut = torch.add(x1, x2)
    y, _ = torch_npu.npu_rms_norm(xOut, gamma)
    return y, xOut

# Define a replace_fn, i.e. the fused operator, used to replace the operator combination in the FX graph.
def replace_fn(x1, x2, gamma):
    y, _, xOut = torch_npu.npu_add_rms_norm(
        x1, x2, gamma
    )
    return y, xOut

# extra_check can carry additional validation logic. Here it checks whether the last dimension of the
# first input x1 is a specific value; if it is not, fusion is not allowed.
def extra_check(match: Match):
    x1 = match.kwargs.get("x1")
    if x1 is None:
        return False
    if not hasattr(x1, "meta") or "val" not in x1.meta:
        return False
    a_shape = x1.meta["val"].shape
    return a_shape[-1] == 7168

# Define some example inputs to trace search_fn and replace_fn into FX graphs.
fake_mode = FakeTensorMode()
with fake_mode:
    # Sizes/values don't actually matter for the initial trace; once we get a possible match,
    # we re-trace with the actual values and verify the match still holds.
    input_tensor = functools.partial(torch.empty, (1, 1, 2), device="npu", dtype=torch.float16)
    kwargs_tensor = functools.partial(torch.empty, 2, device="npu", dtype=torch.float16)

    # Call the torchair.register_replacement API with search_fn, replace_fn, and example_inputs.
    # Additional validation can be passed in as extra_check.
    torchair.register_replacement(
        search_fn=search_fn,
        replace_fn=replace_fn,
        example_inputs=(input_tensor(), input_tensor(), kwargs_tensor()),
        extra_check=extra_check
    )
```
The default fusion pass in npugraph_ex is also implemented based on this API. You can see more examples of using this API in the vllm-ascend and npugraph_ex code repositories.
### DFX
npugraph_ex reuses the `TORCH_COMPILE_DEBUG` environment variable from the PyTorch community: when `TORCH_COMPILE_DEBUG=1` is set, the FX graphs are dumped throughout the entire process.

View File

@@ -0,0 +1,76 @@
# Patch in vLLM Ascend
vLLM Ascend is a platform plugin for vLLM. Because vLLM and vLLM Ascend have different release cycles and hardware constraints, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to adapt to changes in vLLM.
## Principle
We should keep in mind that patching is not the best way to make vLLM Ascend compatible; it's just a temporary solution. The best way is to contribute the change to vLLM so that it is compatible with vLLM Ascend from the start. In vLLM Ascend, we follow these basic principles for the patch strategy:
1. Less is more. Please do not patch unless it's the only way currently.
2. Once a patch is added, it's required to describe the future plan for removing the patch.
3. Cleaning up patch code is always welcome.
## How it works
In `vllm_ascend/patch`, you can see the code structure as follows:
```shell
vllm_ascend
├── patch
│ ├── platform
│ │ ├── patch_xxx.py
│ ├── worker
│ │ ├── patch_yyy.py
└───────────
```
- **platform**: The patch code in this directory is for patching the code in vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
- For online mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in vLLM worker process. It's called by `vllm_ascend/worker/worker::NPUWorker::__init__` when the vLLM worker process is initialized.
- For both online and offline mode, vLLM engine core process calls the worker patch in `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
## How to write a patch
Before writing a patch, following the principle above, we should patch as little code as possible. If a patch is necessary, it can go in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both `0.10.0` and `main` of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
import vllm

def patch_destroy_model_parallel():
    # your patch code
    ...

vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_distributed` into `vllm_ascend/patch/platform/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```python
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
# How:
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>
# Future Plan:
# <Describe the future plan to remove the patch>
```
7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md)
## Limitations
1. In the V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process, and the Worker process. Currently, vLLM Ascend can only patch code in the Main process and the Worker process by default. If you want to patch code running in the EngineCore process, you should patch the EngineCore process entirely during setup: find the code in `vllm.v1.engine.core` and override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may change automatically. For example, if you run edited vLLM based on v0.9.n, the version may become v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't determine the version of the vLLM you're using. You can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the corresponding patch will work as expected.

View File

@@ -0,0 +1,114 @@
# Quantization Adaptation Guide
This document provides guidance for adapting quantization algorithms and models related to **ModelSlim**.
## Quantization Feature Introduction
### Quantization Inference Process
The current process for registering and obtaining quantization methods in vLLM Ascend is as follows:
![get_quant_method](../../assets/quantization/get_quant_method.png)
vLLM Ascend registers a custom Ascend quantization method. By configuring the `--quantization ascend` parameter (or `quantization="ascend"` for offline), the quantization feature is enabled. When constructing the `quant_config`, the registered `AscendModelSlimConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, stored in the `quant_method` attribute.
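For example, offline inference with a quantized checkpoint looks like this (the model path is illustrative):
```python
from vllm import LLM

# Load a ModelSlim-quantized checkpoint with the Ascend quantization method enabled.
llm = LLM(model="/path/to/quantized/model", quantization="ascend")
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```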
Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:
![quant_methods_overview](../../assets/quantization/quant_methods_overview.png)
The quantization method base class defined by vLLM and the overall call flow of quantization methods are as follows:
![quant_method_call_flow](../../assets/quantization/quant_method_call_flow.png)
Quantization is generally not implemented for the `embedding` method, so we focus only on the remaining methods.
The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, data type conversion, etc.; the `apply` method is used to perform activation quantization and quantized matrix multiplication calculations during the forward process.
We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**).
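A hedged skeleton of such a layer-level method is sketched below; the class name, signatures, and the pure-PyTorch matmul are illustrative and do not match the exact vLLM Ascend interfaces:
```python
import torch

class ExampleW8A8DynamicLinearMethod:
    """Illustrative skeleton of a linear quantization method (not the real interface)."""

    def create_weights(self, out_features: int, in_features: int, params_dtype=torch.bfloat16):
        # Allocate INT8 weights plus per-channel scales during model construction.
        return {
            "weight": torch.empty(out_features, in_features, dtype=torch.int8),
            "weight_scale": torch.empty(out_features, 1, dtype=params_dtype),
        }

    def process_weights_after_loading(self, weights: dict) -> None:
        # Post-process loaded weights, e.g. transpose into the layout the matmul expects.
        weights["weight"] = weights["weight"].t().contiguous()

    def apply(self, weights: dict, x: torch.Tensor) -> torch.Tensor:
        # Quantize activations per token, run the quantized matmul, then dequantize.
        x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        x_q = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
        y = x_q.to(torch.float32) @ weights["weight"].to(torch.float32)
        return (y * x_scale * weights["weight_scale"].t()).to(x.dtype)
```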
**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
```json
{
"model.layers.0.linear_attn.dt_bias": "FLOAT",
"model.layers.0.linear_attn.A_log": "FLOAT",
"model.layers.0.linear_attn.conv1d.weight": "FLOAT",
"model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
"model.layers.0.linear_attn.in_proj_qkvz.weight_scale": "W8A8_DYNAMIC",
"model.layers.0.linear_attn.in_proj_qkvz.weight_offset": "W8A8_DYNAMIC",
"model.layers.0.linear_attn.in_proj_ba.weight": "FLOAT",
"model.layers.0.linear_attn.norm.weight": "FLOAT",
"model.layers.0.linear_attn.out_proj.weight": "FLOAT",
"model.layers.0.mlp.gate.weight": "FLOAT",
"model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```
Based on the above content, we present a brief description of the adaptation process for quantization algorithms and quantized models.
### Quantization Algorithm Adaptation
- **Step 1: Algorithm Design**. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine the supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic, per-tensor/per-channel/per-group).
- **Step 2: Registration**. Use the `@register_scheme` decorator in `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class.
```python
from vllm_ascend.quantization.methods import (register_scheme, AscendLinearScheme,
                                              AscendMoEScheme)

@register_scheme("W4A8_DYNAMIC", "linear")
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
    ...

@register_scheme("W4A8_DYNAMIC", "moe")
class AscendW4A8DynamicFusedMoEMethod(AscendMoEScheme):
    ...
```
- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/methods/w4a8.py`, and implement the method class and logic.
- **Step 4: Testing**. Use your algorithm to generate quantization configurations and verify correctness and performance on target models and hardware.
### Quantized Model Adaptation
Adapting a new quantized model requires ensuring the following three points:
- The original model has been successfully adapted in `vLLM Ascend`.
- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/modelslim_config.py` (e.g., `qkv_proj`, `gate_up_proj`, `experts`) to ensure sharding consistency and correct loading.
```python
packed_modules_model_mapping = {
    "qwen3_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
            ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
}
```
- All quantization algorithms used by the quantized model have been integrated into the `quantization` module.
## Currently Supported Quantization Algorithms
vLLM Ascend supports multiple quantization algorithms. The following table provides an overview of each quantization algorithm based on the implementation in the `vllm_ascend.quantization` module:
| Algorithm | Weight | Activation | Weight Granularity | Activation Granularity | Type | Description |
| ------------------------ | ------ | ---------- | ------------------ | ---------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `W4A16` | INT4 | FP16/BF16 | Per-Group | Per-Tensor | Static | 4-bit weight quantization with 16-bit activation precision, specifically designed for MoE model expert layers, supporting int32 format weight packing |
| `W8A16` | INT8 | FP16/BF16 | Per-Channel | Per-Tensor | Static | 8-bit weight quantization with 16-bit activation precision, balancing accuracy and performance, suitable for linear layers |
| `W8A8` | INT8 | INT8 | Per-Channel | Per-Tensor | Static | Static activation quantization, suitable for scenarios requiring high precision |
| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit then per-group to 4-bit) |
| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | PD Colocation Scenario uses dynamic quantization for both P node and D node; PD Disaggregation Scenario uses dynamic quantization for P node and static for D node |
**Static vs Dynamic:** Static quantization uses pre-computed scaling factors with better performance, while dynamic quantization computes scaling factors on-the-fly for each token/activation tensor with higher precision.
**Granularity:** Refers to the scope of scaling factor computation (e.g., per-tensor, per-channel, per-group).
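As a concrete illustration of the granularity difference, the sketch below computes symmetric INT8 scaling factors per channel for a weight and per token for an activation (the shapes are illustrative):
```python
import torch

weight = torch.randn(4096, 1024)       # [out_channels, in_channels]
activation = torch.randn(8, 1024)      # [num_tokens, hidden_size]

# Per-channel (weight, static): one scale per output channel, computed offline.
w_scale = weight.abs().amax(dim=1, keepdim=True) / 127.0      # [4096, 1]
# Per-token (activation, dynamic): one scale per token, computed on the fly.
a_scale = activation.abs().amax(dim=1, keepdim=True) / 127.0  # [8, 1]

w_int8 = torch.clamp(torch.round(weight / w_scale), -128, 127).to(torch.int8)
a_int8 = torch.clamp(torch.round(activation / a_scale), -128, 127).to(torch.int8)
```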