[Doc] fix the nit in docs (#6826)

Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main: 83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
wangxiyuan
2026-02-27 11:50:27 +08:00
committed by GitHub
parent 981d803cb7
commit a95c0b8b82
30 changed files with 145 additions and 118 deletions

View File

@@ -103,7 +103,7 @@ If the PR spans more than one category, please include all relevant prefixes.
## Others
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, you can feel free to submit a PR to improve the doc to help other developers.
If you encounter any problems while contributing, feel free to submit a PR to improve the documentation to help other developers.
:::{toctree}
:caption: Index

View File

@@ -2,9 +2,9 @@
Multi-Node CI is designed to test distributed scenarios of very large models, eg: disaggregated_prefill multi DP across multi nodes and so on.
## How is works
## How it works
The following picture shows the basic deployment view of the multi-node CI mechanism, It shows how the github action interact with [lws](https://lws.sigs.k8s.io/docs/overview/) (a kind of kubernetes crd resource)
The following picture shows the basic deployment view of the multi-node CI mechanism. It shows how the GitHub action interacts with [lws](https://lws.sigs.k8s.io/docs/overview/) (a kind of kubernetes crd resource).
![alt text](../../assets/deployment.png)
@@ -16,7 +16,7 @@ From the workflow perspective, we can see how the final test script is executed,
1. Upload custom weights
If you need customized weights, for example, you quantized a w8a8 weight for DeepSeek-V3 and you want your weight to run on CI, Uploading weights to ModelScope's [vllm-ascend](https://www.modelscope.cn/organization/vllm-ascend) organization is welcome, If you do not have permission to upload, please contact @Potabk
If you need customized weights, for example, you quantized a w8a8 weight for DeepSeek-V3 and you want your weight to run on CI, uploading weights to ModelScope's [vllm-ascend](https://www.modelscope.cn/organization/vllm-ascend) organization is welcome. If you do not have permission to upload, please contact @Potabk
2. Add config yaml
@@ -71,7 +71,8 @@ From the workflow perspective, we can see how the final test script is executed,
```
3. Add the case to nightly workflow
currently, the multi-node test workflow defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/nightly_test_a3.yaml)
Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/nightly_test_a3.yaml) file:
```yaml
multi-node-tests:
@@ -106,7 +107,7 @@ currently, the multi-node test workflow defined in the [nightly_test_a3.yaml](ht
KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
The matrix above defines all the parameters required to add a multi-machine use case, The parameters worth paying attention to (I mean if you are adding a new use case) are size and the path to the yaml configuration file. The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2.
The matrix above defines all the parameters required to add a multi-machine use case. The parameters worth noting (if you are adding a new use case) are `size` and the path to the yaml configuration file. The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2.
## Run Multi-Node tests locally

View File

@@ -2,9 +2,9 @@
This document explains how to write E2E tests and unit tests to verify the implementation of your feature.
## Setup a test environment
## Set up a test environment
The fastest way to setup a test environment is to use the main branch's container image:
The fastest way to set up a test environment is to use the main branch's container image:
:::::{tab-set}
:sync-group: e2e
@@ -178,7 +178,7 @@ TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card the tests
# Run all single-card tests
pytest -sv tests/ut
# Run single test
@@ -192,7 +192,7 @@ pytest -sv tests/ut/test_ascend_config.py
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card the tests
# Run all multi-card tests
pytest -sv tests/ut
# Run single test
@@ -223,7 +223,7 @@ You can't run the E2E test on CPUs.
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card the tests
# Run all single-card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
# Run a certain test script
@@ -240,7 +240,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.
```bash
cd /vllm-workspace/vllm-ascend/
# Run all the single card tests
# Run all multi-card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
# Run a certain test script
@@ -256,7 +256,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_aclgraph_accuracy.
This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
Run nightly multi-node test cases locally refer to section of `Running Locally` of [Multi Node Test](./multi_node_test.md).
For running nightly multi-node test cases locally, refer to the `Running Locally` section in [Multi Node Test](./multi_node_test.md).
#### E2E test example
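For illustration, a minimal offline-inference E2E test might look like the sketch below. The model name and assertions are placeholders; real tests under `tests/e2e/` follow the project's own fixtures and helpers.

```python
# Minimal sketch of an offline-inference E2E test; the model and assertions are illustrative only.
from vllm import LLM, SamplingParams

def test_offline_inference_smoke():
    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")   # any small model works here
    params = SamplingParams(temperature=0.0, max_tokens=16)
    outputs = llm.generate(["Hello, my name is"], params)
    assert len(outputs) == 1
    assert len(outputs[0].outputs[0].text) > 0       # the model produced some text
```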

View File

@@ -53,7 +53,7 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
### 2. Run different dataset using AISBench
### 2. Run different datasets using AISBench
#### Install AISBench
@@ -81,7 +81,7 @@ You can choose one or multiple datasets to execute accuracy evaluation.
1. `C-Eval` dataset.
Take `C-Eval` dataset as an example. And you can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Every datasets have a `README.md` for detailed download and installation process.
Take `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
Download the dataset and install it to a specific path.

View File

@@ -1,6 +1,6 @@
# Using EvalScope
This document will guide you have model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
This document will guide you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
## 1. Online server

View File

@@ -1,8 +1,8 @@
# ACL Graph
## Why we need ACL Graph?
## Why do we need ACL Graph?
When in LLM inference, each token requires nearly thousand operator executions, and when host launching operators are slower than device, it will cause host bound. In severe cases, the device will be idle for more than half of the time. To solve this problem, we use graph in LLM inference.
In LLM inference, each token requires nearly a thousand operator executions. When host launching operators are slower than device, it will cause host bound. In severe cases, the device will be idle for more than half of the time. To solve this problem, we use graph in LLM inference.
```shell
eager mode:
@@ -29,15 +29,15 @@ ACL Graph is enabled by default in V1 Engine, just need to check that `enforce_e
## How it works?
In short, graph mode works in two steps: **capture and replay**. When engine starts, we will capture all of the ops in model forward and save it as a graph, and when req come in, we just replay the graph on devices, and waiting for result.
In short, graph mode works in two steps: **capture and replay**. When the engine starts, we capture all of the ops in the model forward and save it as a graph. When a request comes in, we just replay the graph on the device and wait for the result.
But in reality, graph mode is not that simple.
### Padding and Bucketing
Due to graph can only replay the ops captured before, without doing tiling and checking graph input, we need to ensure the consistency of the graph input, but we know that model input's shape depends on the request scheduled by Scheduler, we can't ensure the consistency.
Due to the fact that a graph can only replay the ops captured before, without doing tiling and checking graph input, we need to ensure the consistency of the graph input. However, we know that the model input's shape depends on the request scheduled by the Scheduler, so we can't ensure consistency.
Obviously, we can solve this problem by capturing the biggest shape and padding all of the model input to it. But it will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shape, and pad the model input to the nearest graph, which will greatly reduce redundant computing. But when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. But we know that when intensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of things we need to do is:
Obviously, we can solve this problem by capturing the biggest shape and padding all of the model inputs to it. But this will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shapes, and pad the model input to the nearest graph, which will greatly reduce redundant computing. But when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. We know that when the input tensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of the things we need to do are:
1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
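As a rough illustration of the bucketing logic just described, the sketch below pads the scheduled token count to the nearest captured bucket and falls back to eager mode above the threshold. The bucket sizes and the threshold are made-up numbers, not vllm-ascend defaults.

```python
# Toy bucket selection; bucket sizes and threshold are illustrative, not vllm-ascend defaults.
ACLGRAPH_BUCKETS = [8, 16, 32, 64, 128, 256]   # shapes captured as separate graphs
EAGER_THRESHOLD = 256                           # above this, graph mode is unnecessary

def pick_bucket(num_scheduled_tokens: int):
    """Return the bucket to pad to, or None to run in eager mode."""
    if num_scheduled_tokens > EAGER_THRESHOLD:
        return None                             # big batches run eagerly
    # pad the model input up to the nearest captured graph
    return next(b for b in ACLGRAPH_BUCKETS if b >= num_scheduled_tokens)

assert pick_bucket(20) == 32      # 20 tokens are padded to the 32-token graph
assert pick_bucket(1000) is None  # falls back to eager mode
```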
@@ -58,39 +58,39 @@ Obviously, we can solve this problem by capturing the biggest shape and padding
### Piecewise and Full graph
Due to the increasing complexity of the attention layer in current LLM, we can't ensure all types of attention can run in graph. In MLA, prefill_tokens and decode_tokens have different calculation method, so when a batch has both prefills and decodes in MLA, graph mode is difficult to handle this situation.
Due to the increasing complexity of the attention layer in current LLMs, we can't ensure all types of attention can run in graph. In MLA, prefill_tokens and decode_tokens have different calculation methods, so when a batch has both prefills and decodes in MLA, graph mode is difficult to handle this situation.
vLLM solves this problem with piecewise graph mode. We use eager mode to launch attention's ops, and use graph to deal with others. But it also bring some problems: The cost of launching ops has become large again, although much smaller than eager mode, but it will also lead to host bound when cpu is poor or `num_tokens` is small.
vLLM solves this problem with piecewise graph mode. We use eager mode to launch attention's ops, and use graph to deal with others. But this also brings some problems: The cost of launching ops has become large again. Although much smaller than eager mode, it will also lead to host bound when the CPU is poor or `num_tokens` is small.
Altogether, we need to support both piecewise and full graph mode.
1. When attention can run in graph, we tend to choose full graph mode to achieve optimal performance;
2. When full graph is not work, use piecewise graph as a substitute;
3. When piecewise graph's performance is not good and full graph mode is blocked, separate prefills and decodes, and use full graph mode in **decode_only** situation. Because when a batch include prefill req, usually `num_tokens` will be quite big and not cause host bound.
2. When full graph does not work, use piecewise graph as a substitute;
3. When piecewise graph's performance is not good and full graph mode is blocked, separate prefills and decodes, and use full graph mode in **decode_only** situations. Because when a batch includes prefill requests, usually `num_tokens` will be quite big and not cause host bound.
> Currently, due to the stream resource constraint, we can only support a few buckets in piecewise graph mode, which will cause redundant computing and may lead to performance degradation compared with eager mode.
## How it be implemented?
## How is it implemented?
vLLM has already implemented most of the modules in graph mode. You can see more details at: [CUDA Graphs](https://docs.vllm.ai/en/latest/design/cuda_graphs.html)
When in graph mode, vLLM will call `current_platform.get_static_graph_wrapper_cls` to get current device's graph model wrapper, so what we need to do is to implement the graph mode wrapper on Ascend: `ACLGraphWrapper`.
When in graph mode, vLLM will call `current_platform.get_static_graph_wrapper_cls` to get the current device's graph model wrapper, so what we need to do is implement the graph mode wrapper on Ascend: `ACLGraphWrapper`.
vLLM has added `support_torch_compile` decorator to all models, this decorator will replace the `__init__` and `forward` interface of the model class, and when `forward` called, the code inside the `ACLGraphWrapper` will be executed, and it will do capture or replay as mentioned above.
vLLM has added `support_torch_compile` decorator to all models. This decorator will replace the `__init__` and `forward` interface of the model class. When `forward` is called, the code inside the `ACLGraphWrapper` will be executed, and it will do capture or replay as mentioned above.
When use piecewise graph, we just need to follow the above-mentioned process, but when in full graph, due to the complexity of the attention, sometimes we need to update attention op's param before execution. So we implement `update_attn_params` and `update_mla_attn_params` funcs for full graph mode. And when forward, memory will be reused between different ops, so we can't update attention op's param before forward. In ACL Graph, we use `torch.npu.graph_task_update_begin` and `torch.npu.graph_task_update_end` to do it, and use `torch.npu.ExternalEvent` to ensure order between params update and ops execution.
When using piecewise graph, we just need to follow the above-mentioned process. But when in full graph, due to the complexity of the attention, sometimes we need to update attention op's params before execution. So we implement `update_attn_params` and `update_mla_attn_params` functions for full graph mode. During forward, memory will be reused between different ops, so we can't update attention op's params before forward. In ACL Graph, we use `torch.npu.graph_task_update_begin` and `torch.npu.graph_task_update_end` to do it, and use `torch.npu.ExternalEvent` to ensure order between param updates and op executions.
## DFX
### Stream resource constraint
Currently, we can only capture 1800 graphs at most, due to the limitation of ACL graph that a graph requires a separate stream at least. This number is bounded by the number of streams, which is 2048, we save 248 streams as a buffer. Besides, there are many variables that can affect the number of buckets:
Currently, we can only capture 1800 graphs at most, due to the limitation of ACL graph that a graph requires at least a separate stream. This number is bounded by the number of streams, which is 2048; we save 248 streams as a buffer. Besides, there are many variables that can affect the number of buckets:
+ Piecewise graph will divides the model into `num_hidden_layers + 1` sub modules, based on attention layer. Every sub module is a single graph which need to cost stream, so the number of buckets in piecewise graph mode is very tight compared with full graph mode.
+ Piecewise graph divides the model into `num_hidden_layers + 1` sub modules, based on the attention layer. Every sub module is a single graph that consumes its own stream, so the number of buckets in piecewise graph mode is very tight compared with full graph mode.
+ The number of streams required for a graph is related to the number of comm domains. Each comm domain increases the number of streams consumed by a graph by one.
+ When multi-stream is explicitly called in sub module, it will consumes an additional stream.
+ When multi-stream is explicitly called in a sub module, it will consume an additional stream.
There are some other rules about ACL Graph and stream. Currently, we use func `update_aclgraph_sizes` to calculate the maximum number of buckets and update `graph_batch_sizes` to ensure stream resource is sufficient.
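For a back-of-the-envelope feel for this constraint, the sketch below reproduces the numbers from the text (2048 streams, 248 kept as a buffer, `num_hidden_layers + 1` sub-graphs per bucket in piecewise mode). The accounting is deliberately simplified compared with `update_aclgraph_sizes`.

```python
# Rough bucket budget using the numbers above; the real accounting in update_aclgraph_sizes
# is more detailed (comm domains, explicit multi-stream usage, etc.).
TOTAL_STREAMS = 2048
STREAM_BUFFER = 248
MAX_GRAPHS = TOTAL_STREAMS - STREAM_BUFFER          # 1800 graphs at most

def max_buckets(num_hidden_layers: int, piecewise: bool) -> int:
    graphs_per_bucket = num_hidden_layers + 1 if piecewise else 1
    return MAX_GRAPHS // graphs_per_bucket

print(max_buckets(61, piecewise=True))    # e.g. a 61-layer model: only ~29 buckets
print(max_buckets(61, piecewise=False))   # full graph mode: up to 1800 buckets
```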

View File

@@ -4,19 +4,19 @@
Prefix caching is an important feature in LLM inference that can reduce prefill computation time drastically.
However, the performance gain from prefix caching is highly dependent on cache hit rate, while cache hit rate can be limited if one only uses HBM for kv cache storage.
However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses HBM for KV cache storage.
Hence, KV Cache Pool is proposed to utilize various types of storages including HBM,DRAM and SSD, making a pool for KV Cache storage, while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests.
Hence, KV Cache Pool is proposed to utilize various types of storage including HBM, DRAM, and SSD, making a pool for KV Cache storage while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests.
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake): one of the most recognized KV Cache storage engine;
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines.
While one can utilize mooncake store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports mooncake store and can utilize the data transfer strategy to one that is best fit to Huawei NPU hardware.
While one can utilize Mooncake Store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports Mooncake Store and can utilize the data transfer strategy that best fits Huawei NPU hardware.
Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
## Usage
vLLM Ascend Currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, one needs to config `kv-transfer-config` and choose `MooncakeStoreConnector` as KV Connector.
vLLM Ascend currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
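As a quick orientation, the `kv-transfer-config` for this setup might look like the sketch below. The connector name comes from this document; the remaining fields are assumptions, so treat the KV Pool User Guide as the source of truth.

```python
import json

# Hypothetical kv-transfer-config enabling the store connector; verify fields against the guide.
kv_transfer_config = {
    "kv_connector": "MooncakeStoreConnector",   # named in this document
    "kv_role": "kv_both",                       # assumed role
}
# Typically passed on the command line, e.g.:
#   vllm serve <model> --kv-transfer-config '<json of the dict above>'
print(json.dumps(kv_transfer_config))
```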
@@ -33,25 +33,25 @@ When combined with vLLM's Prefix Caching mechanism, the pool enables efficient
Prefix Caching with HBM is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
The user can enable both features simply by enabling Prefix Caching, which is enabled by default in vLLM V1 unless the --no_enable_prefix_caching flag is set, and setting up the KV Connector for KV Pool(e.g. the MooncakeStoreConnector)
The user can enable both features simply by enabling Prefix Caching, which is enabled by default in vLLM V1 unless the `--no_enable_prefix_caching` flag is set, and setting up the KV Connector for KV Pool (e.g., the MooncakeStoreConnector).
**Workflow**:
1. The engine first checks for prefix hits in the HBM cache.
2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector, if there is additional hits in KV Pool, we get the **additional blocks only** from KV Pool, and get the rest of the blocks directly from HBM to minimize the data transfer latency.
2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from HBM to minimize the data transfer latency.
3. After the KV Caches in KV Pool is load into HBM, the remaining process is the same as Prefix Caching in HBM.
3. After the KV Caches in the KV Pool are loaded into HBM, the remaining process is the same as Prefix Caching in HBM.
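A toy illustration of step 2: only the blocks beyond the HBM hit are fetched from the pool. The function and variable names below are made up for illustration.

```python
# Toy version of step 2: HBM already holds the first hbm_hit_blocks, the pool holds pool_hit_blocks,
# so only the gap in between needs to be transferred.
def blocks_to_load_from_pool(hbm_hit_blocks: int, pool_hit_blocks: int) -> list[int]:
    return list(range(hbm_hit_blocks, max(hbm_hit_blocks, pool_hit_blocks)))

print(blocks_to_load_from_pool(3, 7))   # [3, 4, 5, 6] -- blocks 0..2 come straight from HBM
```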
### 2. Combining KV Cache Pool with Mooncake PD Disaggregation
When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cache Pool can further decouple prefill and decode stages across devices or nodes.
Currently, we only perform put and get operation of KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e. MooncakeConnector.
Currently, we only perform put and get operations of KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e., MooncakeConnector.
The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfer KV Caches between NPU devices directly.
The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfers KV Caches between NPU devices directly.
To Enable this feature, we need to setup both Mooncake Connector and Mooncake Store connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in specific order;
To enable this feature, we need to set up both Mooncake Connector and Mooncake Store Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
For details, please also refer to the Mooncake Connector Store Deployment Guide.
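A sketch of what the combined configuration could look like is shown below. The nesting follows vLLM's MultiConnector convention as I understand it, and the roles are assumptions; the Mooncake Connector Store Deployment Guide has the authoritative layout.

```python
import json

# Hypothetical MultiConnector config chaining the P2P connector and the store connector.
multi_connector_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "connectors": [
            {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},    # P2P transfer
            {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"},   # KV Pool
        ]
    },
}
print(json.dumps(multi_connector_config, indent=2))
```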
@@ -72,17 +72,17 @@ The KV Connector methods that need to be implemented can be categorized into sch
### Connector Worker-Side Methods
`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform KV cache load operation that transfers KV cache from storage to device.
`wait_for_layer_load`: Optional; Wait for layer load in layerwise + async KV load scenario.
`save_kv_layer`: Optional Do layerwise KV cache put into KV Pool.
`wait_for_save`: Wait for KV Save to finish if async KV cache save/put.
`get_finished` Get request that finished KV transfer, `done_sending` if `put` finished, `done_receiving` if `get` finished.
`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform KV cache load operation that transfers KV cache from storage to device.
`wait_for_layer_load`: Optional; Wait for layer load in layerwise + async KV load scenario.
`save_kv_layer`: Optional; Do layerwise KV cache put into KV Pool.
`wait_for_save`: Wait for KV Save to finish if async KV cache save/put.
`get_finished`: Get request that finished KV transfer, `done_sending` if `put` finished, `done_receiving` if `get` finished.
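Putting the worker-side surface together, a skeleton of the interface described above might look like this. Signatures are simplified and illustrative; the real class lives in vllm-ascend and follows vLLM's KV connector base class.

```python
# Simplified skeleton of the worker-side methods listed above; not the real signatures.
class MooncakeStoreConnectorSketch:
    def register_kv_caches(self, kv_caches):
        """Register the KV cache buffers used for transfer."""
        self.kv_caches = kv_caches

    def start_load_kv(self, forward_context):
        """Load KV cache from storage to device for the scheduled requests."""

    def wait_for_layer_load(self, layer_name):
        """Optional: block until a layer finishes loading (layerwise + async KV load)."""

    def save_kv_layer(self, layer_name, kv_layer):
        """Optional: put a single layer's KV cache into the KV Pool (layerwise save)."""

    def wait_for_save(self):
        """Block until asynchronous KV saves/puts have completed."""

    def get_finished(self):
        """Return requests that finished transfer: (done_sending, done_receiving)."""
        return set(), set()
```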
## DFX
1. When looking up a key in KV Pool, if we cannot find the key, there is no Cache Hit for this specific block; we return no hit for this block and do not look up further blocks for current request.
2. Similarly, when we are trying to put a block into KV Pool and failed, we do not put further blocks (subject to change).
1. When looking up a key in KV Pool, if we cannot find the key, there is no Cache Hit for this specific block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when we are trying to put a block into KV Pool and it fails, we do not put further blocks (subject to change).
## Limitations

View File

@@ -33,7 +33,7 @@ The workflow of obtaining inputs:
3. Get `Token IDs`: using token indices to retrieve the Token IDs from **token id table**.
At last, these `Token IDs` are required to be fed into a model, and also, `positions` should be sent into the model to create `Rope` (Rotary positional embedding). Both of them are the inputs of the model.
At last, these `Token IDs` are required to be fed into a model, and `positions` should also be sent into the model to create `Rope` (Rotary positional embedding). Both of them are the inputs of the model.
**Note**: The `Token IDs` are the inputs of a model, so we also call them `Inputs IDs`.
@@ -55,13 +55,13 @@ A model requires these attention metadata during the forward pass:
There are mainly three types of variables.
- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens
- request level: represents one attribute of each scheduled request, whose length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element)
- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens.
- request level: represents one attribute of each scheduled request, whose length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element.)
- system level:
1. **Token IDs table**: stores the token IDs (i.e. the inputs of a model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is the maximum count of concurrent requests allowed in a forward batch and `max model len` is the maximum token count that can be handled at one request sequence in this model.
2. **Block table**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory. The shape of this table is `(max num request, max model len / block size)`
**Note**: Both of these two tables are come from the `_update_states` method before **preparing inputs**. You can take a look if you need more inspiration.
**Note**: Both of these two tables come from the `_update_states` method before **preparing inputs**. You can take a look if you need more inspiration.
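To make the two system-level tables concrete, here is a tiny allocation sketch using the shapes given above; the sizes are arbitrary toy values.

```python
import numpy as np

# Shapes follow the text; the numbers are arbitrary toy values.
max_num_request, max_model_len, block_size = 4, 12, 2
token_ids_table = np.zeros((max_num_request, max_model_len), dtype=np.int32)
block_table = np.zeros((max_num_request, max_model_len // block_size), dtype=np.int32)
print(token_ids_table.shape, block_table.shape)   # (4, 12) (4, 6)
```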
### Tips
@@ -89,18 +89,18 @@ Example of `Token ID`:
Assumptions:
- maximum number of tokens can be scheduled at once: 10
- maximum number of tokens that can be scheduled at once: 10
- `block size`: 2
- Totally schedule 3 requests. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the maximum token count can be handled at one request sequence in a model).
- `max model length`: 12 (the maximum token count that can be handled at one request sequence in a model).
These assumptions are configured in the beginning when starting vLLM. They are not fixed, so you can manually set them.
These assumptions are configured at the beginning when starting vLLM. They are not fixed, so you can manually set them.
### Step 1: All requests in the prefill phase
#### Obtain inputs
As the maximum number of tokens that can be schedules is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that`request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.
As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.
##### 1. Get token positions
@@ -108,7 +108,7 @@ First, determine which request each token belongs to: tokens 0–2 are assigned
For each request, use **the number of computed tokens** + **the relative position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`) and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).
Note: there is more efficient way (using `request indices`) to create positions in actual code.
Note: there is a more efficient way (using `request indices`) to create positions in actual code.
Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
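The same positions can be reproduced with a few lines of NumPy. This mirrors the request-indices trick mentioned in the note; `num_computed_tokens` is all zeros in this prefill step.

```python
import numpy as np

# Reproduce the example: 3 requests scheduling 3, 2 and 5 tokens, no tokens computed yet.
num_scheduled_tokens = np.array([3, 2, 5])
num_computed_tokens = np.array([0, 0, 0])
req_indices = np.repeat(np.arange(3), num_scheduled_tokens)          # [0 0 0 1 1 2 2 2 2 2]
relative_pos = np.concatenate([np.arange(n) for n in num_scheduled_tokens])
positions = num_computed_tokens[req_indices] + relative_pos
print(positions)   # [0 1 2 0 1 0 1 2 3 4]
```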
@@ -116,9 +116,9 @@ Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`.
The shape of the current **Token IDs table** is `(max num request, max model len)`.
Why these `T_3_5`, `T_3_6`, `T_3_7` are in this table without being scheduled?
Why are these `T_3_5`, `T_3_6`, `T_3_7` in this table without being scheduled?
- We fill all Token IDs in one request sequence to this table at once, but we only retrieve the tokens we scheduled this time. Then we retrieve the remain Token IDs next time.
- We fill all Token IDs in one request sequence to this table at once, but we only retrieve the tokens we scheduled this time. Then we retrieve the remaining Token IDs next time.
```shell
| T_0_0 | T_0_1 | T_0_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
@@ -130,7 +130,7 @@ Why these `T_3_5`, `T_3_6`, `T_3_7` are in this table without being scheduled?
......
```
Note that`T_x_x` is an `int32`.
Note that `T_x_x` is an `int32`.
Let's say `M = max model len`. Then we can use `token positions` together with `request indices` of each token to construct `token indices`.
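Continuing the example, a small self-contained sketch of that construction:

```python
import numpy as np

# Token indices flatten (request, position) pairs into the flattened Token IDs table.
M = 12                                                   # max model len from the assumptions
req_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
positions   = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
token_indices = req_indices * M + positions
print(token_indices)   # [ 0  1  2 12 13 24 25 26 27 28]
# The scheduled Token IDs are then gathered as token_ids_table.reshape(-1)[token_indices].
```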
@@ -190,10 +190,10 @@ The workflow of achieving slot mapping:
Details:
1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. So it equal to `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to select `device block number` from `block table`.
2. (**Token level**) Use `block table indices` to select out `device block number` for each scheduled token. The Pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number=[1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`
1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. So it equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to select `device block number` from `block table`.
2. (**Token level**) Use `block table indices` to select out `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number=[1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`
3. (**Token level**) `block offsets` could be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
4. At last, use `block offsets` and `device block number` to create `slot mapping`: `device block number * block size + block_offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`
4. Finally, use `block offsets` and `device block number` to create `slot mapping`: `device block number * block size + block_offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`
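The whole slot-mapping computation from the example can be checked with a few lines of NumPy (`K = max model len / block size = 6` here):

```python
import numpy as np

# Recompute the slot mapping from the example above (block size 2, K = 6).
block_size, K = 2, 6
req_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
positions   = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
block_table = np.zeros((3, K), dtype=np.int64)
block_table[0, 0:2] = [1, 2]                              # request_0 -> device blocks 1, 2
block_table[1, 0]   = 3                                   # request_1 -> device block 3
block_table[2, 0:3] = [4, 5, 6]                           # request_2 -> device blocks 4, 5, 6
block_table_indices = req_indices * K + positions // block_size
block_numbers = block_table.reshape(-1)[block_table_indices]   # [1 1 2 3 3 4 4 5 5 6]
slot_mapping = block_numbers * block_size + positions % block_size
print(slot_mapping)   # [ 2  3  4  6  7  8  9 10 11 12]
```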
(**Request level**) As we know the scheduled token count is `[3, 2, 5]`:
@@ -261,7 +261,7 @@ KV cache block in the device memory:
3. (**Token level**) `block offsets`: `[1, 0, 1, 0, 1]`
4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`
Scheduled token count:`[1, 1, 3]`
Scheduled token count: `[1, 1, 3]`
- `query start location`: `[0, 1, 2, 5]`
@@ -281,6 +281,6 @@ Scheduled token count:`[1, 1, 3]`
## At last
If you understand the step_1 and step_2, you will know the all following steps.
If you understand step_1 and step_2, you will know all the following steps.
Hope this document can help you better understand how vLLM prepares inputs for model forwarding. If you have any good idea, welcome to contribute to us.
Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute to us.

View File

@@ -79,9 +79,9 @@ After computing the results with the local KV cache, the results are updated via
**Tokens Partition in Head-Tail Style**
PCP requires splitting the input sequence and ensure balanced computational load across devices during the prefill phase.
PCP requires splitting the input sequence and ensuring balanced computational load across devices during the prefill phase.
We employ a head-tail style for splitting and concatenation: specifically, the sequence is first padded to a length of `2*pcp_size`, then divided into `2*pcp_size` equal parts.
The first part is merged with the last part, the second part with the second last part, and so on, thereby assigning computationally balanced chunks to each devices.
The first part is merged with the last part, the second part with the second last part, and so on, thereby assigning computationally balanced chunks to each device.
Additionally, since allgather aggregation of KV or Q results in interleaved chunks from different requests, we compute `pcp_allgather_restore_idx` to quickly restore the original order.
These logics are implemented in the function `_update_tokens_for_pcp`.
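For intuition, a toy version of the head-tail partition might look like this. The real logic in `_update_tokens_for_pcp` works on tensors and also produces the restore indices; the padding token below is a placeholder.

```python
# Toy head-tail split: pad to 2*pcp_size chunks, then pair chunk i with chunk (2*pcp_size-1-i).
def head_tail_split(tokens, pcp_size, pad_token=None):
    n_chunks = 2 * pcp_size
    padded = tokens + [pad_token] * ((-len(tokens)) % n_chunks)
    chunk = len(padded) // n_chunks
    chunks = [padded[i * chunk:(i + 1) * chunk] for i in range(n_chunks)]
    return [chunks[i] + chunks[n_chunks - 1 - i] for i in range(pcp_size)]

print(head_tail_split(list(range(8)), pcp_size=2))
# [[0, 1, 6, 7], [2, 3, 4, 5]] -- each device gets one head chunk and its matching tail chunk
```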

View File

@@ -7,8 +7,8 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**
1. **Adjusting Parallel Strategy and Instance Count for P and D Nodes**
Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**.
2. **Optimizing TPOT**
Without disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. disaggregated-prefill solves this by allowing for better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.
2. **Optimizing TPOT**
Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.
---

View File

@@ -4,7 +4,7 @@
When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.
To facilitate reproduction and deployment, Vllm Ascend supported deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.
To facilitate reproduction and deployment, vLLM Ascend supports the deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.
![eplb](../../assets/eplb.png)
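As a loose illustration of the packing idea (not the actual policy in `vllm_ascend/eplb/core/policy`, which also handles redundant replicas and group-limited placement), a greedy balancer could look like this:

```python
# Toy greedy placement: assign experts, heaviest first, to the currently least-loaded NPU.
def pack_experts(expert_loads, num_npus):
    npu_load = [0.0] * num_npus
    placement = [[] for _ in range(num_npus)]
    for eid, load in sorted(enumerate(expert_loads), key=lambda x: -x[1]):
        target = min(range(num_npus), key=lambda i: npu_load[i])
        placement[target].append(eid)
        npu_load[target] += load
    return placement, npu_load

print(pack_experts([9, 1, 5, 3], num_npus=2))   # ([[0], [2, 3, 1]], [9.0, 9.0])
```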

View File

@@ -8,7 +8,7 @@ This section provides an overview of the features implemented in vLLM Ascend. De
patch
ModelRunner_prepare_inputs
disaggregated_prefill
eplb_swift_balancer.md
eplb_swift_balancer
ACL_Graph
KV_Cache_Pool_Guide
add_custom_aclnn_op

View File

@@ -1,8 +1,8 @@
# Npugraph_ex
## How it works?
## How Does It Work?
Optimization based on Fx graphs, can be considered an acceleration solution for the aclgraph mode.
This is an optimization based on Fx graphs, which can be considered an acceleration solution for the aclgraph mode.
You can get its code [here](https://gitcode.com/Ascend/torchair)

View File

@@ -34,7 +34,7 @@ vllm_ascend
## How to write a patch
Before writing a patch, following the principle above, we should patch the least code. If it's necessary, we can patch the code in either **platform** and **worker** folder. Here is an example to patch `distributed` module in vLLM.
Before writing a patch, following the principle above, we should patch the least code. If it's necessary, we can patch the code in either **platform** or **worker** folder. Here is an example to patch `distributed` module in vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both `0.10.0` and `main` of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
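As a rough sketch of what such a patch looks like mechanically, see below. The module path and target symbol in the usage comment are only examples; real patches live under the **platform** and **worker** folders mentioned above and are registered per vLLM version and per process.

```python
import importlib

# Generic monkey-patch helper; names and the target symbol below are illustrative.
def apply_patch(module_name: str, attr: str, wrapper):
    module = importlib.import_module(module_name)
    original = getattr(module, attr)
    setattr(module, attr, wrapper(original))

def with_ascend_tweak(original):
    def patched(*args, **kwargs):
        # Ascend-specific adjustments would go here before delegating to vLLM.
        return original(*args, **kwargs)
    return patched

# Hypothetical usage for the `distributed` example in step 2:
# apply_patch("vllm.distributed.parallel_state", "init_distributed_environment", with_ascend_tweak)
```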

View File

@@ -1,6 +1,6 @@
# Optimization and Tuning
This guide aims to help users to improve vllm-ascend performance on system level. It includes OS configuration, library optimization, deployment guide and so on. Any feedback is welcome.
This guide aims to help users improve vLLM-Ascend performance at the system level. It includes OS configuration, library optimization, deployment guide, and so on. Any feedback is welcome.
## Preparation