[Doc] fix the nit in docs (#6826)
Refresh the doc, fix the nit in the docs
- vLLM version: v0.15.0
- vLLM main: 83b47f67b1
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
@@ -4,7 +4,7 @@
 | Name | Github ID | Date |
 |:-----------:|:-----:|:-----:|
-| Xiyuan Wang| [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
+| Xiyuan Wang | [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
 | Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |
 | Yi Gan| [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/02 |
 | Shoujian Zheng| [@jianzs](https://github.com/jianzs) | 2025/06 |
@@ -166,12 +166,12 @@ Updated on 2025-12-30:
 | 127 | [@wuweiqiang24](https://github.com/wuweiqiang24) | 2025/9/11 | [9615dea](https://github.com/vllm-project/vllm-ascend/commit/9615dea3a71df8ecd2c591f284d9615140dce68a) |
 | 126 | [@wenba0](https://github.com/wenba0) | 2025/9/11 | [bd3dede](https://github.com/vllm-project/vllm-ascend/commit/bd3dedea6123c9c8c19fe83b6e05716f63b1285d) |
 | 125 | [@zhaozx-cn](https://github.com/zhaozx-cn) | 2025/9/11 | [b9a0a75](https://github.com/vllm-project/vllm-ascend/commit/b9a0a75c783571caf22129612fb3338272d1782c) |
-| 124 | [@anon189Ty](https://github.com/anon189Ty) | 2025/9/11| [7b2ecc1](https://github.com/vllm-project/vllm-ascend/commit/7b2ecc1e9a64aeda78e2137aa06abdbf2890c000) |
+| 124 | [@anon189Ty](https://github.com/anon189Ty) | 2025/9/11 | [7b2ecc1](https://github.com/vllm-project/vllm-ascend/commit/7b2ecc1e9a64aeda78e2137aa06abdbf2890c000) |
 | 123 | [@FFFrog](https://github.com/FFFrog) | 2025/9/10 | [b7ee3fd](https://github.com/vllm-project/vllm-ascend/commit/b7ee3fdad30d00d9aaa31be04315838c7e2c24ac) |
 | 122 | [@machenglong2025](https://github.com/machenglong2025) | 2025/9/08 | [1a82b16](https://github.com/vllm-project/vllm-ascend/commit/1a82b16355d2ec0ba01c23935092dc0af323b820) |
 | 121 | [@realliujiaxu](https://github.com/realliujiaxu) | 2025/9/08 | [d3c3538](https://github.com/vllm-project/vllm-ascend/commit/d3c3538ddc67ba8f4873637e2bc1052f9eb09e93) |
 | 120 | [@marcobarlo](https://github.com/marcobarlo) | 2025/9/08 | [6666e52](https://github.com/vllm-project/vllm-ascend/commit/6666e5265d40ecafc3cb377233fee840d7fe553b) |
-| 119 | [@1092626063](https://github.com/1092626063) | 2025/9/05 | [5b3646a](https://github.com/vllm-project/vllm-ascend/commit/5b3646ab2142131579661ce12e4f0e4ba731ad06)|
+| 119 | [@1092626063](https://github.com/1092626063) | 2025/9/05 | [5b3646a](https://github.com/vllm-project/vllm-ascend/commit/5b3646ab2142131579661ce12e4f0e4ba731ad06) |
 | 118 | [@WithHades](https://github.com/WithHades) | 2025/9/04 | [0c0789b](https://github.com/vllm-project/vllm-ascend/commit/0c0789be7442122eb1203abbf89a9592648922e0) |
 | 117 | [@panchao-hub](https://github.com/panchao-hub) | 2025/8/30 | [7215454](https://github.com/vllm-project/vllm-ascend/commit/7215454de6df78f4f9a49a99c5739f8bb360f5bc) |
 | 116 | [@lidenghui1110](https://github.com/lidenghui1110) | 2025/8/29 | [600b08f](https://github.com/vllm-project/vllm-ascend/commit/600b08f7542be3409c2c70927c91471e8de33d03) |
@@ -207,9 +207,9 @@ Updated on 2025-12-30:
 | 86 | [@pkking](https://github.com/pkking) | 2025/7/18 | [3e39d72](https://github.com/vllm-project/vllm-ascend/commit/3e39d7234c0e5c66b184c136c602e87272b5a36e) |
 | 85 | [@lianyiibo](https://github.com/lianyiibo) | 2025/7/18 | [53d2ea3](https://github.com/vllm-project/vllm-ascend/commit/53d2ea3789ffce32bf3ceb055d5582d28eadc6c7) |
 | 84 | [@xudongLi-cmss](https://github.com/xudongLi-cmss) | 2025/7/2 | [7fc1a98](https://github.com/vllm-project/vllm-ascend/commit/7fc1a984890bd930f670deedcb2dda3a46f84576) |
-| 83 | [@ZhengWG](https://github.com/) | 2025/7/7 | [3a469de](https://github.com/vllm-project/vllm-ascend/commit/9c886d0a1f0fc011692090b0395d734c83a469de) |
-| 82 | [@wm901115nwpu](https://github.com/) | 2025/7/7 | [a2a47d4](https://github.com/vllm-project/vllm-ascend/commit/f08c4f15a27f0f27132f4ca7a0c226bf0a2a47d4) |
-| 81 | [@Agonixiaoxiao](https://github.com/) | 2025/7/2 | [6f84576](https://github.com/vllm-project/vllm-ascend/commit/7fc1a984890bd930f670deedcb2dda3a46f84576) |
+| 83 | [@ZhengWG](https://github.com/ZhengWG) | 2025/7/7 | [3a469de](https://github.com/vllm-project/vllm-ascend/commit/9c886d0a1f0fc011692090b0395d734c83a469de) |
+| 82 | [@wm901115nwpu](https://github.com/wm901115nwpu) | 2025/7/7 | [a2a47d4](https://github.com/vllm-project/vllm-ascend/commit/f08c4f15a27f0f27132f4ca7a0c226bf0a2a47d4) |
+| 81 | [@Agonixiaoxiao](https://github.com/Agonixiaoxiao) | 2025/7/2 | [6f84576](https://github.com/vllm-project/vllm-ascend/commit/7fc1a984890bd930f670deedcb2dda3a46f84576) |
 | 80 | [@zhanghw0354](https://github.com/zhanghw0354) | 2025/7/2 | [d3df9a5](https://github.com/vllm-project/vllm-ascend/commit/9fb3d558e5b57a3c97ee5e11b9f5dba6ad3df9a5) |
 | 79 | [@GDzhu01](https://github.com/GDzhu01) | 2025/6/28 | [de256ac](https://github.com/vllm-project/vllm-ascend/commit/b308a7a25897b88d4a23a9e3d583f4ec6de256ac) |
 | 78 | [@leo-pony](https://github.com/leo-pony) | 2025/6/26 | [3f2a5f2](https://github.com/vllm-project/vllm-ascend/commit/10253449120307e3b45f99d82218ba53e3f2a5f2) |
@@ -218,10 +218,10 @@ Updated on 2025-12-30:
 | 75 | [@Pr0Wh1teGivee](https://github.com/Pr0Wh1teGivee) | 2025/6/25 | [c65dd40](https://github.com/vllm-project/vllm-ascend/commit/2fda60464c287fe456b4a2f27e63996edc65dd40) |
 | 74 | [@xleoken](https://github.com/xleoken) | 2025/6/23 | [c604de0](https://github.com/vllm-project/vllm-ascend/commit/4447e53d7ad5edcda978ca6b0a3a26a73c604de0) |
 | 73 | [@lyj-jjj](https://github.com/lyj-jjj) | 2025/6/23 | [5cbd74e](https://github.com/vllm-project/vllm-ascend/commit/5177bef87a21331dcca11159d3d1438075cbd74e) |
-| 72 | [@farawayboat](https://github.com/farawayboat)| 2025/6/21 | [bc7d392](https://github.com/vllm-project/vllm-ascend/commit/097e7149f75c0806774bc68207f0f6270bc7d392)
-| 71 | [@yuancaoyaoHW](https://github.com/yuancaoyaoHW) | 2025/6/20 | [7aa0b94](https://github.com/vllm-project/vllm-ascend/commit/00ae250f3ced68317bc91c93dc1f1a0977aa0b94)
-| 70 | [@songshanhu07](https://github.com/songshanhu07) | 2025/6/18 | [5e1de1f](https://github.com/vllm-project/vllm-ascend/commit/2a70dbbdb8f55002de3313e17dfd595e1de1f)
-| 69 | [@wangyanhui-cmss](https://github.com/wangyanhui-cmss) | 2025/6/12| [40c9e88](https://github.com/vllm-project/vllm-ascend/commit/2a5fb4014b863cee6abc3009f5bc5340c9e88) |
+| 72 | [@farawayboat](https://github.com/farawayboat) | 2025/6/21 | [bc7d392](https://github.com/vllm-project/vllm-ascend/commit/097e7149f75c0806774bc68207f0f6270bc7d392) |
+| 71 | [@yuancaoyaoHW](https://github.com/yuancaoyaoHW) | 2025/6/20 | [7aa0b94](https://github.com/vllm-project/vllm-ascend/commit/00ae250f3ced68317bc91c93dc1f1a0977aa0b94) |
+| 70 | [@songshanhu07](https://github.com/songshanhu07) | 2025/6/18 | [5e1de1f](https://github.com/vllm-project/vllm-ascend/commit/2a70dbbdb8f55002de3313e17dfd595e1de1f) |
+| 69 | [@wangyanhui-cmss](https://github.com/wangyanhui-cmss) | 2025/6/12 | [40c9e88](https://github.com/vllm-project/vllm-ascend/commit/2a5fb4014b863cee6abc3009f5bc5340c9e88) |
 | 68 | [@chenwaner](https://github.com/chenwaner) | 2025/6/11 | [c696169](https://github.com/vllm-project/vllm-ascend/commit/e46dc142bf1180453c64226d76854fc1ec696169) |
 | 67 | [@yzim](https://github.com/yzim) | 2025/6/11 | [aaf701b](https://github.com/vllm-project/vllm-ascend/commit/4153a5091b698c2270d160409e7fee73baaf701b) |
 | 66 | [@Yuxiao-Xu](https://github.com/Yuxiao-Xu) | 2025/6/9 | [6b853f1](https://github.com/vllm-project/vllm-ascend/commit/6b853f15fe69ba335d2745ebcf14a164d0bcc505) |
@@ -1,6 +1,6 @@
 # Versioning Policy
 
-Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) project follows the [PEP 440](https://peps.python.org/pep-0440/) to publish matching with vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).
+Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) project follows [PEP 440](https://peps.python.org/pep-0440/) to publish versions matching vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).
 
 ## vLLM Ascend Plugin versions
 
@@ -28,7 +28,7 @@ The table below is the release compatibility matrix for vLLM Ascend release.
 | v0.13.0rc2 | v0.13.0 | >= 3.10, < 3.12 | 8.5.0 | 2.8.0 / 2.8.0.post1 | 3.2.0 |
 | v0.13.0rc1 | v0.13.0 | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 | |
 | v0.12.0rc1 | v0.12.0 | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 | |
-| v0.11.0 | v0.11.0 | >= 3.9 , < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1.post1 | |
+| v0.11.0 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1.post1 | |
 | v0.11.0rc3 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1.post1 | |
 | v0.11.0rc2 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1 | |
 | v0.11.0rc1 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC1 | 2.7.1 / 2.7.1 | |
 
@@ -103,7 +103,7 @@ If the PR spans more than one category, please include all relevant prefixes.
 ## Others
 
 You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
-If you find any problem when contributing, you can feel free to submit a PR to improve the doc to help other developers.
+If you encounter any problems while contributing, feel free to submit a PR to improve the documentation to help other developers.
 
 :::{toctree}
 :caption: Index
 
@@ -2,9 +2,9 @@
 
 Multi-Node CI is designed to test distributed scenarios of very large models, e.g. disaggregated prefill with multi-DP across multiple nodes.
 
-## How is works
+## How it works
 
-The following picture shows the basic deployment view of the multi-node CI mechanism, It shows how the github action interact with [lws](https://lws.sigs.k8s.io/docs/overview/) (a kind of kubernetes crd resource)
+The following picture shows the basic deployment view of the multi-node CI mechanism. It shows how the GitHub action interacts with [lws](https://lws.sigs.k8s.io/docs/overview/) (a kind of kubernetes crd resource).
 
 
 
@@ -16,7 +16,7 @@ From the workflow perspective, we can see how the final test script is executed,
 
 1. Upload custom weights
 
-If you need customized weights, for example, you quantized a w8a8 weight for DeepSeek-V3 and you want your weight to run on CI, Uploading weights to ModelScope's [vllm-ascend](https://www.modelscope.cn/organization/vllm-ascend) organization is welcome, If you do not have permission to upload, please contact @Potabk
+If you need customized weights, for example, you quantized a w8a8 weight for DeepSeek-V3 and you want your weight to run on CI, uploading weights to ModelScope's [vllm-ascend](https://www.modelscope.cn/organization/vllm-ascend) organization is welcome. If you do not have permission to upload, please contact @Potabk
 
 2. Add config yaml
 
@@ -71,7 +71,8 @@ From the workflow perspective, we can see how the final test script is executed,
 ```
 
 3. Add the case to nightly workflow
-currently, the multi-node test workflow defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/nightly_test_a3.yaml)
+
+Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/nightly_test_a3.yaml)
 
 ```yaml
 multi-node-tests:
@@ -106,7 +107,7 @@ currently, the multi-node test workflow defined in the [nightly_test_a3.yaml](ht
 KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
 ```
 
-The matrix above defines all the parameters required to add a multi-machine use case, The parameters worth paying attention to (I mean if you are adding a new use case) are size and the path to the yaml configuration file. The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2.
+The matrix above defines all the parameters required to add a multi-machine use case. The parameters worth noting (if you are adding a new use case) are `size` and the path to the yaml configuration file. The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2.
 
 ## Run Multi-Node tests locally
 
@@ -2,9 +2,9 @@
 
 This document explains how to write E2E tests and unit tests to verify the implementation of your feature.
 
-## Setup a test environment
+## Set up a test environment
 
-The fastest way to setup a test environment is to use the main branch's container image:
+The fastest way to set up a test environment is to use the main branch's container image:
 
 :::::{tab-set}
 :sync-group: e2e
@@ -178,7 +178,7 @@ TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
 
 ```bash
 cd /vllm-workspace/vllm-ascend/
-# Run all single card the tests
+# Run all single-card tests
 pytest -sv tests/ut
 
 # Run single test
@@ -192,7 +192,7 @@ pytest -sv tests/ut/test_ascend_config.py
 
 ```bash
 cd /vllm-workspace/vllm-ascend/
-# Run all single card the tests
+# Run all multi-card tests
 pytest -sv tests/ut
 
 # Run single test
@@ -223,7 +223,7 @@ You can't run the E2E test on CPUs.
 
 ```bash
 cd /vllm-workspace/vllm-ascend/
-# Run all single card the tests
+# Run all single-card tests
 VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
 
 # Run a certain test script
@@ -240,7 +240,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.
 
 ```bash
 cd /vllm-workspace/vllm-ascend/
-# Run all the single card tests
+# Run all multi-card tests
 VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
 
 # Run a certain test script
@@ -256,7 +256,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_aclgraph_accuracy.
 
 This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
 
-Run nightly multi-node test cases locally refer to section of `Running Locally` of [Multi Node Test](./multi_node_test.md).
+For running nightly multi-node test cases locally, refer to the `Running Locally` section in [Multi Node Test](./multi_node_test.md).
 
 #### E2E test example
 
@@ -53,7 +53,7 @@ INFO: Waiting for application startup.
 INFO: Application startup complete.
 ```
 
-### 2. Run different dataset using AISBench
+### 2. Run different datasets using AISBench
 
 #### Install AISBench
 
@@ -81,7 +81,7 @@ You can choose one or multiple datasets to execute accuracy evaluation.
 
 1. `C-Eval` dataset.
 
-Take `C-Eval` dataset as an example. And you can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Every datasets have a `README.md` for detailed download and installation process.
+Take `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
 
 Download dataset and install it to specific path.
 
@@ -1,6 +1,6 @@
 # Using EvalScope
 
-This document will guide you have model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
+This document will guide you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
 
 ## 1. Online server
 
@@ -1,8 +1,8 @@
 # ACL Graph
 
-## Why we need ACL Graph?
+## Why do we need ACL Graph?
 
-When in LLM inference, each token requires nearly thousand operator executions, and when host launching operators are slower than device, it will cause host bound. In severe cases, the device will be idle for more than half of the time. To solve this problem, we use graph in LLM inference.
+In LLM inference, each token requires nearly a thousand operator executions. When host launching operators are slower than device, it will cause host bound. In severe cases, the device will be idle for more than half of the time. To solve this problem, we use graph in LLM inference.
 
 ```shell
 eager mode:
@@ -29,15 +29,15 @@ ACL Graph is enabled by default in V1 Engine, just need to check that `enforce_e
 
 ## How it works?
 
-In short, graph mode works in two steps: **capture and replay**. When engine starts, we will capture all of the ops in model forward and save it as a graph, and when req come in, we just replay the graph on devices, and waiting for result.
+In short, graph mode works in two steps: **capture and replay**. When the engine starts, we capture all of the ops in the model forward and save it as a graph. When a request comes in, we just replay the graph on the device and wait for the result.
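The capture-and-replay idea can be illustrated with a small, framework-agnostic Python sketch. The `ToyGraph` class here is invented for illustration only: the real mechanism records device kernels through the `ACLGraphWrapper`, not Python calls.

```python
class ToyGraph:
    """Toy stand-in for a device graph: record ops once, replay them later."""

    def __init__(self):
        self.ops = []  # recorded (fn, args) pairs

    def capture(self, fn, *args):
        # Capture phase: run the forward once, recording every op.
        self.ops.append((fn, args))
        return fn(*args)

    def replay(self):
        # Replay phase: re-execute the recorded ops without paying
        # the per-op host launch cost again.
        return [fn(*args) for fn, args in self.ops]

g = ToyGraph()
g.capture(lambda x: x * 2, 21)  # engine start: capture the forward once
print(g.replay())               # request arrives: replay -> [42]
```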
 
 But in reality, graph mode is not that simple.
 
 ### Padding and Bucketing
 
-Due to graph can only replay the ops captured before, without doing tiling and checking graph input, we need to ensure the consistency of the graph input, but we know that model input's shape depends on the request scheduled by Scheduler, we can't ensure the consistency.
+Due to the fact that a graph can only replay the ops captured before, without doing tiling and checking graph input, we need to ensure the consistency of the graph input. However, we know that the model input's shape depends on the request scheduled by the Scheduler, so we can't ensure consistency.
 
-Obviously, we can solve this problem by capturing the biggest shape and padding all of the model input to it. But it will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shape, and pad the model input to the nearest graph, which will greatly reduce redundant computing. But when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. But we know that when intensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of things we need to do is:
+Obviously, we can solve this problem by capturing the biggest shape and padding all of the model inputs to it. But this will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shapes, and pad the model input to the nearest graph, which will greatly reduce redundant computing. But when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. We know that when the input tensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of the things we need to do are:
 
 1. Set a threshold;
 2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
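The padding-and-bucketing rule can be sketched as follows. The bucket sizes and threshold here are made-up values; the real list is derived from the engine configuration:

```python
BUCKETS = [4, 8, 16, 32, 64]  # graph sizes captured at startup (illustrative)
THRESHOLD = 64                # above this, fall back to eager mode

def pick_graph(num_scheduled_tokens: int):
    """Return the bucket size to pad to, or None to run in eager mode."""
    if num_scheduled_tokens > THRESHOLD:
        # Large batches are compute-bound; graph mode is unnecessary.
        return None
    # Pad the model input up to the nearest captured graph size.
    return next(b for b in BUCKETS if b >= num_scheduled_tokens)

print(pick_graph(6))    # 6 tokens are padded up to the 8-token graph
print(pick_graph(100))  # exceeds the threshold, so eager mode (None)
```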
@@ -58,39 +58,39 @@ Obviously, we can solve this problem by capturing the biggest shape and padding
 
 ### Piecewise and Full graph
 
-Due to the increasing complexity of the attention layer in current LLM, we can't ensure all types of attention can run in graph. In MLA, prefill_tokens and decode_tokens have different calculation method, so when a batch has both prefills and decodes in MLA, graph mode is difficult to handle this situation.
+Due to the increasing complexity of the attention layer in current LLMs, we can't ensure all types of attention can run in graph. In MLA, prefill_tokens and decode_tokens have different calculation methods, so when a batch has both prefills and decodes in MLA, graph mode is difficult to handle this situation.
 
-vLLM solves this problem with piecewise graph mode. We use eager mode to launch attention's ops, and use graph to deal with others. But it also bring some problems: The cost of launching ops has become large again, although much smaller than eager mode, but it will also lead to host bound when cpu is poor or `num_tokens` is small.
+vLLM solves this problem with piecewise graph mode. We use eager mode to launch attention's ops, and use graph to deal with others. But this also brings some problems: The cost of launching ops has become large again. Although much smaller than eager mode, it will also lead to host bound when the CPU is poor or `num_tokens` is small.
 
 Altogether, we need to support both piecewise and full graph mode.
 
 1. When attention can run in graph, we tend to choose full graph mode to achieve optimal performance;
-2. When full graph is not work, use piecewise graph as a substitute;
-3. When piecewise graph's performance is not good and full graph mode is blocked, separate prefills and decodes, and use full graph mode in **decode_only** situation. Because when a batch include prefill req, usually `num_tokens` will be quite big and not cause host bound.
+2. When full graph does not work, use piecewise graph as a substitute;
+3. When piecewise graph's performance is not good and full graph mode is blocked, separate prefills and decodes, and use full graph mode in **decode_only** situations. Because when a batch includes prefill requests, usually `num_tokens` will be quite big and not cause host bound.
 
 > Currently, due to stream resource constraint, we can only support a few buckets in piecewise graph mode now, which will cause redundant computing and may lead to performance degradation compared with eager mode.
 
-## How it be implemented?
+## How is it implemented?
 
 vLLM has already implemented most of the modules in graph mode. You can see more details at: [CUDA Graphs](https://docs.vllm.ai/en/latest/design/cuda_graphs.html)
 
-When in graph mode, vLLM will call `current_platform.get_static_graph_wrapper_cls` to get current device's graph model wrapper, so what we need to do is to implement the graph mode wrapper on Ascend: `ACLGraphWrapper`.
+When in graph mode, vLLM will call `current_platform.get_static_graph_wrapper_cls` to get the current device's graph model wrapper, so what we need to do is implement the graph mode wrapper on Ascend: `ACLGraphWrapper`.
 
-vLLM has added `support_torch_compile` decorator to all models, this decorator will replace the `__init__` and `forward` interface of the model class, and when `forward` called, the code inside the `ACLGraphWrapper` will be executed, and it will do capture or replay as mentioned above.
+vLLM has added `support_torch_compile` decorator to all models. This decorator will replace the `__init__` and `forward` interface of the model class. When `forward` is called, the code inside the `ACLGraphWrapper` will be executed, and it will do capture or replay as mentioned above.
 
-When use piecewise graph, we just need to follow the above-mentioned process, but when in full graph, due to the complexity of the attention, sometimes we need to update attention op's param before execution. So we implement `update_attn_params` and `update_mla_attn_params` funcs for full graph mode. And when forward, memory will be reused between different ops, so we can't update attention op's param before forward. In ACL Graph, we use `torch.npu.graph_task_update_begin` and `torch.npu.graph_task_update_end` to do it, and use `torch.npu.ExternalEvent` to ensure order between params update and ops execution.
+When using piecewise graph, we just need to follow the above-mentioned process. But when in full graph, due to the complexity of the attention, sometimes we need to update attention op's params before execution. So we implement `update_attn_params` and `update_mla_attn_params` functions for full graph mode. During forward, memory will be reused between different ops, so we can't update attention op's params before forward. In ACL Graph, we use `torch.npu.graph_task_update_begin` and `torch.npu.graph_task_update_end` to do it, and use `torch.npu.ExternalEvent` to ensure order between param updates and op executions.
 
 ## DFX
 
 ### Stream resource constraint
 
-Currently, we can only capture 1800 graphs at most, due to the limitation of ACL graph that a graph requires a separate stream at least. This number is bounded by the number of streams, which is 2048, we save 248 streams as a buffer. Besides, there are many variables that can affect the number of buckets:
+Currently, we can only capture 1800 graphs at most, due to the limitation of ACL graph that a graph requires at least a separate stream. This number is bounded by the number of streams, which is 2048; we save 248 streams as a buffer. Besides, there are many variables that can affect the number of buckets:
 
-+ Piecewise graph will divides the model into `num_hidden_layers + 1` sub modules, based on attention layer. Every sub module is a single graph which need to cost stream, so the number of buckets in piecewise graph mode is very tight compared with full graph mode.
++ Piecewise graph divides the model into `num_hidden_layers + 1` sub modules, based on the attention layer. Every sub module is a single graph which needs to cost a stream, so the number of buckets in piecewise graph mode is very tight compared with full graph mode.
 
 + The number of streams required for a graph is related to the number of comm domains. Each comm domain will increase one stream consumed by a graph.
 
-+ When multi-stream is explicitly called in sub module, it will consumes an additional stream.
++ When multi-stream is explicitly called in a sub module, it will consume an additional stream.
 
 There are some other rules about ACL Graph and stream. Currently, we use func `update_aclgraph_sizes` to calculate the maximum number of buckets and update `graph_batch_sizes` to ensure stream resource is sufficient.
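Using the numbers above (2048 streams total, 248 reserved as a buffer, at least one stream per graph), the bucket budget can be estimated roughly as below. This is a simplification of what `update_aclgraph_sizes` accounts for: it ignores comm domains and explicit multi-stream usage.

```python
TOTAL_STREAMS = 2048
RESERVED = 248
MAX_GRAPHS = TOTAL_STREAMS - RESERVED  # at most 1800 graphs

def max_buckets(num_hidden_layers: int, piecewise: bool) -> int:
    # Piecewise mode splits the model into num_hidden_layers + 1 sub-module
    # graphs per bucket; full graph mode captures one graph per bucket.
    graphs_per_bucket = num_hidden_layers + 1 if piecewise else 1
    return MAX_GRAPHS // graphs_per_bucket

print(max_buckets(32, piecewise=False))  # full graph: 1800 buckets
print(max_buckets(32, piecewise=True))   # piecewise: only 54 buckets
```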
@@ -4,19 +4,19 @@
 
 Prefix caching is an important feature in LLM inference that can reduce prefill computation time drastically.
 
-However, the performance gain from prefix caching is highly dependent on cache hit rate, while cache hit rate can be limited if one only uses HBM for kv cache storage.
+However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses HBM for KV cache storage.
 
-Hence, KV Cache Pool is proposed to utilize various types of storages including HBM,DRAM and SSD, making a pool for KV Cache storage, while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests.
+Hence, KV Cache Pool is proposed to utilize various types of storage including HBM, DRAM, and SSD, making a pool for KV Cache storage while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests.
 
-vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake): one of the most recognized KV Cache storage engine;
+vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines.
 
-While one can utilize mooncake store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports mooncake store and can utilize the data transfer strategy to one that is best fit to Huawei NPU hardware.
+While one can utilize Mooncake Store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports Mooncake Store and can utilize the data transfer strategy that best fits Huawei NPU hardware.
 
 Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
 
 ## Usage
 
-vLLM Ascend Currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, one needs to config `kv-transfer-config` and choose `MooncakeStoreConnector` as KV Connector.
+vLLM Ascend currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
 
 For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
 
@@ -33,25 +33,25 @@ When combined with vLLM’s Prefix Caching mechanism, the pool enables efficient
 Prefix Caching with HBM is already supported by the vLLM V1 Engine.
 By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
 
-The user can enable both features simply by enabling Prefix Caching, which is enabled by default in vLLM V1 unless the --no_enable_prefix_caching flag is set, and setting up the KV Connector for KV Pool(e.g. the MooncakeStoreConnector)
+The user can enable both features simply by enabling Prefix Caching, which is enabled by default in vLLM V1 unless the `--no_enable_prefix_caching` flag is set, and setting up the KV Connector for KV Pool (e.g., the MooncakeStoreConnector).
 
 **Workflow**:
 
 1. The engine first checks for prefix hits in the HBM cache.
 
-2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector, if there is additional hits in KV Pool, we get the **additional blocks only** from KV Pool, and get the rest of the blocks directly from HBM to minimize the data transfer latency.
+2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from HBM to minimize the data transfer latency.
 
-3. After the KV Caches in KV Pool is load into HBM, the remaining process is the same as Prefix Caching in HBM.
+3. After the KV Caches in the KV Pool are loaded into HBM, the remaining process is the same as Prefix Caching in HBM.
|
||||
|
||||
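The three-step lookup above can be sketched as follows. This is a hedged, self-contained illustration, not the actual vLLM code path: the names `resolve_prefix_cache` and `pool_connector.contains` are placeholders for this sketch.

```python
def resolve_prefix_cache(block_hashes, hbm_cache, pool_connector, block_size):
    """Illustrative sketch: combine HBM prefix hits with KV Pool hits."""
    # Step 1: longest prefix of blocks already resident in HBM.
    hbm_hits = 0
    for h in block_hashes:
        if h in hbm_cache:
            hbm_hits += 1
        else:
            break

    # Step 2: ask the pool only about blocks *beyond* the HBM hits.
    pool_hits = hbm_hits
    for h in block_hashes[hbm_hits:]:
        if pool_connector.contains(h):
            pool_hits += 1
        else:
            break

    # Step 3: only the additional blocks are fetched from the pool; the
    # first `hbm_hits` blocks are read directly from HBM.
    blocks_from_pool = block_hashes[hbm_hits:pool_hits]
    return hbm_hits * block_size, pool_hits * block_size, blocks_from_pool
```

With blocks `a`, `b` in HBM and `a`, `b`, `c` in the pool, only block `c` would be transferred from the pool.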
### 2. Combining KV Cache Pool with Mooncake PD Disaggregation

When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cache Pool can further decouple prefill and decode stages across devices or nodes.

Currently, we only perform put and get operations of the KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from the Mooncake P2P KV Connector, i.e., MooncakeConnector.

The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with the P2P KV Connector that transfers KV Caches between NPU devices directly.

To enable this feature, we need to set up both Mooncake Connector and Mooncake Store Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.

For details, please also refer to the Mooncake Connector Store Deployment Guide.
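As a hedged illustration of the Multi Connector setup described above, a prefill-node launch might combine the two connectors like this. The field layout follows vLLM's MultiConnector convention, but the roles, ordering, and `<model>` placeholder here are assumptions for this sketch, not a verified deployment:

```shell
vllm serve <model> \
    --kv-transfer-config '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
            "connectors": [
                {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},
                {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"}
            ]
        }
    }'
```

Refer to the deployment guide above for the exact, supported configuration.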
@@ -72,17 +72,17 @@ The KV Connector methods that need to be implemented can be categorized into sch

### Connector Worker-Side Methods

`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform the KV cache load operation that transfers KV cache from storage to device.
`wait_for_layer_load`: Optional; Wait for layer load in the layerwise + async KV load scenario.
`save_kv_layer`: Optional; Do layerwise KV cache put into the KV Pool.
`wait_for_save`: Wait for KV Save to finish in the case of async KV cache save/put.
`get_finished`: Get requests that finished KV transfer: `done_sending` if `put` finished, `done_receiving` if `get` finished.
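A minimal worker-side skeleton implementing these hooks might look like the following. This is a simplified sketch for orientation only; the class name and signatures are illustrative and do not match the actual vLLM `KVConnectorBase_V1` interface:

```python
class SketchStoreConnector:
    """Simplified worker-side KV connector skeleton (illustrative only)."""

    def __init__(self):
        self.kv_caches = {}          # layer name -> device buffer
        self._done_sending = set()   # request ids whose put finished
        self._done_receiving = set() # request ids whose get finished

    def register_kv_caches(self, kv_caches):
        # Remember the device buffers so later load/save can address them.
        self.kv_caches = kv_caches

    def start_load_kv(self, request_blocks):
        # Kick off (possibly async) transfers from the pool into HBM.
        for req_id, blocks in request_blocks.items():
            self._done_receiving.add(req_id)  # pretend the get finished

    def wait_for_layer_load(self, layer_name):
        pass  # only needed for layerwise + async load

    def save_kv_layer(self, layer_name, kv_layer):
        pass  # optional layerwise put into the KV Pool

    def wait_for_save(self):
        pass  # block until async puts complete

    def get_finished(self):
        # Report and clear the finished request sets.
        done_sending, self._done_sending = self._done_sending, set()
        done_receiving, self._done_receiving = self._done_receiving, set()
        return done_sending, done_receiving
```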
## DFX

1. When looking up a key in the KV Pool, if we cannot find the key, there is no Cache Hit for this specific block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when we are trying to put a block into the KV Pool and it fails, we do not put further blocks (subject to change).

## Limitations
@@ -33,7 +33,7 @@ The workflow of obtaining inputs:

3. Get `Token IDs`: using token indices to retrieve the Token IDs from the **token id table**.

At last, these `Token IDs` are required to be fed into a model, and `positions` should also be sent into the model to create `Rope` (Rotary positional embedding). Both of them are the inputs of the model.

**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
@@ -55,13 +55,13 @@ A model requires these attention metadata during the forward pass:

There are mainly three types of variables.

- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens.
- request level: represents one attribute of each scheduled request, whose length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element.)
- system level:
  1. **Token IDs table**: stores the token IDs (i.e., the inputs of a model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is the maximum count of concurrent requests allowed in a forward batch and `max model len` is the maximum token count that can be handled in one request sequence in this model.
  2. **Block table**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory. The shape of this table is `(max num request, max model len / block size)`.

**Note**: Both of these two tables come from the `_update_states` method before **preparing inputs**. You can take a look if you need more inspiration.

### Tips
@@ -89,18 +89,18 @@ Example of `Token ID`:

Assumptions:

- maximum number of tokens that can be scheduled at once: 10
- `block size`: 2
- Totally schedule 3 requests. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the maximum token count that can be handled in one request sequence in a model).

These assumptions are configured at the beginning when starting vLLM. They are not fixed, so you can manually set them.

### Step 1: All requests in the prefill phase

#### Obtain inputs

As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.

##### 1. Get token positions
@@ -108,7 +108,7 @@ First, determine which request each token belongs to: tokens 0–2 are assigned

For each request, use **the number of computed tokens** + **the relative position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1, ..., 0 + 4]`) and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).

Note: there is a more efficient way (using `request indices`) to create positions in actual code.

Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
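The note above mentions a more efficient way using `request indices`. A hedged numpy sketch of that trick, using the values from the running example (the real vLLM code differs in details):

```python
import numpy as np

num_scheduled_tokens = np.array([3, 2, 5])   # tokens per request this step
num_computed_tokens = np.array([0, 0, 0])    # tokens already computed earlier

# Request index of every scheduled token: [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
request_indices = np.repeat(np.arange(3), num_scheduled_tokens)

# Relative position of each token inside its request's scheduled chunk:
# subtract each request's start offset from a global arange.
starts = np.cumsum(num_scheduled_tokens) - num_scheduled_tokens  # [0, 3, 5]
relative = np.arange(num_scheduled_tokens.sum()) - starts[request_indices]

positions = num_computed_tokens[request_indices] + relative
print(positions)  # [0 1 2 0 1 0 1 2 3 4]
```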
@@ -116,9 +116,9 @@ Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`.

The shape of the current **Token IDs table** is `(max num request, max model len)`.

Why are these `T_3_5`, `T_3_6`, `T_3_7` in this table without being scheduled?

- We fill all Token IDs in one request sequence to this table at once, but we only retrieve the tokens we scheduled this time. Then we retrieve the remaining Token IDs next time.

```shell
| T_0_0 | T_0_1 | T_0_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
@@ -130,7 +130,7 @@ Why these `T_3_5`, `T_3_6`, `T_3_7` are in this table without being scheduled?
......
```

Note that `T_x_x` is an `int32`.

Let's say `M = max model len`. Then we can use `token positions` together with `request indices` of each token to construct `token indices`.
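With `M = max model len`, the gather described above can be sketched in numpy. The token ID values such as `101` are hypothetical stand-ins for the `T_x_x` entries:

```python
import numpy as np

M = 12  # max model len
# Token IDs table; -1 marks unfilled slots (the "?" cells above).
token_ids_table = np.full((3, M), -1, dtype=np.int32)
token_ids_table[0, :3] = [101, 102, 103]                      # request_0
token_ids_table[1, :2] = [201, 202]                           # request_1
token_ids_table[2, :8] = [301, 302, 303, 304, 305, 306, 307, 308]  # request_2

positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])

# token indices = positions + request indices * M, then gather from the
# flattened table to get the Input IDs for this forward batch.
token_indices = positions + request_indices * M
input_ids = token_ids_table.reshape(-1)[token_indices]
print(input_ids)  # [101 102 103 201 202 301 302 303 304 305]
```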
@@ -190,10 +190,10 @@ The workflow of achieving slot mapping:

Details:

1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. So it equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to select `device block number` from `block table`.
2. (**Token level**) Use `block table indices` to select out `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`.
3. (**Token level**) `block offsets` could be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
4. Finally, use `block offsets` and `device block number` to create `slot mapping`: `device block number * block size + block_offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`.

(**Request level**) As we know the scheduled token count is `[3, 2, 5]`:
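The four steps above can be written as a few vectorized lines. This is a hedged sketch over the example's numbers; the block table contents here are filled in only for the entries the example uses:

```python
import numpy as np

K = 6            # max model len / block size = 12 / 2
block_size = 2
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])

# Flattened block table (3 requests x K slots); unused entries stay 0.
block_table = np.zeros(3 * K, dtype=np.int64)
block_table[[0, 1]] = [1, 2]            # request_0 -> device blocks 1, 2
block_table[[6]] = [3]                  # request_1 -> device block 3
block_table[[12, 13, 14]] = [4, 5, 6]   # request_2 -> device blocks 4, 5, 6

block_table_indices = request_indices * K + positions // block_size  # step 1
block_numbers = block_table[block_table_indices]                     # step 2
block_offsets = positions % block_size                               # step 3
slot_mapping = block_numbers * block_size + block_offsets            # step 4
print(slot_mapping)  # [ 2  3  4  6  7  8  9 10 11 12]
```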
@@ -261,7 +261,7 @@ KV cache block in the device memory:

3. (**Token level**) `block offsets`: `[1, 0, 1, 0, 1]`
4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`

Scheduled token count: `[1, 1, 3]`

- `query start location`: `[0, 1, 2, 5]`
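As the example shows, `query start location` is an exclusive prefix sum of the scheduled token counts with a leading zero, which is why it has one more element than the number of requests:

```python
import numpy as np

num_scheduled_tokens = np.array([1, 1, 3])
query_start_loc = np.zeros(len(num_scheduled_tokens) + 1, dtype=np.int64)
np.cumsum(num_scheduled_tokens, out=query_start_loc[1:])
print(query_start_loc)  # [0 1 2 5]
# The tokens of request i occupy the slice
# [query_start_loc[i], query_start_loc[i + 1]) of the token-level arrays.
```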
@@ -281,6 +281,6 @@ Scheduled token count:`[1, 1, 3]`

## At last

If you understand step_1 and step_2, you will know all the following steps.

Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute to us.
@@ -79,9 +79,9 @@ After computing the results with the local KV cache, the results are updated via

**Tokens Partition in Head-Tail Style**

PCP requires splitting the input sequence and ensuring balanced computational load across devices during the prefill phase.
We employ a head-tail style for splitting and concatenation: specifically, the sequence is first padded to a multiple of `2*pcp_size`, then divided into `2*pcp_size` equal parts.
The first part is merged with the last part, the second part with the second last part, and so on, thereby assigning computationally balanced chunks to each device.
Additionally, since allgather aggregation of KV or Q results in interleaved chunks from different requests, we compute `pcp_allgather_restore_idx` to quickly restore the original order.

These logics are implemented in the function `_update_tokens_for_pcp`.
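The head-tail split can be sketched as follows. This is a hedged standalone illustration of the partition scheme only, not the real `_update_tokens_for_pcp`:

```python
def head_tail_partition(tokens, pcp_size, pad=0):
    """Split `tokens` into `pcp_size` balanced head+tail shards."""
    parts = 2 * pcp_size
    # Pad to a multiple of 2 * pcp_size so the split is even.
    rem = (-len(tokens)) % parts
    tokens = tokens + [pad] * rem
    n = len(tokens) // parts
    chunks = [tokens[i * n:(i + 1) * n] for i in range(parts)]
    # Rank r gets chunk r (head) merged with chunk parts-1-r (tail).
    return [chunks[r] + chunks[parts - 1 - r] for r in range(pcp_size)]

shards = head_tail_partition(list(range(8)), pcp_size=2)
print(shards)  # [[0, 1, 6, 7], [2, 3, 4, 5]]
```

Pairing an early chunk with a late one balances the causal-attention cost: early tokens attend to little context, late tokens to a lot.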
@@ -7,8 +7,8 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**

1. **Adjusting Parallel Strategy and Instance Count for P and D Nodes**

   Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**.

2. **Optimizing TPOT**

   Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system’s **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.

---
@@ -4,7 +4,7 @@

When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.

To facilitate reproduction and deployment, vLLM Ascend supports the deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.


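To give a feel for the packing idea, here is a hedged greedy longest-processing-time sketch: always place the next-heaviest expert on the currently least-loaded NPU. The real policy in `vllm_ascend/eplb/core/policy` is more involved (it also handles replication of heavy experts and group-aware placement), which this toy omits:

```python
import heapq

def place_experts(expert_loads, num_npus):
    """Greedy LPT placement of experts onto NPUs by estimated load."""
    npus = [(0, i, []) for i in range(num_npus)]  # (load, npu id, experts)
    heapq.heapify(npus)
    # Heaviest experts first, each onto the least-loaded NPU so far.
    for e in sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e]):
        load, i, experts = heapq.heappop(npus)
        experts.append(e)
        heapq.heappush(npus, (load + expert_loads[e], i, experts))
    return {i: experts for _, i, experts in npus}

plan = place_experts([9, 7, 6, 5, 4, 3, 2, 1], num_npus=2)
```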
@@ -8,7 +8,7 @@ This section provides an overview of the features implemented in vLLM Ascend. De

patch
ModelRunner_prepare_inputs
disaggregated_prefill
eplb_swift_balancer
ACL_Graph
KV_Cache_Pool_Guide
add_custom_aclnn_op
@@ -1,8 +1,8 @@

# Npugraph_ex

## How Does It Work?

This is an optimization based on Fx graphs, which can be considered an acceleration solution for the aclgraph mode.

You can get its code [here](https://gitcode.com/Ascend/torchair).
@@ -34,7 +34,7 @@ vllm_ascend

## How to write a patch

Before writing a patch, following the principle above, we should patch the least code. If it's necessary, we can patch the code in either the **platform** or **worker** folder. Here is an example to patch the `distributed` module in vLLM.

1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both `0.10.0` and `main` of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
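The patching pattern itself is plain monkeypatching: keep a handle to the original attribute, install a wrapper, and delegate. A minimal self-contained sketch, using a stand-in module object instead of the real vLLM `distributed` module:

```python
import types

# Stand-in for the vLLM module being patched; illustrative only.
distributed = types.SimpleNamespace(all_reduce=lambda xs: sum(xs))

_original_all_reduce = distributed.all_reduce  # keep the original

def patched_all_reduce(xs):
    # A platform-specific patch would adjust behavior here, before or
    # after delegating to the original implementation.
    return _original_all_reduce(xs)

distributed.all_reduce = patched_all_reduce  # install the patch
print(distributed.all_reduce([1, 2, 3]))  # 6
```

In vllm-ascend, such patch modules are grouped by vLLM version and by process (platform vs. worker), per the layout above.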
@@ -1,6 +1,6 @@

# Optimization and Tuning

This guide aims to help users improve vLLM Ascend performance at the system level. It includes OS configuration, library optimization, deployment guide, and so on. Any feedback is welcome.

## Preparation
@@ -9,7 +9,7 @@

### 1. What devices are currently supported?

Currently, **ONLY** Atlas A2 series (Ascend-cann-kernels-910b), Atlas A3 series (Atlas-A3-cann-kernels) and Atlas 300I (Ascend-cann-kernels-310p) series are supported:

- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
@@ -23,7 +23,7 @@ Below series are NOT supported yet:

- Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet

From a technical view, vllm-ascend can support a device if torch-npu supports it. Otherwise, we have to implement it by using custom ops. We also welcome you to join us to improve together.

### 2. How to get our docker containers?
@@ -108,7 +108,7 @@ If all above steps are not working, feel free to submit a GitHub issue.

### 8. Does vllm-ascend support Prefill Disaggregation feature?

Yes, vllm-ascend supports the Prefill Disaggregation feature with the Mooncake backend. See the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html) for an example.

### 9. Does vllm-ascend support quantization method?
@@ -164,7 +164,7 @@ VLLM_TARGET_DEVICE=empty pip install -v -e .

cd ..

# Install vLLM Ascend.
git clone --depth 1 --branch |vllm_ascend_version| https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git submodule update --init --recursive
pip install -v -e .
@@ -82,7 +82,7 @@ yum update -y && yum install -y curl

::::
:::::

The default workdir is `/workspace`. vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developers make changes effective immediately without requiring a new installation.

## Usage
@@ -2,15 +2,15 @@

## Introduction

[GLM-5](https://huggingface.co/zai-org/GLM-5) uses a Mixture-of-Experts (MoE) architecture and targets complex systems engineering and long-horizon agentic tasks.

This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.

## Supported Features

Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.

Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.

## Environment Preparation
@@ -241,7 +241,7 @@ The parameters are explained as follows:

### Multi-node Deployment

If you want to deploy a multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).

:::::{tab-set}
:sync-group: install
@@ -450,7 +450,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \

::::
:::::

- For bf16 weight, use this script on each node to enable [Multi Token Prediction (MTP)](../../user_guide/feature_guide/Multi_Token_Prediction.md).

```shell
python adjust_weight.py "path_of_bf16_weight"
@@ -518,7 +518,7 @@ Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.

2. After execution, you can get the result.
@@ -530,7 +530,7 @@ Not test yet.

### Using AISBench

Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

### Using vLLM Benchmark
@@ -1,5 +1,31 @@

# Kimi-K2-Thinking

## Introduction

Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels in complex reasoning and problem-solving tasks.

This document will show the main verification steps of the model, including supported features, environment preparation, single-node deployment, and functional verification.

## Supported Features

Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.

Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.

## Environment Preparation

### Model Weight

- `Kimi-K2-Thinking` (bfloat16): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://huggingface.co/moonshotai/Kimi-K2-Thinking).

It is recommended to download the model weight to a shared directory, such as `/mnt/sfs_turbo/.cache/`.

### Installation

You can use our official docker image to run `Kimi-K2-Thinking` directly.

Select an image based on your machine type and start the docker image on your node; refer to [using docker](../../installation.md#set-up-using-docker).

## Run with Docker

```{code-block} bash
@@ -90,7 +116,7 @@ For an Atlas 800 A3 (64G*16) node, tensor-parallel-size should be at least 16.

vllm serve Kimi-K2-Thinking \
    --served-model-name kimi-k2-thinking \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --trust-remote-code \
    --no-enable-prefix-caching
```
@@ -8,9 +8,9 @@ This document provides a detailed workflow for the complete deployment and verif

## Supported Features

Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.

Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.

## Environment Preparation
@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen2.5-7B-Instruct` (BF16 version): requires 1 Atlas 910B4 (32G × 1) card. [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)

It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
@@ -48,7 +48,7 @@ Run the following script to start the vLLM server on Multi-NPU:

For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32 GB of memory, tensor-parallel-size should be at least 4.

```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable-expert-parallel
```

Once your server is started, you can query the model with input prompts.
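For example, a request against the OpenAI-compatible chat endpoint might look like this (the port assumes the default `8000`; adjust to your deployment):

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
    }'
```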
@@ -43,7 +43,7 @@ vllm serve Qwen/Qwen3-235B-A22 \

#### Initial Setup (Record Expert Map)

We need to add the environment variable `export EXPERT_MAP_RECORD="true"` to record the expert map. Generate the initial expert distribution map using `expert_map_record_path`. This creates a baseline configuration for future deployments.

```shell
vllm serve Qwen/Qwen3-235B-A22 \
@@ -22,7 +22,7 @@ This tutorial will introduce the usage of them.

pip install fastapi httpx uvicorn
```

## Starting External DP Servers

First, you need to have at least two vLLM servers running in data parallel. These can be mock servers or actual vLLM servers. Note that this proxy also works with only one vLLM server running, but it will fall back to direct request forwarding, which is meaningless.
@@ -267,10 +267,10 @@ Currently, the key-value pool in PD Disaggregate only stores the kv cache genera

    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_consumer",
    "kv_connector_extra_config": {
        "lookup_rpc_port": "0",
        "backend": "mooncake",
        "consumer_is_to_put": true,
        "prefill_pp_size": 2,
        "prefill_pp_layer_partition": "30,31"
    }
}
@@ -164,7 +164,7 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \

    "kv_parallel_size": "1",
    "kv_port": "20001",
    "engine_id": "0"
}' \
    --additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
```
@@ -8,10 +8,10 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor

You can run LoRA with ACLGraph mode now. Please refer to [Graph Mode Guide](./graph_mode.md) for better LoRA performance.

Address for downloading models:

- base model: <https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files>
- lora model: <https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files>

## Example