[v0.11.0][Doc] Update doc (#3852)

### What this PR does / why we need it?
Update doc


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
This commit is contained in:
zhangxinyuehfad
2025-10-29 11:32:12 +08:00
committed by GitHub
parent 6188450269
commit 75de3fa172
49 changed files with 724 additions and 701 deletions


@@ -1,16 +1,16 @@
# Contributing
## Building and testing
It's recommended to set up a local development environment to build and test
## Building and Testing
It's recommended to set up a local development environment to build vllm-ascend and run tests
before you submit a PR.
### Setup development environment
### Set up a development environment
Theoretically, the vllm-ascend build is only supported on Linux because
`vllm-ascend` dependency `torch_npu` only supports Linux.
But you can still set up dev env on Linux/Windows/macOS for linting and basic
test as following commands:
But you can still set up a development environment on Linux/Windows/macOS for linting and running basic
tests.
#### Run lint locally
@@ -27,13 +27,13 @@ cd vllm-ascend
# Install lint requirement and enable pre-commit hook
pip install -r requirements-lint.txt
# Run lint (You need install pre-commits deps via proxy network at first time)
# Run lint (You need to install pre-commit deps via a proxy network the first time)
bash format.sh
```
#### Run CI locally
After complete "Run lint" setup, you can run CI locally:
After completing "Run lint" setup, you can run CI locally:
```{code-block} bash
:substitutions:
@@ -68,9 +68,9 @@ git commit -sm "your commit info"
🎉 Congratulations! You have completed the development environment setup.
### Test locally
### Testing locally
You can refer to [Testing](./testing.md) doc to help you setup testing environment and running tests locally.
You can refer to [Testing](./testing.md) to set up a testing environment and run tests locally.
## DCO and Signed-off-by
@@ -88,7 +88,7 @@ Only specific types of PRs will be reviewed. The PR title is prefixed appropriat
- `[Platform]` for new features or optimization in platform.
- `[Worker]` for new features or optimization in worker.
- `[Core]` for new features or optimization in the core vllm-ascend logic (such as platform, attention, communicators, model runner)
- `[Kernel]` changes affecting compute kernels and ops.
- `[Kernel]` for changes affecting compute kernels and ops.
- `[Bugfix]` for bug fixes.
- `[Doc]` for documentation fixes and improvements.
- `[Test]` for tests (such as unit tests).


@@ -1,10 +1,10 @@
# Testing
This secition explains how to write e2e tests and unit tests to verify the implementation of your feature.
This document explains how to write E2E tests and unit tests to verify the implementation of your feature.
## Setup test environment
## Set up a test environment
The fastest way to setup test environment is to use the main branch container image:
The fastest way to set up a test environment is to use the main branch's container image:
:::::{tab-set}
:sync-group: e2e
@@ -13,7 +13,7 @@ The fastest way to setup test environment is to use the main branch container im
:selected:
:sync: cpu
You can run the unit tests on CPU with the following steps:
You can run the unit tests on CPUs with the following steps:
```{code-block} bash
:substitutions:
@@ -22,7 +22,7 @@ cd ~/vllm-project/
# ls
# vllm vllm-ascend
# Use mirror to speedup download
# Use mirror to speed up download
# docker pull quay.nju.edu.cn/ascend/cann:|cann_image_tag|
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm --name vllm-ascend-ut \
@@ -30,7 +30,7 @@ docker run --rm --name vllm-ascend-ut \
-v ~/.cache:/root/.cache \
-ti $IMAGE bash
# (Optional) Configure mirror to speedup download
# (Optional) Configure mirror to speed up download
sed -i 's|ports.ubuntu.com|mirrors.huaweicloud.com|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple/
@@ -136,13 +136,13 @@ pip install -r requirements-dev.txt
## Running tests
### Unit test
### Unit tests
There are several principles to follow when writing unit tests:
- The test file path should be consistent with source file and start with `test_` prefix, such as: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend test are using unittest framework, see [here](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPU, so you must mock the device-related function to host.
- The test file path should be consistent with the source file and start with the `test_` prefix, such as: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend tests use the unittest framework. See [here](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPUs, so you must mock device-related functions to run on the host.
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
@@ -161,7 +161,7 @@ TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
::::
::::{tab-item} Single card
::::{tab-item} Single-card
:sync: single
```bash
@@ -175,7 +175,7 @@ pytest -sv tests/ut/test_ascend_config.py
::::
::::{tab-item} Multi cards test
::::{tab-item} Multi-card
:sync: multi
```bash
@@ -193,7 +193,7 @@ pytest -sv tests/ut/test_ascend_config.py
### E2E test
Although vllm-ascend CI provide [e2e test](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can run it
Although vllm-ascend CI provides the [E2E test](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can run it
locally.
:::::{tab-set}
@@ -202,10 +202,10 @@ locally.
::::{tab-item} Local (CPU)
:sync: cpu
You can't run e2e test on CPU.
You can't run the E2E test on CPUs.
::::
::::{tab-item} Single card
::::{tab-item} Single-card
:selected:
:sync: single
@@ -223,12 +223,12 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.
::::
::::{tab-item} Multi cards test
::::{tab-item} Multi-card
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card the tests
# Run all the single card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/
# Run a certain test script
@@ -242,7 +242,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.p
:::::
This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
#### E2E test example:
@@ -251,8 +251,8 @@ This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-pr
- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)
The CI resource is limited, you might need to reduce layer number of the model, below is an example of how to generate a reduced layer model:
1. Fork the original model repo in modelscope, we need all the files in the repo except for weights.
The CI resource is limited, and you might need to reduce the number of layers of a model. Below is an example of how to generate a reduced layer model:
1. Fork the original model repo in modelscope. All the files in the repo except for weights are required.
2. Set `num_hidden_layers` to the expected number of layers, e.g., `{"num_hidden_layers": 2,}`
3. Copy the following python script as `generate_random_weight.py`. Set the relevant parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and `DIST_MODEL_PATH` as needed:
@@ -275,11 +275,11 @@ This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-pr
### Run doctest
vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` command to run all doctests in the doc files.
The doctest is a good way to make sure the docs are up to date and the examples are executable, you can run it locally as follows:
The doctest is a good way to make sure docs stay current and examples remain executable, which can be run locally as follows:
```bash
# Run doctest
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
```
This will reproduce the same environment as the CI: [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).
This will reproduce the same environment as the CI. See [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).


@@ -2,7 +2,7 @@
This document will guide you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
## 1. Online serving
## 1. Online server
You can run docker container to start the vLLM server on a single NPU:
@@ -31,7 +31,7 @@ docker run --rm \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If your service start successfully, you can see the info shown below:
If the vLLM server has started successfully, you can see the information shown below:
```
INFO: Started server process [6873]
@@ -39,7 +39,7 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts in new terminal:
Once your server is started, you can query the model with input prompts in a new terminal:
```
curl http://localhost:8000/v1/completions \
@@ -54,7 +54,7 @@ curl http://localhost:8000/v1/completions \
## 2. Install EvalScope using pip
You can install EvalScope by using:
You can install EvalScope as follows:
```bash
python3 -m venv .venv-evalscope
@@ -62,9 +62,9 @@ source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```
## 3. Run gsm8k accuracy test using EvalScope
## 3. Run GSM8K using EvalScope for accuracy testing
You can `evalscope eval` run gsm8k accuracy test:
You can use `evalscope eval` to run GSM8K for accuracy testing:
```
evalscope eval \
@@ -76,7 +76,7 @@ evalscope eval \
--limit 10
```
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
@@ -86,7 +86,7 @@ After 1-2 mins, the output is as shown below:
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
See more detail in: [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
See more detail in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope
@@ -98,7 +98,7 @@ pip install evalscope[perf] -U
### Basic usage
You can use `evalscope perf` run perf test:
You can use `evalscope perf` to run perf testing:
```
evalscope perf \
@@ -113,7 +113,7 @@ evalscope perf \
### Output results
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```shell
Benchmarking summary:
@@ -172,4 +172,4 @@ Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
See more detail in: [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).
See more detail in [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).


@@ -1,8 +1,8 @@
# Using lm-eval
This document will guide you have a accuracy testing using [lm-eval][1].
This document guides you through accuracy testing using [lm-eval][1].
## Online Server
### 1. start the vLLM server
### 1. Start the vLLM server
You can run docker container to start the vLLM server on a single NPU:
```{code-block} bash
@@ -31,7 +31,7 @@ docker run --rm \
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```
Started the vLLM server successfully,if you see log as below:
The vLLM server has started successfully if you see logs like the following:
```
INFO: Started server process [9446]
@@ -39,9 +39,9 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
### 2. Run gsm8k accuracy test using lm-eval
### 2. Run GSM8K using lm-eval for accuracy testing
You can query result with input prompts:
You can query the result with input prompts:
```
curl http://localhost:8000/v1/completions \
@@ -98,7 +98,7 @@ The output format matches the following:
}
```
Install lm-eval in the container.
Install lm-eval in the container:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
@@ -116,7 +116,7 @@ lm_eval \
--output_path ./
```
After 30 mins, the output is as shown below:
After 30 minutes, the output is as shown below:
```
The markdown format results is as below:
@@ -158,8 +158,8 @@ docker run --rm \
/bin/bash
```
### 2. Run gsm8k accuracy test using lm-eval
Install lm-eval in the container.
### 2. Run GSM8K using lm-eval for accuracy testing
Install lm-eval in the container:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
@@ -177,7 +177,7 @@ lm_eval \
--batch_size auto
```
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```
The markdown format results is as below:
@@ -189,9 +189,9 @@ Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
```
## Use offline Datasets
## Use Offline Datasets
Take gsm8k(single dataset) and mmlu(multi-subject dataset) as examples, and you can see more from [here][2].
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples; you can see more from [here][2].
```bash
# set HF_DATASETS_OFFLINE when using offline datasets
@@ -205,7 +205,7 @@ cd lm_eval/tasks/gsm8k
cd lm_eval/tasks/mmlu/default
```
set [gsm8k.yaml][3] as follows:
Set [gsm8k.yaml][3] as follows:
```yaml
tag:
@@ -230,7 +230,7 @@ training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}
A(Please follow the summarize the result at the end with the format of "The answer is xxx", where xx is the result.):'
A(Please follow the summarized result at the end with the format of "The answer is xxx", where xxx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
- metric: exact_match
@@ -268,7 +268,7 @@ metadata:
version: 3.0
```
set [_default_template_yaml][4] as follows:
Set [_default_template_yaml][4] as follows:
```yaml
# set dataset_path according to the downloaded dataset


@@ -1,7 +1,7 @@
# Using OpenCompass
This document will guide you have a accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
This document guides you through accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
## 1. Online Serving
## 1. Online Server
You can run docker container to start the vLLM server on a single NPU:
@@ -30,7 +30,7 @@ docker run --rm \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If your service start successfully, you can see the info shown below:
The vLLM server has started successfully if you see information like the following:
```
INFO: Started server process [6873]
@@ -38,7 +38,7 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts in new terminal:
Once your server is started, you can query the model with input prompts in a new terminal:
```
curl http://localhost:8000/v1/completions \
@@ -51,8 +51,8 @@ curl http://localhost:8000/v1/completions \
}'
```
## 2. Run ceval accuracy test using OpenCompass
Install OpenCompass and configure the environment variables in the container.
## 2. Run C-Eval using OpenCompass for accuracy testing
Install OpenCompass and configure the environment variables in the container:
```bash
# Pin Python 3.10 due to:
@@ -64,7 +64,7 @@ export DATASET_SOURCE=ModelScope
git clone https://github.com/open-compass/opencompass.git
```
Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following content:
Add the following content to `opencompass/configs/eval_vllm_ascend_demo.py`:
```python
from mmengine.config import read_base
@@ -110,7 +110,7 @@ Run the following command:
python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
```
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```
The markdown format results is as below:


@@ -1,11 +1,11 @@
# Prepare inputs for model forwarding
## Purpose
What information should we have in order to perform model forward pass?
Information required to perform a model forward pass:
- the inputs
- the corresponding attention metadata of the inputs
The following diagram shows what we should prepare for the model inference.
The following diagram shows what we should prepare for model inference.
```
+---------------+
@@ -17,47 +17,47 @@ attn_meta --> | |
Therefore, as long as we have these two pieces of information mentioned above, we can perform the model's forward propagation.
This article will explain **how we obtain the inputs and their corresponding attention metadata** which are on the left part of above diagram.
This document will explain **how we obtain the inputs and their corresponding attention metadata**.
## Overview
### 1. Obtain inputs
The workflow of obtain inputs:
1. Get `token positions`: The relative position of each token within its request sequence.
The workflow of obtaining inputs:
1. Get `token positions`: relative position of each token within its request sequence.
2. Get `token indices`: the index of each scheduled token in the token table.
2. Get `token indices`: index of each scheduled token in the token table.
3. Get `Token IDs`: Using token indices to retrieve the Token IDs from **token id table**.
3. Get `Token IDs`: using token indices to retrieve the Token IDs from **token id table**.
At last, these `Token IDs` required to feed into the model, and also, `positions` should be send into model to create `Rope` (Rotary positional embedding). Both of them are the inputs of a model.
At last, these `Token IDs` are fed into the model, and `positions` are also sent into the model to create `Rope` (Rotary positional embedding). Both of them are the inputs of the model.
**Note**: The `Token IDs` are the inputs of a model, so we also call them `Inputs IDs`.
**Note**: because the `Token IDs` is the inputs of the model, so we will call it `Inputs IDs`
### 2. Build inputs attention metadata
The model requires these attention metadata during the forward pass:
- `query start location`: represents the start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: the length of each request including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: the number of computed tokens for each request.
- `number of requests`: the number of requests in this batch.
- `number of tokens`: Total number of scheduled tokens in this batch.
A model requires these attention metadata during the forward pass:
- `query start location`: start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: length of each request including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: number of computed tokens for each request.
- `number of requests`: number of requests in this batch.
- `number of tokens`: total number of scheduled tokens in this batch.
- **`block table`**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory.
- `max query len`: the longest scheduled tokens length in this requests batch.
- `slot mapping`: the indices of each token that input token will be stored into.
- `attention mask`: The mask matrix applied to attention scores before softmax to control which tokens can attend to each other. (usually a causal attention)
- `max query len`: the longest scheduled token count among requests in this batch.
- `slot mapping`: the slot index that each input token will be stored into.
- `attention mask`: mask matrix applied to attention scores before softmax to control which tokens can attend to each other (usually a causal attention).
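As an illustration only, the metadata above can be grouped into a simple container. The field names below are hypothetical and do not mirror vllm-ascend's actual classes; the values are just placeholders:

```python
from dataclasses import dataclass

# Hypothetical container for the attention metadata listed above.
# Field names are illustrative, not vllm-ascend's real API.
@dataclass
class AttnMetadataSketch:
    query_start_loc: list      # num_reqs + 1 entries: start/end of each request's tokens
    seq_lens: list             # per request: computed + newly scheduled tokens
    num_computed_tokens: list  # per request
    num_reqs: int
    num_tokens: int            # total scheduled tokens in this batch
    block_table: list          # (max num request) rows x (max model len // block size) cols
    max_query_len: int
    slot_mapping: list         # slot each input token is stored into
    attn_mask: list            # applied to attention scores before softmax

meta = AttnMetadataSketch(
    query_start_loc=[0, 3, 5, 10],
    seq_lens=[3, 2, 5],
    num_computed_tokens=[0, 0, 0],
    num_reqs=3,
    num_tokens=10,
    block_table=[[0] * 6 for _ in range(3)],
    max_query_len=5,
    slot_mapping=[0] * 10,
    # simple causal mask: token i may attend to tokens j <= i
    attn_mask=[[int(j <= i) for j in range(5)] for i in range(5)],
)
```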
## Before start
There are mainly three types of variables.
- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens
- request level: represents one attribute of each scheduled request, which length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element)
- request level: represents one attribute of each scheduled request, whose length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element)
- system level:
1. **Token IDs table**: store the token ids (i.e. the inputs of the model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is maximum count of concurrent requests allowed in a forward batch and `max model len` is the max token count can be handled at one request sequence in this model.
1. **Token IDs table**: stores the token IDs (i.e. the inputs of a model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is the maximum count of concurrent requests allowed in a forward batch and `max model len` is the maximum token count that can be handled at one request sequence in this model.
2. **Block table**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory. The shape of this table is `(max num request, max model len / block size)`
**Note**: How were these two tables formed?
- Both of them are come from the `_update_states` method before **prepare inputs**. You can take a look if you need more inspiration.
**Note**: Both of these tables come from the `_update_states` method, which runs before **preparing inputs**. You can take a look if you need more inspiration.
### Tips
What is `Token ID`?
For simple, a `token ID` is an **integer** (usually `int32`), which represents a token.
example of `Token ID`:
Simply put, a `token ID` is an **integer** (usually `int32`), which represents a token.
Example of `Token ID`:
```
| Token ID | Token |
@@ -77,30 +77,32 @@ example of `Token ID`:
```
## Go through details
Make a simple example, assumption:
- max tokens can be scheduled at once: 10.
Assumptions:
- maximum number of tokens that can be scheduled at once: 10
- `block size`: 2
- In total, 3 requests are scheduled. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the max token count can be handled at one request sequence in this model).
- `max model length`: 12 (the maximum token count that can be handled in one request sequence by a model).
These assumption are configured in the beginning when starting the vllm. They are not fixed, so you can manually set them.
These assumptions are configured in the beginning when starting vLLM. They are not fixed, so you can manually set them.
### Step 1: All requests in the prefill phase
#### Obtain inputs
Due to the max schedule token count limitation is 10, The scheduled token of each request: `{'0': 3, '1': 2, '2': 5}`. Note that the `request_2` is in chunked prefill, still has 3 prompt tokens not be scheduled.
As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill and still has 3 prompt tokens unscheduled.
##### 1. Get token positions:
First, find out each token belong to which request: the 0~2 tokens belong to request_0, 3~4 tokens belong to request_1 and 5~9 tokens belong to request_2. So, we can use `request indices` to point out each token belongs to which request. `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`
First, determine which request each token belongs to: tokens 0-2 are assigned to **request_0**, tokens 3-4 to **request_1**, and tokens 5-9 to **request_2**. To represent this mapping, we use `request indices`, for example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`.
For each request, use **the number of tokens already computed** + **the relative position in current scheduled tokens**: `request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]` and then concat them together: `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. Note: there is more efficient way (using `request indices`) to create positions in actual code.
For each request, use **the number of computed tokens** + **the relative position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`) and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).
Finally, `token opsitions` is `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**
Note: there is a more efficient way (using `request indices`) to create positions in the actual code.
Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
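As a minimal sketch in plain Python (variable names are illustrative, not the actual code), the `request indices` approach to building positions looks like:

```python
# Per-request counts: tokens scheduled this step and tokens already computed
num_scheduled = [3, 2, 5]
num_computed = [0, 0, 0]

# request indices: which request each scheduled token belongs to
request_indices = [r for r, n in enumerate(num_scheduled) for _ in range(n)]
# -> [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]

# token positions: computed token count of the owning request plus the
# relative position inside that request's scheduled chunk
positions = [num_computed[r] + i for r, n in enumerate(num_scheduled) for i in range(n)]
# -> [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]
```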
##### 2. Get token indices:
Current **Token IDs table**, which shape is `(max num request, max model len)`.
The shape of the current **Token IDs table** is `(max num request, max model len)`.
Why these `T_3_5`, `T_3_6`, `T_3_7` are in this table even them are not scheduled this time?
- We will fill all Token IDs in one request sequence to this table at once, but we only retrieve the tokens we scheduled this time. Then we will retrieve the remain Token IDs next time.
Why are these `T_3_5`, `T_3_6`, `T_3_7` in this table even though they are not scheduled this time?
- We fill all Token IDs of one request sequence into this table at once, but we only retrieve the tokens scheduled this time. The remaining Token IDs are retrieved next time.
```
| T_0_0 | T_0_1 | T_0_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
@@ -112,26 +114,24 @@ Why these `T_3_5`, `T_3_6`, `T_3_7` are in this table even them are not schedule
......
```
Note that the `T_x_x` is an `int32`
Note that `T_x_x` is an `int32`.
Let's say `M = max model len`, Then we can use `token positions` together with the `request indices` of each token to construct `token indices`.
Let's say `M = max model len`. Then we can use `token positions` together with `request indices` of each token to construct `token indices`.
So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
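This computation can be sketched in a few lines of plain Python (illustrative only, treating the Token IDs table as a flat array with row stride `M`):

```python
M = 12  # max model len, the row stride of the Token IDs table
request_indices = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
positions = [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]

# flat index of each scheduled token in the (max num request, max model len) table
token_indices = [p + r * M for p, r in zip(positions, request_indices)]
# -> [0, 1, 2, 12, 13, 24, 25, 26, 27, 28]
```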
##### 3. Retrieve the Token IDs
As mentioned before, we will refer to these `Token IDs` as `Input IDs`.
We use the `token indices` to select out the corresponding `Input IDs` from the token table, The Pseudocode like:
We use `token indices` to select out the corresponding `Input IDs` from the token table. The pseudocode is as follows:
```
input_ids = token_table[token_indices]
```
As mentioned before, we will refer these Token IDs as Inputs IDs:
As mentioned before, we refer to these `Token IDs` as `Input IDs`.
- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, T_3_3, T_3_4]`
#### Build inputs attention metadata
Current **Block Table**, we use the first block (i.e. block_0) to mark the unused block. The shape of the block is `(max num request, max model len / block size)`, the `max model len / block size = 12 / 2 = 6`
In the current **Block Table**, we use the first block (i.e. block_0) to mark unused blocks. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.
```
| 1 | 2 | 0 | 0 | 0 | 0 |
@@ -143,42 +143,53 @@ Current **Block Table**, we use the first block (i.e. block_0) to mark the unuse
......
```
The kv cache block in the device memory is like:
The KV cache block in the device memory is like:
```
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
Let's say `K = max model len / block size = 6`, we can get token `device block number` from
Let's say `K = max model len / block size = 6`; we can then get each token's `device block number`.
The workflow of building the slot mapping:
1. get `block table indices` using `K`, `positions` and `request indices`. Purpose: For each token, it could be used to select the `device block number` from `block table`.
2. get `device block number` using `block table indices`. Purpose: `device block number` indicates each token belong to which device block.
3. get `block offsets` using `positions` and `block size`. Purpose: `block offsets` indicates the offsets of each token within a block.
4. construct `slot mapping` using `device block number` and `block offsets`. Purpose: we can use `slot mapping` to store the Token IDs into token slots.
1. Get `block table indices` using `K`, `positions` and `request indices`.
Purpose: For each token, it could be used to select `device block number` from `block table`.
2. Get `device block number` using `block table indices`.
Purpose: `device block number` indicates which device block each token belongs to.
3. Get `block offsets` using `positions` and `block size`.
Purpose: `block offsets` indicates the offsets of each token within a block.
4. Construct `slot mapping` using `device block number` and `block offsets`.
Purpose: we can use `slot mapping` to store Token IDs into token slots.
Details:
1. Using a simple formula to calculate the `block table indices`: `request indices * K + positions / block size`. So it equal to `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to select the `device block number` from `block table`. **token level**
2. Using the `block table indices` to select out the `device block number` for each scheduled token. The Pseudocode like: `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`**token level**
3. `block offsets` could be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`. **token level**
1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. It equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`, which can be used to select `device block number` from `block table`.
2. (**Token level**) Use `block table indices` to select out `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`.
3. (**Token level**) `block offsets` could be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
4. At last, use `block offsets` and `device block number` to create `slot mapping`: `device block number * block size + block_offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`.
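The four steps above can be sketched in plain Python. The block table rows below follow this example's values; names are illustrative, not the actual implementation:

```python
block_size = 2
K = 6  # max model len / block size
request_indices = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
positions = [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]
# Block table rows for the 3 requests (block 0 marks an unused slot)
block_table = [
    [1, 2, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0],
    [4, 5, 6, 0, 0, 0],
]
flat_table = [b for row in block_table for b in row]

# 1. flat indices into the block table, one per scheduled token
block_table_indices = [r * K + p // block_size for r, p in zip(request_indices, positions)]
# 2. device block number of each token
block_numbers = [flat_table[i] for i in block_table_indices]
# 3. offset of each token within its block
block_offsets = [p % block_size for p in positions]
# 4. slot mapping: where each input token's KV entry is stored
slot_mapping = [n * block_size + o for n, o in zip(block_numbers, block_offsets)]
# -> [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
```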
First, we know the scheduled token count is `[3, 2, 5]` **request level**
(**Request level**) As we know the scheduled token count is `[3, 2, 5]`:
- (**Request level**) Use prefix sum to calculate `query start location`: `[0, 3, 5, 10]`.
- (**Request level**) All tokens in step 1 are in the prefill stage and the computed token count is 0, so `sequence length` = `[3, 2, 5]`.
- (**Request level**) As mentioned above, `number of computed tokens` are all 0s: `[0, 0, 0]`.
- `number of requests`: `3`
- (**Request level**) `number of tokens`: `[3, 2, 5]`
- `max query len`: `5`
- (**Token level**) `slot mapping`: `[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`
- `attention mask`: Since all requests are doing prefill, we simply create one mask matrix and reuse it across requests. The shape of this mask matrix is `5 * 5`.
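The token-level and request-level bookkeeping above can be sketched in plain Python. This is a minimal illustration of the formulas in this walkthrough (variable names are ours; `K = 6` is the block-table width and the block size is `2`, as in the running example):

```python
# Per-request scheduled token counts for step 1 (all prefill).
scheduled = [3, 2, 5]
block_size = 2
K = 6  # number of block slots per request in the block table

# Flattened block table rows for request_0/1/2, as shown above.
block_table = [1, 2, 0, 0, 0, 0,
               3, 0, 0, 0, 0, 0,
               4, 5, 6, 0, 0, 0]

# Token-level request indices and positions.
request_indices = [r for r, n in enumerate(scheduled) for _ in range(n)]
positions = [p for n in scheduled for p in range(n)]

# 1. block table indices = request indices * K + positions // block size
block_table_indices = [r * K + p // block_size
                       for r, p in zip(request_indices, positions)]
# 2. gather the device block number for each scheduled token
block_numbers = [block_table[i] for i in block_table_indices]
# 3. block offsets = positions % block size
block_offsets = [p % block_size for p in positions]
# 4. slot mapping = device block number * block size + block offsets
slot_mapping = [b * block_size + o
                for b, o in zip(block_numbers, block_offsets)]

# Request-level metadata: query start location is the prefix sum of counts.
query_start_loc = [0]
for n in scheduled:
    query_start_loc.append(query_start_loc[-1] + n)

# One shared lower-triangular mask for all prefill requests (0/1 sketch;
# real implementations usually use float masks with -inf instead).
max_query_len = max(scheduled)
attention_mask = [[1 if col <= row else 0 for col in range(max_query_len)]
                  for row in range(max_query_len)]

print(block_table_indices)  # [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]
print(block_numbers)        # [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]
print(slot_mapping)         # [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
print(query_start_loc)      # [0, 3, 5, 10]
```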
### Step 2: Chunked prefill
In Step 2, we no longer provide explanations or perform calculations; instead, we directly present the final result.
#### Obtain inputs
Scheduled tokens of each request: `{'0': 1, '1': 1, '2': 3}`
1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`
Current **Token IDs table**:

```
......
```
**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and **request_1** respectively. They are sampled from the output of the model.
3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`
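These step-2 inputs can be derived mechanically from the scheduler output and the per-request computed-token counts. A sketch in plain Python (the `MAX_MODEL_LEN = 12` row stride of the Token IDs table is our assumption; it matches the indices in this example):

```python
# Scheduler output for step 2, and tokens already computed in step 1.
scheduled = {0: 1, 1: 1, 2: 3}
computed = {0: 3, 1: 2, 2: 5}

request_indices, token_positions = [], []
for req, n in scheduled.items():
    # Each request continues from its number of computed tokens.
    for pos in range(computed[req], computed[req] + n):
        request_indices.append(req)
        token_positions.append(pos)

# Flat index into the Token IDs table, assuming each row of the table
# holds MAX_MODEL_LEN = 12 token slots.
MAX_MODEL_LEN = 12
token_indices = [r * MAX_MODEL_LEN + p
                 for r, p in zip(request_indices, token_positions)]

print(request_indices)  # [0, 1, 2, 2, 2]
print(token_positions)  # [3, 2, 5, 6, 7]
print(token_indices)    # [3, 14, 29, 30, 31]
```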
#### Build inputs attention metadata
We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, as they need more space on the device to store KV cache following token generation or chunked prefill.
Current **Block Table**:
```
| 1 | 2 | 0 | 0 | 0 | 0 |
......
```
KV cache block in the device memory:
```
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
1. (**Token level**) `block table indices`: `[1, 7, 14, 15, 15]`
2. (**Token level**) `device block number`: `[2, 7, 6, 8, 8]`
3. (**Token level**) `block offsets`: `[1, 0, 1, 0, 1]`
4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`
Scheduled token count: `[1, 1, 3]`
- `query start location`: `[0, 1, 2, 5]`
- `sequence length`: `[4, 3, 8]`
- `number of computed tokens`: `[3, 2, 5]`
- `number of requests`: `3`
- `max query len`: `3`
- `slot mapping`: `[5, 14, 13, 16, 17]`
- `attention mask`: `5 * 8`. Each token has a `1 * 8` vector, and there are 5 scheduled tokens.
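The step-2 numbers can be reproduced with the same formulas as step 1. A plain-Python sketch (the 0/1 mask layout is illustrative; the padded width of 8 is the max sequence length in this step):

```python
block_size, K = 2, 6
# Block table after blocks 7 and 8 were allocated to request_1 and request_2.
block_table = [1, 2, 0, 0, 0, 0,
               3, 7, 0, 0, 0, 0,
               4, 5, 6, 8, 0, 0]
request_indices = [0, 1, 2, 2, 2]
positions = [3, 2, 5, 6, 7]

# Same token-level formulas as in step 1.
block_table_indices = [r * K + p // block_size
                       for r, p in zip(request_indices, positions)]
block_numbers = [block_table[i] for i in block_table_indices]
block_offsets = [p % block_size for p in positions]
slot_mapping = [b * block_size + o
                for b, o in zip(block_numbers, block_offsets)]

# Causal mask: a token at absolute position p attends to positions 0..p of
# its own sequence; each of the 5 scheduled tokens gets a 1 * 8 row.
max_seq_len = 8
attention_mask = [[1 if col <= p else 0 for col in range(max_seq_len)]
                  for p in positions]

print(block_table_indices)  # [1, 7, 14, 15, 15]
print(block_numbers)        # [2, 7, 6, 8, 8]
print(slot_mapping)         # [5, 14, 13, 16, 17]
```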
## Summary
If you understand step 1 and step 2, you will understand all the following steps.
Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, contributions are welcome.

# Patch in vLLM Ascend
vLLM Ascend is a platform plugin for vLLM. Because vLLM and vLLM Ascend have different release cycles, and because of hardware limitations in some cases, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to adapt to changes in vLLM.
## Principle
We should keep in mind that patching is not the best way to make vLLM Ascend compatible. It's just a temporary solution. The best way is to contribute the change to vLLM so that it is compatible with vLLM Ascend natively. In vLLM Ascend, we have basic principles for the patch strategy:
1. Less is more. Please do not patch unless it's the only way currently.
2. Once a patch is added, it's required to describe the future plan for removing the patch.
3. Cleaning up patch code is always welcome.
## How it works
```
vllm_ascend
└── patch
    ├── platform
    │   └── patch_common
    └── worker
        └── patch_common
```
- **platform**: The patch code in this directory is for patching the code in vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
- For online mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in vLLM worker process. It's called by `vllm_ascend/worker/worker_v1::NPUWorker::__init__` when the vLLM worker process is initialized.
- For both online and offline mode, vLLM engine core process calls the worker patch in `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
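Under the hood, both directories apply ordinary Python monkey-patching at import time: importing a patch module rebinds an attribute on the target vLLM module before it is used. A minimal, self-contained sketch of the pattern (using a stand-in namespace instead of a real vLLM module):

```python
import types

# Stand-in for a vLLM module whose behavior needs adjusting on Ascend.
parallel_state = types.SimpleNamespace(
    destroy_model_parallel=lambda: "original behavior")

# The patch module defines a replacement...
def patch_destroy_model_parallel():
    # Ascend-specific cleanup would go here.
    return "patched behavior"

# ...and rebinds the attribute when the patch module is imported.
parallel_state.destroy_model_parallel = patch_destroy_model_parallel

print(parallel_state.destroy_model_parallel())  # patched behavior
```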
## How to write a patch
Before writing a patch, following the principle above, we should patch as little code as possible. If necessary, we can patch the code in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both `0.10.0` and `main` of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
import vllm.distributed.parallel_state


def patch_destroy_model_parallel():
    # Ascend-specific replacement logic goes here
    # (the full body is omitted in this example).
    ...


vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```
......
```
7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md)
## Limitation
1. In V1 Engine, vLLM starts three kinds of processes: the main process, the EngineCore process, and the Worker process. Currently, vLLM Ascend can only patch the code in the main process and the Worker process by default. If you want to patch code running in the EngineCore process, you should patch the EngineCore process entirely during setup; the entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run an edited vLLM based on v0.9.n, the version of vLLM may change to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. You can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, and then the patch for v0.9.n should work as expected.

# Adding a New Model

This guide demonstrates how to integrate a novel or customized model into vllm-ascend.
## Step 1: Implementing Models with `torch` and `torch_npu`
This section provides instructions for implementing new models compatible with vLLM and vllm-ascend.
**Before starting:**
- Verify whether your model already exists in vLLM's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Use existing models' implementation as templates to accelerate your development.
### Method 1: Implementing New Models from Scratch
Follow vLLM's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
**Key implementation requirements:**
1. Place model files in `vllm_ascend/models/` directory.
2. Standard module structure for decoder-only LLMs (please check out vLLM's implementations for other kinds of models):
- `*ModelForCausalLM` (top-level wrapper)
- `*Model` (main architecture)
`*` denotes your model's unique identifier.
:::
3. Critical implementation details:
All modules must include a `prefix` argument in `__init__()`.
| Module | Required interfaces |
|---|---|
| `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` |
| `*Model` | `get_input_embeddings`, `load_weights` |
4. Attention backend integration:
Importing attention via `from vllm.attention import Attention` can automatically leverage the attention backend routing of vllm-ascend (see: `get_attn_backend_cls()` in `vllm_ascend/platform.py`).
5. Tensor parallelism:
Use vLLM's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).
**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):
### Method 2: Customizing Existing vLLM Models
For most use cases, extending existing implementations is preferable. Below we demonstrate how to inherit from base classes and implement a custom DeepSeek model (assumed path: `vllm_ascend/models/deepseek_v2.py`).
```python
from typing import List, Optional
class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
    ...
```
:::{note}
For a complete implementation reference, see `vllm_ascend/models/deepseek_v2.py`.
:::
## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM
vLLM provides a plugin mechanism for registering externally implemented models without modifying the codebase.
To integrate your implemented model from `vllm_ascend/models/` directory:
The first argument of `vllm.ModelRegistry.register_model()` indicates the unique name of the model architecture.
## Step 3: Verification
### Case 1: Overriding Existing vLLM Model Architectures
If you're registering a customized model architecture based on vLLM's existing implementation (overriding vLLM's original class), when executing vLLM offline/online inference (using any model), you'll observe warning logs similar to the following output from `vllm/model_executor/models/registry.py`.
```bash
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend/models/deepseek_v2:CustomDeepseekV2ForCausalLM.
```
### Case 2: Registering New Model Architectures
If you're registering a novel model architecture not present in vLLM (creating a completely new class), current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`.
```python
logger.info(f"model_arch: {model_arch} has been registered here!")
```
After adding this line, you will see confirmation logs shown below when running vLLM offline/online inference (using any model).
```bash
model_arch: CustomModelForCausalLM has been registered here!
```
This log output confirms your novel model architecture has been successfully registered in vLLM.
## Step 4: Testing
After adding a new model, we should do basic functional test (offline/online inference), accuracy test, and performance benchmark for the model.
Find more details at:

# Adding a New Multimodal Model
**_Coming soon ..._**

# Optimization and Tuning
This guide aims to help users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and more. Any feedback is welcome.
## Preparation
```bash
VLLM_USE_MODELSCOPE=true
```
Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vLLM, vllm-ascend, and MindIE Turbo are installed correctly.
:::{note}
Make sure your vLLM and vllm-ascend are installed after your python configuration is completed, because these packages build binary files using the python in the current environment. If you install vLLM, vllm-ascend, and MindIE Turbo before completing section 1.1, the binary files will not use the optimized python.
:::
## Optimizations
#### 1.1. Install optimized `python`
Python has supported **LTO** and **PGO** optimization since version `3.6`; both can be enabled at compile time. For convenience, we offer compilation-optimized `python` packages directly to users. You can also reproduce the `python` build by following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenario.
```{code-block} bash
:substitutions:
@@ -101,7 +101,7 @@ export PATH=/usr/bin:/usr/local/python/bin:$PATH
#### 2.1. jemalloc
**jemalloc** is a memory allocator that improves performance in multi-threaded scenarios and reduces memory fragmentation. jemalloc uses a thread-local memory manager to allocate variables, which avoids lock contention between threads and can hugely improve performance.
Memory optimization:
```{code-block} bash
:substitutions:
# Upper limit of memory block splitting allowed (MB): Setting this parameter can prevent large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"
# When operators on the communication stream have dependencies, they all need to be ended before being released for reuse. The logic of multi-stream reuse is to release the memory on the communication stream in advance so that the computing stream can be reused.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
```
Scheduling optimization:
```{code-block} bash
:substitutions:
# Optimize operator delivery queue. This will affect the memory peak value, and may degrade if the memory is tight.
export TASK_QUEUE_ENABLE=2
# This will greatly improve the CPU bottleneck model and ensure the same performance for the NPU bottleneck model.
export CPU_AFFINITY_CONF=1
```
There are some performance tuning features in HCCL, which are controlled by environment variables.
You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, the communication is scheduled by AI vector core directly with RoCE, instead of being scheduled by AI CPU.
```{code-block} bash
:substitutions:
export HCCL_OP_EXPANSION_MODE="AIV"
```
Plus, there are more features for performance optimization in specific scenarios, which are shown below.
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).

# Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vLLM project.
**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
## 1. Run docker container
```bash
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```
## 3. (Optional) Prepare model weights
For faster running speed, we recommend downloading the model in advance.
Run benchmark script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```
After about 10 mins, the output is shown below:
```bash
online serving:
Total num prompt tokens: 42659
Total num output tokens: 43545
```
The result JSON files are generated in the `benchmark/results` path.
These files contain detailed benchmarking results for further analysis.
The execution duration of each stage (including pre/post-processing, model forward pass, etc.) can be observed as follows:
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py