v0.10.1rc1
docs/source/developer_guide/contribution/index.md
# Contributing

## Building and testing

It's recommended to set up a local development environment to build and test
before you submit a PR.

### Setup development environment

Theoretically, the vllm-ascend build is only supported on Linux because
the `vllm-ascend` dependency `torch_npu` only supports Linux.

But you can still set up a dev environment on Linux/Windows/macOS for linting and basic
testing with the following commands:

#### Run lint locally

```bash
# Choose a base dir (~/vllm-project/) and set up the venv
cd ~/vllm-project/
python3 -m venv .venv
source ./.venv/bin/activate

# Clone vllm-ascend and install
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend

# Install lint requirements and enable the pre-commit hook
pip install -r requirements-lint.txt

# Run lint (the first run installs pre-commit deps, which may require a proxy network)
bash format.sh
```
#### Run CI locally

After completing the "Run lint" setup, you can run the CI locally:

```{code-block} bash
:substitutions:

cd ~/vllm-project/

# Running CI requires vLLM to be installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
VLLM_TARGET_DEVICE="empty" pip install .
cd ..

# Install requirements
cd vllm-ascend
# For Linux:
pip install -r requirements-dev.txt
# For non-Linux:
grep -Ev '^#|^--|^$|^-r' requirements-dev.txt | while read PACKAGE; do pip install "$PACKAGE"; done
grep -Ev '^#|^--|^$|^-r' requirements.txt | while read PACKAGE; do pip install "$PACKAGE"; done

# Run CI:
bash format.sh ci
```
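The `grep` pipeline used for non-Linux hosts skips comment lines, pip option flags (`--...`), blank lines, and `-r` includes. For illustration only, the same filtering can be sketched in Python (a minimal sketch; the regex mirrors the shell pattern above):

```python
import re

# Lines to skip, mirroring the shell pattern '^#|^--|^$|^-r':
# comments, pip option flags, blank lines, and -r includes.
SKIP = re.compile(r"^#|^--|^$|^-r")

def installable_packages(requirements_text: str) -> list[str]:
    """Return only the concrete package specs from a requirements file."""
    return [
        line.strip()
        for line in requirements_text.splitlines()
        if not SKIP.match(line.strip())
    ]

example = """# dev deps
-r requirements.txt
--extra-index-url https://example.invalid/simple

pytest
modelscope>=1.0
"""
print(installable_packages(example))  # ['pytest', 'modelscope>=1.0']
```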

#### Submit the commit

```bash
# Commit changed files using `-s`
git commit -sm "your commit info"
```

🎉 Congratulations! You have completed the development environment setup.

### Test locally

You can refer to the [Testing](./testing.md) doc for help setting up the testing environment and running tests locally.

## DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.

Using `-s` with `git commit` will automatically add this header.
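
As an illustration (a minimal sketch, not part of the project tooling), a commit message carrying a valid DCO trailer can be recognized like this:

```python
import re

# Matches a DCO trailer such as: Signed-off-by: Jane Doe <jane@example.com>
TRAILER = re.compile(r"^Signed-off-by: .+ <[^<>@\s]+@[^<>@\s]+>$", re.MULTILINE)

def has_dco_trailer(commit_message: str) -> bool:
    """Return True if the commit message carries a Signed-off-by trailer."""
    return TRAILER.search(commit_message) is not None

msg = "fix: handle empty input\n\nSigned-off-by: Jane Doe <jane@example.com>"
print(has_dco_trailer(msg))                        # True
print(has_dco_trailer("fix: handle empty input"))  # False
```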

## PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

- `[Attention]` for new features or optimizations in attention.
- `[Communicator]` for new features or optimizations in communicators.
- `[ModelRunner]` for new features or optimizations in the model runner.
- `[Platform]` for new features or optimizations in the platform.
- `[Worker]` for new features or optimizations in the worker.
- `[Core]` for new features or optimizations in the core vllm-ascend logic (such as platform, attention, communicators, model runner).
- `[Kernel]` for changes affecting compute kernels and ops.
- `[Bugfix]` for bug fixes.
- `[Doc]` for documentation fixes and improvements.
- `[Test]` for tests (such as unit tests).
- `[CI]` for build or continuous integration improvements.
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

:::{note}
If the PR spans more than one category, please include all relevant prefixes.
:::
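
For illustration only (a hypothetical helper, not part of the repo tooling), a PR title can be checked against these prefixes before pushing:

```python
import re

# The accepted PR title prefixes listed above.
PREFIXES = [
    "Attention", "Communicator", "ModelRunner", "Platform", "Worker",
    "Core", "Kernel", "Bugfix", "Doc", "Test", "CI", "Misc",
]

def title_prefixes(title: str) -> list[str]:
    """Return the bracketed prefixes at the start of a PR title."""
    return re.findall(r"\[([^\]]+)\]", title.split(" ", 1)[0])

def is_valid_title(title: str) -> bool:
    found = title_prefixes(title)
    return bool(found) and all(p in PREFIXES for p in found)

print(is_valid_title("[Bugfix][Doc] Fix typo in setup guide"))  # True
print(is_valid_title("Fix typo in setup guide"))                # False
```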

## Others

You may find more information about contributing to the vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, feel free to submit a PR to improve the doc and help other developers.

:::{toctree}
:caption: Index
:maxdepth: 1
testing
:::
docs/source/developer_guide/contribution/testing.md
# Testing

This section explains how to write e2e tests and unit tests to verify the implementation of your feature.

## Setup test environment

The fastest way to set up the test environment is to use the main branch container image:

:::::{tab-set}
:sync-group: e2e

::::{tab-item} Local (CPU)
:selected:
:sync: cpu

You can run the unit tests on CPU with the following steps:

```{code-block} bash
:substitutions:

cd ~/vllm-project/
# ls
# vllm vllm-ascend

# Use a mirror to speed up the download
# docker pull quay.nju.edu.cn/ascend/cann:|cann_image_tag|
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm --name vllm-ascend-ut \
    -v $(pwd):/vllm-project \
    -v ~/.cache:/root/.cache \
    -ti $IMAGE bash

# (Optional) Configure a mirror to speed up the download
sed -i 's|ports.ubuntu.com|mirrors.huaweicloud.com|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple/

# For the torch-npu dev version or x86 machines
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"

apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2

# Install vllm
cd /vllm-project/vllm
VLLM_TARGET_DEVICE=empty python3 -m pip -v install .

# Install vllm-ascend
cd /vllm-project/vllm-ascend
# [IMPORTANT] Export LD_LIBRARY_PATH so the CANN environment can be found when running on CPU
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
python3 -m pip install -r requirements-dev.txt
python3 -m pip install -v .
```

::::

::::{tab-item} Single card
:sync: single

```{code-block} bash
:substitutions:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
    --name vllm-ascend \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
```

After starting the container, you should install the required packages:

```bash
# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install required packages
pip install -r requirements-dev.txt
```

::::

::::{tab-item} Multi cards
:sync: multi

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
    --name vllm-ascend \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
```

After starting the container, you should install the required packages:

```bash
cd /vllm-workspace/vllm-ascend/

# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install required packages
pip install -r requirements-dev.txt
```

::::

:::::

## Running tests

### Unit test

There are several principles to follow when writing unit tests:

- The test file path should mirror the source file path and start with the `test_` prefix, e.g. `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend tests use the unittest framework; see [here](https://docs.python.org/3/library/unittest.html#module-unittest) to learn how to write unit tests.
- All unit tests can be run on CPU, so you must mock device-related functions to run on the host.
  - Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:

:::::{tab-set}
:sync-group: e2e

::::{tab-item} Local (CPU)
:selected:
:sync: cpu

```bash
# Run unit tests
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
```

::::

::::{tab-item} Single card
:sync: single

```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut

# Run a single test
pytest -sv tests/ut/test_ascend_config.py
```

::::

::::{tab-item} Multi cards
:sync: multi

```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut

# Run a single test
pytest -sv tests/ut/test_ascend_config.py
```

::::

:::::
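
Following the principles above, a device-dependent function can be mocked so the test runs on CPU. Below is a minimal, self-contained sketch; the function names are illustrative, not real vllm-ascend APIs:

```python
import unittest
from unittest import mock

# Illustrative "source" function: in a real test this would live in a
# vllm_ascend module and query the NPU; here it raises to prove the
# tests never touch a real device.
def get_device_count() -> int:
    raise RuntimeError("requires an NPU device")

def pick_parallel_size() -> int:
    """Choose a parallel size from the number of visible devices."""
    return max(1, get_device_count())

class TestPickParallelSize(unittest.TestCase):
    # Patch the name in the module where it is looked up, so the test runs on CPU.
    @mock.patch(f"{__name__}.get_device_count", return_value=4)
    def test_uses_all_devices(self, _mocked):
        self.assertEqual(pick_parallel_size(), 4)

    @mock.patch(f"{__name__}.get_device_count", return_value=0)
    def test_falls_back_to_one(self, _mocked):
        self.assertEqual(pick_parallel_size(), 1)
```

Saved under `tests/ut/` with a `test_` prefix, such a file is picked up by `pytest -sv tests/ut`.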

### E2E test

Although the vllm-ascend CI already runs the [e2e test](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend hardware, you can also run it locally.

:::::{tab-set}
:sync-group: e2e

::::{tab-item} Local (CPU)
:sync: cpu

You can't run the e2e test on CPU.
::::

::::{tab-item} Single card
:selected:
:sync: single

```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/

# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py

# Run a certain case in a test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
```

::::

::::{tab-item} Multi cards
:sync: multi

```bash
cd /vllm-workspace/vllm-ascend/
# Run all multi card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/

# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_batchsize.py

# Run a certain case in a test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
```

::::

:::::

This will reproduce the e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).

#### E2E test examples

- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
- Online test example: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
- Reduced-layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)

CI resources are limited, so you might need to reduce the number of layers of the model. Below is an example of how to generate a reduced-layer model:

1. Fork the original model repo on ModelScope; we need all the files in the repo except for the weights.
2. Set `num_hidden_layers` to the expected number of layers, e.g., `{"num_hidden_layers": 2,}`
3. Copy the following Python script as `generate_random_weight.py`. Set the relevant parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and `DIST_MODEL_PATH` as needed:

```python
import os

import torch
from transformers import AutoConfig

# modeling_deepseek.py comes from the forked model repo (trust_remote_code)
from modeling_deepseek import DeepseekV3ForCausalLM

MODEL_LOCAL_PATH = "~/.cache/modelscope/models/vllm-ascend/DeepSeek-V3-Pruning"
DIST_DTYPE = torch.bfloat16
DIST_MODEL_PATH = "./random_deepseek_v3_with_2_hidden_layer"

# Build the model from the config only; the weights are randomly initialized
config = AutoConfig.from_pretrained(os.path.expanduser(MODEL_LOCAL_PATH), trust_remote_code=True)
model = DeepseekV3ForCausalLM(config)
model = model.to(DIST_DTYPE)
model.save_pretrained(DIST_MODEL_PATH)
```

### Run doctest

vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` command to run all doctests in the doc files.
The doctest is a good way to make sure the docs are up to date and the examples are executable; you can run it locally as follows:

```bash
# Run doctest
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
```

This will reproduce the same environment as the CI: [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).
# Accuracy Report

:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
:::
docs/source/developer_guide/evaluation/index.md
# Accuracy

:::{toctree}
:caption: Accuracy
:maxdepth: 1
using_evalscope
using_lm_eval
using_opencompass
accuracy_report/index
:::
docs/source/developer_guide/evaluation/using_evalscope.md
# Using EvalScope

This document will guide you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).

## 1. Online serving

You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

If your service starts successfully, you will see info like the following:

```
INFO:     Started server process [6873]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

Once your server is started, you can query the model with input prompts in a new terminal:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
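
The same request can be issued from Python. This sketch mirrors the curl example above and assumes the server is running on `localhost:8000`; only the payload construction runs without a server:

```python
import json
from urllib import request

API_URL = "http://localhost:8000/v1/completions"

def build_completion_payload(model: str, prompt: str,
                             max_tokens: int = 7, temperature: float = 0.0) -> dict:
    """Build the JSON body for the /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(payload: dict) -> dict:
    """POST the payload to the server and return the parsed response."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_completion_payload("Qwen/Qwen2.5-7B-Instruct", "The future of AI is")
print(json.dumps(payload))
# complete(payload)  # uncomment with the server running
```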

## 2. Install EvalScope using pip

You can install EvalScope with:

```bash
python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```

## 3. Run gsm8k accuracy test using EvalScope

You can run the gsm8k accuracy test with `evalscope eval`:

```bash
evalscope eval \
    --model Qwen/Qwen2.5-7B-Instruct \
    --api-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --eval-type service \
    --datasets gsm8k \
    --limit 10
```

After 1-2 mins, the output is as shown below:

```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```

See more detail in: [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).

## 4. Run model inference stress testing using EvalScope

### Install EvalScope[perf] using pip

```shell
pip install evalscope[perf] -U
```

### Basic usage

You can run a perf test with `evalscope perf`:

```bash
evalscope perf \
    --url "http://localhost:8000/v1/chat/completions" \
    --parallel 5 \
    --model Qwen/Qwen2.5-7B-Instruct \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream
```

### Output results

After 1-2 mins, the output is as shown below:

```shell
Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key                               | Value                                                         |
+===================================+===============================================================+
| Time taken for tests (s)          | 38.3744                                                       |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency             | 5                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Total requests                    | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests                  | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests                   | 0                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s)   | 132.6926                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s)    | 158.8819                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s)        | 0.5212                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s)               | 8.3612                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s)   | 0.1035                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request  | 50.25                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s)       | 0.0324                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request       | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests       | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+

Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| 10%        | 0.0962   | 0.031   | 4.4571      | 42           | 135           | 29.9767              |
| 25%        | 0.0971   | 0.0318  | 6.3509      | 47           | 193           | 30.2157              |
| 50%        | 0.0987   | 0.0321  | 9.3387      | 49           | 285           | 30.3969              |
| 66%        | 0.1017   | 0.0324  | 9.8519      | 52           | 302           | 30.5182              |
| 75%        | 0.107    | 0.0328  | 10.2391     | 55           | 313           | 30.6124              |
| 80%        | 0.1221   | 0.0329  | 10.8257     | 58           | 330           | 30.6759              |
| 90%        | 0.1245   | 0.0333  | 13.0472     | 62           | 404           | 30.9644              |
| 95%        | 0.1247   | 0.0336  | 14.2936     | 66           | 432           | 31.6691              |
| 98%        | 0.1247   | 0.0353  | 14.2936     | 66           | 432           | 31.6691              |
| 99%        | 0.1247   | 0.0627  | 14.2936     | 66           | 432           | 31.6691              |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```

See more detail in: [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).
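
As a quick sanity check, the throughput figures in the summary follow directly from the totals. This small illustration recomputes them from the numbers reported in the table (small deviations come from the averages being rounded):

```python
# Figures taken from the benchmarking summary above
total_requests = 20
time_taken_s = 38.3744
avg_input_tokens = 50.25    # per request
avg_output_tokens = 254.6   # per request

request_throughput = total_requests / time_taken_s
output_token_throughput = total_requests * avg_output_tokens / time_taken_s
total_token_throughput = total_requests * (avg_input_tokens + avg_output_tokens) / time_taken_s

print(round(request_throughput, 2))       # 0.52 req/s (table: 0.5212)
print(round(output_token_throughput, 2))  # 132.69 tok/s
print(round(total_token_throughput, 2))   # 158.88 tok/s
```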

docs/source/developer_guide/evaluation/using_lm_eval.md
# Using lm-eval

This document will guide you through accuracy testing using [lm-eval][1].

## Online Server
### 1. Start the vLLM server
You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    /bin/bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```

The vLLM server has started successfully if you see logs like the following:

```
INFO:     Started server process [9446]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

### 2. Run gsm8k accuracy test using lm-eval

You can first query the server with an input prompt to check that it responds:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\
"<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
"    Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
"    Non-current assets: Net fixed assets 12 million yuan\n"\
"    Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
"    Non-current liabilities: Long-term loans 9 million yuan\n"\
"    Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
"Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
"Options:\n"\
"A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
"B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
"<|im_start|>assistant\n"'",
    "max_tokens": 1,
    "temperature": 0,
    "stop": ["<|im_end|>"]
}' | python3 -m json.tool
```

The output format matches the following:

```
{
    "id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
    "object": "text_completion",
    "created": 1754475138,
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "choices": [
        {
            "index": 0,
            "text": "A",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 252,
        "total_tokens": 253,
        "completion_tokens": 1,
        "prompt_tokens_details": null
    },
    "kv_transfer_params": null
}
```

Install lm-eval in the container:

```bash
export HF_ENDPOINT="https://hf-mirror.com"
pip install lm-eval[api]
```

Run the following command:

```bash
# Only test the gsm8k dataset in this demo
lm_eval \
    --model local-completions \
    --model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
    --tasks gsm8k \
    --output_path ./
```

After about 30 mins, the output is as shown below:

```
The markdown format results is as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3215|±  |0.0129|
|     |       |strict-match    |     5|exact_match|↑  |0.2077|±  |0.0112|
```

## Offline Server
### 1. Run docker container

You can run a docker container on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    /bin/bash
```

### 2. Run gsm8k accuracy test using lm-eval
Install lm-eval in the container:

```bash
export HF_ENDPOINT="https://hf-mirror.com"
pip install lm-eval
```

Run the following command:

```bash
# Only test the gsm8k dataset in this demo
lm_eval \
    --model vllm \
    --model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
    --tasks gsm8k \
    --batch_size auto
```

After 1-2 mins, the output is as shown below:

```
The markdown format results is as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3412|±  |0.0131|
|     |       |strict-match    |     5|exact_match|↑  |0.3139|±  |0.0128|
```

## Use offline datasets

Take gsm8k (a single dataset) and mmlu (a multi-subject dataset) as examples; you can see more from [here][2].

```bash
# Set HF_DATASETS_OFFLINE when using offline datasets
export HF_DATASETS_OFFLINE=1
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
# gsm8k yaml path
cd lm_eval/tasks/gsm8k
# mmlu yaml path
cd lm_eval/tasks/mmlu/default
```

Set [gsm8k.yaml][3] as follows:

```yaml
tag:
  - math_word_problems
task: gsm8k

# set dataset_path to arrow, json or parquet according to the downloaded dataset
dataset_path: arrow

# set dataset_name to null
dataset_name: null
output_type: generate_until

# add dataset_kwargs
dataset_kwargs:
  data_files:
    # train and test data download path
    train: /root/.cache/gsm8k/gsm8k-train.arrow
    test: /root/.cache/gsm8k/gsm8k-test.arrow

training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}

  A(Please follow the summarize the result at the end with the format of "The answer is xxx", where xx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
      - "(?s).*#### "
      - "\\.$"
generation_kwargs:
  until:
    - "Question:"
    - "</s>"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
  - name: "flexible-extract"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
      - function: "take_first"
metadata:
  version: 3.0
```

Set [_default_template_yaml][4] as follows:

```yaml
# set dataset_path according to the downloaded dataset
dataset_path: /root/.cache/mmlu
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```

You can see more usage on the [lm-eval docs][5].

[1]: https://github.com/EleutherAI/lm-evaluation-harness
[2]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets
[3]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
[4]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_default_template_yaml
[5]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md

docs/source/developer_guide/evaluation/using_opencompass.md

# Using OpenCompass
This document will guide you through accuracy testing with [OpenCompass](https://github.com/open-compass/opencompass).

## 1. Online Serving

You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

If your service starts successfully, you will see info like this:

```
INFO:     Started server process [6873]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

Once your server is started, you can query the model with input prompts in a new terminal:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```

## 2. Run ceval accuracy test using OpenCompass
Install OpenCompass and configure the environment variables in the container.

```bash
# Pin Python 3.10 due to:
# https://github.com/open-compass/opencompass/issues/1976
conda create -n opencompass python=3.10
conda activate opencompass
pip install opencompass modelscope[framework]
export DATASET_SOURCE=ModelScope
git clone https://github.com/open-compass/opencompass.git
```

Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following content:

```python
from mmengine.config import read_base
from opencompass.models import OpenAISDK

with read_base():
    from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets

# Only test ceval-computer_network dataset in this demo
datasets = ceval_datasets[:1]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Qwen2.5-7B-Instruct-vLLM-API',
        type=OpenAISDK,
        key='EMPTY',  # API key
        openai_api_base='http://127.0.0.1:8000/v1',
        path='Qwen/Qwen2.5-7B-Instruct',
        tokenizer_path='Qwen/Qwen2.5-7B-Instruct',
        rpm_verbose=True,
        meta_template=api_meta_template,
        query_per_second=1,
        max_out_len=1024,
        max_seq_len=4096,
        temperature=0.01,
        batch_size=8,
        retry=3,
    )
]
```

Run the following command:

```bash
python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
```

After 1-2 minutes, the output looks like this:

```
The markdown format results is as below:

| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|
| ceval-computer_network | db9ce2 | accuracy | gen | 68.42 |
```

You can see more usage on the [OpenCompass Docs](https://opencompass.readthedocs.io/en/latest/index.html).
# Purpose
What information do we need in order to perform a model forward pass?
- the inputs
- the corresponding attention metadata for the inputs

The following diagram shows what we should prepare for model inference.

```
              +---------------+
   inputs --> |               |
              |     model     | --> output
attn_meta --> |               |
              +---------------+
```

Therefore, as long as we have these two pieces of information, we can perform the model's forward propagation.

This article explains **how we obtain the inputs and their corresponding attention metadata**, shown on the left side of the diagram above.

# Overview
## 1. Obtain inputs
The workflow for obtaining inputs:
1. Get `token positions`: the relative position of each token within its request sequence.

2. Get `token indices`: the index of each scheduled token in the token table.

3. Get `Token IDs`: use the token indices to retrieve the token IDs from the **Token IDs table**.

Finally, these `Token IDs` are fed into the model, and `positions` are also passed to the model to create RoPE (rotary positional embedding). Both of them are inputs of the model.

**Note**: Because the `Token IDs` are the inputs of the model, we will refer to them as `Input IDs`.
## 2. Build inputs attention metadata
The model requires this attention metadata during the forward pass:
- `query start location`: the start and end locations of each request's scheduled tokens.
- `sequence length`: the length of each request, including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: the number of already-computed tokens for each request.
- `number of requests`: the number of requests in this batch.
- `number of tokens`: the total number of scheduled tokens in this batch.
- **`block table`**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory.
- `max query len`: the largest number of scheduled tokens among the requests in this batch.
- `slot mapping`: the index of the KV cache slot that each input token will be stored into.
- `attention mask`: the mask matrix applied to attention scores before softmax to control which tokens can attend to each other (usually a causal mask).

# Before start
There are mainly three levels of variables:
- token level: one attribute per scheduled token, so the length of such a variable is the number of scheduled tokens.
- request level: one attribute per scheduled request, whose length is usually the number of scheduled requests (`query start location` is a special case, which has one extra element).
- system level:
    1. **Token IDs table**: stores the token IDs (i.e. the inputs of the model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is the maximum number of concurrent requests allowed in a forward batch, and `max model len` is the maximum number of tokens this model can handle in one request sequence.
    2. **Block table**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory. The shape of this table is `(max num request, max model len / block size)`.

**Note**: How are these two tables formed?
- Both of them come from the `_update_states` method, which runs before **prepare inputs**. You can take a look there if you need more inspiration.

## Tips
What is a `Token ID`?
Simply put, a `token ID` is an **integer** (usually `int32`) that represents a token.
Example of `Token ID`s:

```
| Token ID     | Token         |
|--------------|---------------|
| 0            | [PAD]         |
| 1            | <|endoftext|> |
| 2            | <|start|>     |
| 3            | [SEP]         |
| 4            | I             |
| 5            | the           |
| 6            | be            |
| 7            | of            |
| 8            | and           |
| ...          | ...           |
| vocab_size-1 | <|im_end|>    |
```

# Go through details
Let's work through a simple example with the following assumptions:
- max tokens that can be scheduled at once: 10.
- `block size`: 2
- 3 requests are scheduled in total. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the maximum number of tokens this model can handle in one request sequence).

These values are configured when starting vLLM. They are not fixed, so you can set them manually.

## Step 1: All requests in the prefill phase

### Obtain inputs
Since at most 10 tokens can be scheduled at once, the scheduled token count of each request is `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` is in chunked prefill: 3 of its prompt tokens are still unscheduled.

#### 1. Get token positions:
First, determine which request each token belongs to: tokens 0~2 belong to request_0, tokens 3~4 belong to request_1, and tokens 5~9 belong to request_2. We can use `request indices` to record which request each token belongs to. `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`

For each request, add **the number of tokens already computed** to **the relative position within the currently scheduled tokens**: `request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1, ..., 0 + 4]`, and then concatenate them: `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. Note: the actual code uses a more efficient way (based on `request indices`) to create positions.

Finally, `token positions` is `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
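
The efficient `request indices` trick mentioned above can be sketched with NumPy (a simplified illustration of this example; the variable names are chosen for the demo, not taken from the actual vLLM code):

```python
import numpy as np

# Per-request scheduled token counts and already-computed token counts
num_scheduled_tokens = np.array([3, 2, 5])   # request_0, request_1, request_2
num_computed_tokens = np.array([0, 0, 0])    # all requests are in prefill

# request indices: which request each scheduled token belongs to
request_indices = np.repeat(np.arange(len(num_scheduled_tokens)),
                            num_scheduled_tokens)
# -> [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]

# relative position of each token within its request's scheduled chunk
cu = np.cumsum(num_scheduled_tokens)
rel_positions = np.arange(cu[-1]) - np.repeat(cu - num_scheduled_tokens,
                                              num_scheduled_tokens)

# token positions = tokens computed so far + relative position
token_positions = num_computed_tokens[request_indices] + rel_positions
print(token_positions.tolist())  # [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]
```

This is fully vectorized, so no Python-level loop over requests is needed.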

#### 2. Get token indices:
Here is the current **Token IDs table**, whose shape is `(max num request, max model len)`.

Why are `T_2_5`, `T_2_6`, `T_2_7` in this table even though they are not scheduled this time?
- We fill all token IDs of a request sequence into this table at once, but we only retrieve the tokens scheduled this time. The remaining token IDs are retrieved in later steps.

```
| T_0_0 | T_0_1 | T_0_2 |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |
| T_1_0 | T_1_1 |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |
| T_2_0 | T_2_1 | T_2_2 | T_2_3 | T_2_4 | T_2_5 | T_2_6 | T_2_7 |   ?   |   ?   |   ?   |   ?   |
|   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |
......
```

Note that each `T_x_y` is an `int32` token ID.

Let's say `M = max model len`. Then we can use `token positions` together with the `request indices` of each token to construct `token indices`.

So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
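
The gather index above can be computed directly (a NumPy sketch of this example with illustrative variable names; the real implementation operates on torch tensors):

```python
import numpy as np

M = 12  # max model len in this example
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
token_positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])

# The Token IDs table is row-major with one row per request, so the flat
# index of a token is: position + request_index * M
token_indices = token_positions + request_indices * M
print(token_indices.tolist())  # [0, 1, 2, 12, 13, 24, 25, 26, 27, 28]
```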

#### 3. Retrieve the Token IDs
We use the `token indices` to select the corresponding token IDs from the token table. The pseudocode looks like:

```
input_ids = token_table[token_indices]
```

As mentioned before, we refer to these token IDs as `Input IDs`:
- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_2_2, T_2_3, T_2_4]`

### Build inputs attention metadata
Here is the current **Block Table**; we use the first block (i.e. block_0) to mark unused blocks. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.

```
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
......
```

The KV cache blocks in device memory look like:

```
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```

Let's say `K = max model len / block size = 6`.

The workflow for computing the slot mapping:
1. Get `block table indices` using `K`, `positions`, and `request indices`. Purpose: for each token, these indices select the `device block number` from the `block table`.
2. Get `device block number` using `block table indices`. Purpose: the `device block number` indicates which device block each token belongs to.
3. Get `block offsets` using `positions` and `block size`. Purpose: `block offsets` indicates the offset of each token within its block.
4. Construct the `slot mapping` from `device block number` and `block offsets`. Purpose: the `slot mapping` tells us which KV cache slot each token is stored into.

Details:
1. Use a simple formula to calculate the `block table indices`: `request indices * K + positions / block size` (integer division). This equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`, which selects the `device block number` from the `block table`. **token level**
2. Use the `block table indices` to select the `device block number` for each scheduled token. The pseudocode looks like: `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`. **token level**
3. `block offsets` can be computed as `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`. **token level**
4. Finally, use `block offsets` and `device block number` to create the `slot mapping`: `device block number * block size + block offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`.
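
The four steps above can be sketched as follows (a NumPy illustration of this example; variable names are chosen for the demo and do not mirror the actual vLLM code):

```python
import numpy as np

block_size = 2
K = 6  # max model len / block size
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])

# Block table for this example: one row per request, 0 marks an unused block
block_table = np.array([
    [1, 2, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0],
    [4, 5, 6, 0, 0, 0],
])

# 1. flat index into the block table for each token
block_table_indices = request_indices * K + positions // block_size
# 2. device block number of each token
block_numbers = block_table.reshape(-1)[block_table_indices]
# 3. offset of each token within its block
block_offsets = positions % block_size
# 4. slot mapping: the KV cache slot of each token
slot_mapping = block_numbers * block_size + block_offsets
print(slot_mapping.tolist())  # [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
```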

Besides, we know the scheduled token counts are `[3, 2, 5]`. **request level**

- So we can use a prefix sum to calculate the `query start location`: `[0, 3, 5, 10]`. **request level**
- Because in step 1 all tokens are in prefill, the computed token counts are 0, so `sequence length` = `[3, 2, 5]`. **request level**
- As mentioned above, the `number of computed tokens` are all 0: `[0, 0, 0]`. **request level**
- `number of requests`: `3`.
- `number of tokens`: `10`.
- `max query len`: `5`.
- `slot mapping`: `[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`. **token level**
- `attention mask`: since all requests are doing prefill, we create a single causal mask matrix that is reused across requests. The shape of this mask matrix is `5 * 5`.
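
The request-level metadata above can all be derived from the scheduled and computed token counts; a small NumPy sketch (illustrative names, not the actual vLLM code):

```python
import numpy as np

num_scheduled_tokens = np.array([3, 2, 5])
num_computed_tokens = np.array([0, 0, 0])

# prefix sum with a leading 0: one more element than the number of requests
query_start_loc = np.concatenate(([0], np.cumsum(num_scheduled_tokens)))
seq_lens = num_computed_tokens + num_scheduled_tokens

num_reqs = len(num_scheduled_tokens)
num_tokens = int(num_scheduled_tokens.sum())
max_query_len = int(num_scheduled_tokens.max())

print(query_start_loc.tolist())  # [0, 3, 5, 10]
print(seq_lens.tolist())         # [3, 2, 5]
print(num_reqs, num_tokens, max_query_len)  # 3 10 5
```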

## Step 2: Chunked prefill
In step 2, we will no longer explain or perform every calculation; instead, we directly present the final results.

### Obtain inputs
The scheduled token count of each request: `{'0': 1, '1': 1, '2': 3}`.

1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`

Here is the current **Token IDs table**:

```
| T_0_0 | T_0_1 | T_0_2 | T_0_3 |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |
| T_1_0 | T_1_1 | T_1_2 |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |
| T_2_0 | T_2_1 | T_2_2 | T_2_3 | T_2_4 | T_2_5 | T_2_6 | T_2_7 |   ?   |   ?   |   ?   |   ?   |
|   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |   ?   |
......
```

**Note**: **T_0_3** and **T_1_2** are new token IDs of request_0 and request_1 respectively; they were sampled from the model's output.

3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_2_5, T_2_6, T_2_7]`

### Build inputs attention metadata
Here is the current **Block Table**. **Note**: We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, because they need more device space to store the KV cache of newly generated or newly chunk-prefilled tokens.

```
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 7 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 8 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
......
```

The KV cache blocks in device memory still look like:

```
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```

1. `block table indices`: `[1, 7, 14, 15, 15]`. **token level**
2. `device block number`: `[2, 7, 6, 8, 8]`. **token level**
3. `block offsets`: `[1, 0, 1, 0, 1]`. **token level**
4. `slot mapping`: `[5, 14, 13, 16, 17]`. **token level**

The scheduled token counts are `[1, 1, 3]`:
- `query start location`: `[0, 1, 2, 5]`
- `sequence length`: `[4, 3, 8]`
- `number of computed tokens`: `[3, 2, 5]`
- `number of requests`: `3`
- `max query len`: `3`
- `slot mapping`: `[5, 14, 13, 16, 17]`
- `attention mask`: `5 * 8`. Each token has a `1 * 8` vector, and there are 5 scheduled tokens.

# At last
If you understand step 1 and step 2, you will understand all the following steps.

We hope this article helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, contributions are welcome.

docs/source/developer_guide/feature_guide/index.md

# Feature Guide

This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1

patch
ModelRunner_prepare_inputs
:::

docs/source/developer_guide/feature_guide/patch.md

# Patch in vLLM Ascend

vLLM Ascend is a platform plugin for vLLM. Because the release cycles of vLLM and vLLM Ascend differ, and because of hardware limitations in some cases, we need to patch some code in vLLM to make it compatible with vLLM Ascend.

In the vLLM Ascend codebase, we provide a patch module, `vllm_ascend/patch`, to hold the changes against vLLM.

## Principle

Keep in mind that patching is not the best way to make vLLM Ascend compatible; it is only a temporary solution. The best way is to contribute the change to vLLM itself so that it is compatible with vLLM Ascend natively. In vLLM Ascend, we follow these basic principles for the patch strategy:

1. Less is more. Please do not patch unless it's currently the only way.
2. Once a patch is added, the future plan for removing it must be described.
3. Cleaning up patch code is welcome at any time.

## How it works

In `vllm_ascend/patch`, the code is structured as follows:

```
vllm_ascend
├── patch
│   ├── platform
│   │   ├── patch_0_10_0
│   │   ├── patch_common
│   │   ├── patch_main
│   ├── worker
│   │   ├── patch_0_10_0
│   │   ├── patch_common
│   │   ├── patch_main
└───────────
```

- **platform**: The patch code in this directory patches the code in the vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early, when vLLM is initialized.
    - For online mode, the vLLM process calls the platform patch at `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the CLI args.
    - For offline mode, the vLLM process calls the platform patch at `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory patches the code in the vLLM worker process. It's called by `vllm_ascend/worker/worker_v1::NPUWorker::__init__` when the vLLM worker process is initialized.
    - For both online and offline mode, the vLLM engine core process calls the worker patch at `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.

In both the **platform** and **worker** folders there are several patch modules. They are used for patching different versions of vLLM.

- `patch_0_10_0`: This module patches vLLM 0.10.0, which is always the most recently released version of vLLM. Once a new vLLM version is released, we will drop this patch module and bump to the new version.
- `patch_main`: This module patches the code on the vLLM main branch.
- `patch_common`: This module patches both vLLM 0.10.0 and the vLLM main branch.

## How to write a patch

Before writing a patch, following the principle above, we should patch as little code as possible. If it's necessary, we can patch the code in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.

1. Decide which version of vLLM to patch. For example, after analysis, here we want to patch both 0.10.0 and main of vLLM.
2. Decide which process to patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:

```python
import vllm


def patch_destroy_model_parallel():
    # your patch code
    ...


vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```

5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add a description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:

```
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `<The target patch module in vLLM>`
#      Why:
#         <Describe the reason why we need to patch>
#      How:
#         <Describe the way to patch>
#      Related PR (if no, explain why):
#         <Add a link to the related PR in vLLM. If there is no related PR, explain why>
#      Future Plan:
#         <Describe the future plan to remove the patch>
```

7. Add unit tests and E2E tests. Any newly added code in vLLM Ascend should be covered by unit tests and E2E tests as well. You can find more details in the [test guide](../contribution/testing.md).

## Limitation
1. In the V1 engine, vLLM starts three kinds of processes: the main process, the EngineCore process, and the worker process. Currently, vLLM Ascend only supports patching code in the main process and worker process by default. If you want to patch code that runs in the EngineCore process, you have to patch the EngineCore process entirely during setup; the entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running an edited vLLM codebase, the vLLM version may change automatically. For example, if you run an edited vLLM based on v0.9.n, the version may become v0.9.nxxx. In this case, the patches for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend cannot recognize the vLLM version you're using. You can set the environment variable `VLLM_VERSION` to specify the vLLM version you're using, and then the patches for that version will work.

docs/source/developer_guide/modeling/adding_a_new_model.md

# Adding a New Model

This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to read
[vllm official doc: Adding a New Model](https://docs.vllm.ai/en/stable/contributing/model/) first.

## Step 1: Implementing Models with `torch` and `torch_npu`

This section provides instructions for implementing new models compatible with vllm and vllm-ascend.

**Before starting:**

- Verify whether your model already exists in vllm's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Use existing models' implementations as templates to accelerate your development.

### Method 1: Implementing New Models from Scratch

Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.

**Key implementation requirements:**

1. Place model files in the `vllm_ascend/models/` directory.

2. Standard module structure for decoder-only LLMs (please check out vllm's implementations for other kinds of models):

    - `*ModelForCausalLM` (top-level wrapper)
    - `*Model` (main architecture)
    - `*DecoderLayer` (transformer block)
    - `*Attention` and `*MLP` (specific computation units)

:::{note}
`*` denotes your model's unique identifier.
:::

3. Critical Implementation Details:

    All modules must include a `prefix` argument in `__init__()`.

    **Required interfaces:**

    | Module Type         | Required Methods                                         |
    | :------------------ | :------------------------------------------------------- |
    | `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` |
    | `*Model`            | `get_input_embeddings`, `load_weights`                   |

4. Attention Backend Integration:

    Importing attention via `from vllm.attention import Attention` automatically leverages vllm-ascend's attention backend routing (see `get_attn_backend_cls()` in `vllm_ascend/platform.py`).

5. Tensor Parallelism:

    Use vllm's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models that support tensor parallelism. Note that Ascend-specific customizations are implemented in the `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).

**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):

```python
from collections.abc import Iterable
from typing import Optional, Union

import torch
from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.sequence import IntermediateTensors
from vllm.model_executor.sampling_metadata import SamplingMetadata


class CustomAttention(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.attn = Attention(prefix=f"{prefix}.attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement attention logic
        ...


class CustomDecoderLayer(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement decoder layer
        ...


class CustomModel(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.layers = nn.ModuleList([
            CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}")
            for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
        ])

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def load_weights(self,
                     weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...


class CustomModelForCausalLM(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def compute_logits(self,
                       hidden_states: torch.Tensor,
                       sampling_metadata: SamplingMetadata) -> torch.Tensor:
        ...

    def load_weights(self,
                     weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...
```

### Method 2: Customizing Existing vLLM Models

For most use cases, extending an existing implementation is preferable. The example below shows how to inherit from a vLLM base class to implement a custom DeepSeek model (assumed path: `vllm_ascend/models/deepseek_v2.py`).

```python
from typing import List, Optional, Union

import torch
from vllm.attention import AttentionMetadata
from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
from vllm.sequence import IntermediateTensors


class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
    # Define merged weights for quantization/efficiency
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    }

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        kv_caches: Optional[List[torch.Tensor]] = None,
        attn_metadata: Optional[AttentionMetadata] = None,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        # Custom forward logic
        hidden_states = self.model(
            input_ids,
            positions,
            kv_caches,
            attn_metadata,
            intermediate_tensors,
            inputs_embeds
        )
        return hidden_states
```

:::{note}
For a complete implementation reference, see: `vllm_ascend/models/deepseek_v2.py`.
:::

## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM

vLLM provides a plugin mechanism for registering externally implemented models without modifying its codebase.

To integrate a model implemented in the `vllm_ascend/models/` directory:

1. Import your model implementation in `vllm_ascend/models/__init__.py` using relative imports.
2. Register the model wrapper class via the `vllm.ModelRegistry.register_model()` function.

**Reference Registration Template** (an example of registering new models in `vllm_ascend/models/__init__.py`):

```python
from vllm import ModelRegistry


def register_model():
    from .custom_model import CustomModelForCausalLM  # New custom model
    from .deepseek_v2 import CustomDeepseekV2ForCausalLM  # Customized DeepSeek

    # For NEW architectures: register with a unique name
    ModelRegistry.register_model(
        "CustomModelForCausalLM",  # Must match 'architectures' in config.json
        "vllm_ascend.models.custom_model:CustomModelForCausalLM"
    )

    # For MODIFIED architectures: use the original name
    ModelRegistry.register_model(
        "DeepseekV2ForCausalLM",  # Original architecture identifier in vLLM
        "vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM"
    )
```

:::{note}
The first argument of `vllm.ModelRegistry.register_model()` is the unique architecture identifier, which must match `architectures` in the model's `config.json`.

```json
{
    "architectures": [
        "CustomModelForCausalLM"
    ]
}
```

:::
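
Before registering, you can sanity-check the identifier against the checkpoint on disk. Below is a minimal sketch; the helper name and the synthetic config file are ours, used only so the snippet is self-contained:

```python
import json
import tempfile


def read_architectures(config_path: str) -> list:
    """Return the 'architectures' list from a Hugging Face config.json."""
    with open(config_path) as f:
        return json.load(f).get("architectures", [])


# Demo with a synthetic config.json standing in for a real checkpoint directory
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"architectures": ["CustomModelForCausalLM"]}, f)
    config_path = f.name

archs = read_architectures(config_path)
assert "CustomModelForCausalLM" in archs  # must match the registered name
print(archs)  # ['CustomModelForCausalLM']
```

If the registered name and the `architectures` entry disagree, vLLM will not resolve your model class for that checkpoint.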

## Step 3: Verification

### Case 1: Overriding an Existing vLLM Model Architecture

If you register a customized model architecture based on an existing vLLM implementation (overriding vLLM's original class), you will observe a warning log similar to the following, emitted from `vllm/model_executor/models/registry.py`, when running vLLM offline/online inference (with any model):

```bash
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
```

### Case 2: Registering a New Model Architecture

If you register a novel model architecture not present in vLLM (a completely new class), the logs do not provide explicit confirmation by default. It is recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`:

```python
logger.info(f"model_arch: {model_arch} has been registered here!")
```

After adding this line, you will see a confirmation log like the one below when running vLLM offline/online inference (with any model):

```bash
model_arch: CustomModelForCausalLM has been registered here!
```

This log output confirms that your new model architecture has been successfully registered in vLLM.

## Step 4: Testing

After adding a new model, run basic functional tests (offline/online inference), accuracy tests, and performance benchmarks for the model.

Find more details at:

- [Accuracy test guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/index.html)
- [Performance benchmark guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/performance/performance_benchmark.html)

## Step 5: Updating Supported Models Doc

Finally, once all the steps above are completed, add the new model to our [Supported Models](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html) doc.

@@ -0,0 +1,3 @@
# Adding a New Multi-Modal Model

**_Coming soon ..._**

10
docs/source/developer_guide/modeling/index.md
Normal file
@@ -0,0 +1,10 @@
# Modeling

This section provides tutorials on how to implement and register a new model into vllm-ascend.

:::{toctree}
:caption: Modeling
:maxdepth: 1
adding_a_new_model
adding_a_new_multimodal_model
:::

9
docs/source/developer_guide/performance/index.md
Normal file
@@ -0,0 +1,9 @@
# Performance

:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
optimization_and_tuning
:::

@@ -0,0 +1,183 @@
# Optimization and Tuning

This guide helps users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and more. Any feedback is welcome.

## Preparation

Run the container:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
    --name performance-test \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```

Configure your environment:

```{code-block} bash
:substitutions:
# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install os packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y
```

Install vllm and vllm-ascend:

```{code-block} bash
:substitutions:
# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pytest

# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=true
```

Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vllm, vllm-ascend and mindie-turbo are installed correctly.

:::{note}
Make sure vllm and vllm-ascend are installed after your Python configuration is complete, because these packages build binary files using the Python in the current environment. If you install vllm, vllm-ascend and mindie-turbo before completing section 1.1, the binaries will not use the optimized Python.
:::

## Optimizations

### 1. Compilation Optimization

#### 1.1. Install optimized `python`

Python supports **LTO** and **PGO** optimization starting from version `3.6`, and both can be enabled at compile time. For convenience, we offer prebuilt, compilation-optimized `python` packages. You can also reproduce the `python` build for your specific scenario by following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html).

```{code-block} bash
:substitutions:
mkdir -p /workspace/tmp
cd /workspace/tmp

# Download prebuilt lib and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz

# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf /usr/local/python/bin/python3 /usr/bin/python
ln -sf /usr/local/python/bin/python3 /usr/bin/python3
ln -sf /usr/local/python/bin/python3.11 /usr/bin/python3.11
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip3
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip

export PATH=/usr/bin:/usr/local/python/bin:$PATH
```

### 2. OS Optimization

#### 2.1. jemalloc

**jemalloc** is a memory allocator that improves performance in multi-threaded scenarios and reduces memory fragmentation. It uses per-thread memory managers to allocate variables, which avoids lock contention between threads and can substantially improve performance.

```{code-block} bash
:substitutions:
# Install jemalloc
sudo apt update
sudo apt install libjemalloc2

# Configure jemalloc
export LD_PRELOAD="/usr/lib/$(uname -i)-linux-gnu/libjemalloc.so.2 $LD_PRELOAD"
```

#### 2.2. Tcmalloc

**Tcmalloc (Thread-Caching Malloc)** is a general-purpose memory allocator that improves overall performance while keeping latency low, by introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).

```{code-block} bash
:substitutions:
# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev

# Get the location of libtcmalloc.so*
find /usr -name "libtcmalloc.so*"

# Raise the priority of tcmalloc
# The <path> is the location of libtcmalloc.so obtained from the command above
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"

# Verify your configuration
# The path of libtcmalloc.so will appear in the output if your configuration is valid
ldd "$(which python)"
```

### 3. `torch_npu` Optimization

Some performance tuning features in `torch_npu` are controlled by environment variables. Several of these features and their related environment variables are shown below.

Memory optimization:

```{code-block} bash
:substitutions:
# Upper limit (MB) of memory block splitting; setting this prevents large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

# When operators on the communication stream have dependencies, they must all finish before their memory is released for reuse. Multi-stream reuse releases memory on the communication stream early so that the computing stream can reuse it.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

# Note: a later export overrides an earlier one; to enable both options at once,
# combine them in a single comma-separated value:
# export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250,expandable_segments:True"
```

Schedule optimization:

```{code-block} bash
:substitutions:
# Optimize the operator dispatch queue. This affects peak memory usage and may degrade performance when memory is tight.
export TASK_QUEUE_ENABLE=2

# This greatly improves CPU-bound models while keeping the same performance for NPU-bound models.
export CPU_AFFINITY_CONF=1
```

### 4. CANN Optimization

#### 4.1. HCCL Optimization

HCCL also has performance tuning features controlled by environment variables.

You can configure HCCL to use "AIV" mode by setting the environment variable shown below. In "AIV" mode, communication is scheduled directly by the AI vector cores over RoCE, instead of by the AI CPU.

```{code-block} bash
:substitutions:
export HCCL_OP_EXPANSION_MODE="AIV"
```

In addition, there are more features for performance optimization in specific scenarios, shown below.

- `HCCL_INTRA_ROCE_ENABLE`: Use an RDMA link instead of an SDMA link as the mesh interconnect between two 8P domains; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Configure the traffic class of the RDMA network card; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Configure the service level of the RDMA network card; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Control the cache size used for sharing data between two NPUs; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
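
For illustration, several of these knobs can be combined in one environment setup before launching a server. The values below are placeholders for demonstration, not tuned recommendations:

```shell
# Illustrative HCCL environment setup; values are placeholders, not recommendations
export HCCL_OP_EXPANSION_MODE="AIV"  # schedule communication on AI vector cores
export HCCL_BUFFSIZE=200             # shared buffer size (MB) between two NPUs
echo "HCCL_OP_EXPANSION_MODE=${HCCL_OP_EXPANSION_MODE}, HCCL_BUFFSIZE=${HCCL_BUFFSIZE}"
```

Measure with the benchmark scripts before and after changing any of these variables, since the right values depend on your topology and workload.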
194
docs/source/developer_guide/performance/performance_benchmark.md
Normal file
@@ -0,0 +1,194 @@
# Performance Benchmark

This document details the benchmark methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vllm project.

**Benchmark Coverage**: We measure offline end-to-end latency and throughput, as well as fixed-QPS online serving benchmarks; for more details see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

## 1. Run docker container

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    /bin/bash
```

## 2. Install dependencies

```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```

## 3. (Optional) Prepare model weights

For faster startup, we recommend downloading the model in advance:

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```

You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:

```json
[
    {
        "test_name": "latency_llama8B_tp1",
        "parameters": {
            "model": "your local model path",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]
```

## 4. Run benchmark script

Run the benchmark script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```

After about 10 minutes, the output looks like this:

```bash
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  212.77
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              0.94
Output token throughput (tok/s):         204.66
Total Token throughput (tok/s):          405.16
---------------Time to First Token----------------
Mean TTFT (ms):                          104.14
Median TTFT (ms):                        102.22
P99 TTFT (ms):                           153.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.78
Median TPOT (ms):                        38.70
P99 TPOT (ms):                           48.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.46
Median ITL (ms):                         36.96
P99 ITL (ms):                            75.03
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  72.55
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              2.76
Output token throughput (tok/s):         600.24
Total Token throughput (tok/s):          1188.27
---------------Time to First Token----------------
Mean TTFT (ms):                          115.62
Median TTFT (ms):                        109.39
P99 TTFT (ms):                           169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.48
Median TPOT (ms):                        52.40
P99 TPOT (ms):                           69.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.47
Median ITL (ms):                         43.95
P99 ITL (ms):                            130.29
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  47.82
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              4.18
Output token throughput (tok/s):         910.62
Total Token throughput (tok/s):          1802.70
---------------Time to First Token----------------
Mean TTFT (ms):                          128.50
Median TTFT (ms):                        128.36
P99 TTFT (ms):                           187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.60
Median TPOT (ms):                        77.85
P99 TPOT (ms):                           165.90
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.72
Median ITL (ms):                         54.84
P99 ITL (ms):                            289.63
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  41.26
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              4.85
Output token throughput (tok/s):         1055.44
Total Token throughput (tok/s):          2089.40
---------------Time to First Token----------------
Mean TTFT (ms):                          3394.37
Median TTFT (ms):                        3359.93
P99 TTFT (ms):                           3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.28
Median TPOT (ms):                        64.19
P99 TPOT (ms):                           97.66
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.62
Median ITL (ms):                         55.69
P99 ITL (ms):                            82.90
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```

The result JSON files are generated under `benchmark/results`. These files contain detailed benchmark results for further analysis.

```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
```
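
For further analysis, you can aggregate these JSON files in a few lines of Python. This is a hedged sketch: the `mean_ttft_ms` field name is an assumption about the serving-result schema, so check one of your own files first, and the demo synthesizes a result file instead of reading real benchmark output:

```python
import json
import pathlib
import tempfile


def summarize_serving_results(results_dir: str) -> dict:
    """Map each serving result file to its mean TTFT (ms); field name is assumed."""
    summary = {}
    for path in sorted(pathlib.Path(results_dir).glob("serving_*.json")):
        data = json.loads(path.read_text())
        summary[path.stem] = data.get("mean_ttft_ms")
    return summary


# Demo with a synthetic result file in a temporary directory
results_dir = pathlib.Path(tempfile.mkdtemp())
(results_dir / "serving_llama8B_tp1_qps_1.json").write_text(
    json.dumps({"mean_ttft_ms": 104.14}))

summary = summarize_serving_results(str(results_dir))
print(summary)  # {'serving_llama8B_tp1_qps_1': 104.14}
```

The same pattern extends to other metrics (throughput, TPOT, ITL) once you have confirmed the field names present in your result files.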

@@ -0,0 +1,40 @@
# Profile Execute Duration

The execution duration of each stage (including pre/post-processing, model forward, etc.) often needs to be captured during a complete inference pass. Typically, this is done with `torch.npu.synchronize()` and CPU timestamps, which adds host/device synchronization overhead.

**To reduce this overhead, we added this feature, which uses the NPU event timestamp mechanism to observe device execution time asynchronously.**

## Usage
* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously where you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
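
The real implementation records NPU events; purely to illustrate the capture/pop pattern described above, here is a pure-Python stand-in (the class name and the `perf_counter` timing mechanism are ours, not the vllm-ascend API):

```python
import time
from contextlib import contextmanager


class DurationObserverSketch:
    """Pure-Python stand-in: record cheap timestamps at capture time,
    then compute and report durations only when popped."""

    def __init__(self):
        self._captured = []

    @contextmanager
    def capture_async(self, tag: str):
        start = time.perf_counter()  # cheap timestamp, no device sync
        yield
        self._captured.append((tag, (time.perf_counter() - start) * 1000))

    def pop_captured_sync(self):
        """Blocking side: drain and return all observed durations (ms)."""
        durations, self._captured = dict(self._captured), []
        return durations


obs = DurationObserverSketch()
with obs.capture_async("forward"):
    time.sleep(0.01)  # stands in for the model forward pass
durations = obs.pop_captured_sync()
print(durations)
```

The key design point mirrored here is that the expensive synchronization happens once at pop time, not at every observation point.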
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**

```bash
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```

## Example Output

```bash
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```