[Doc][Misc] Correcting the document and uploading the model deployment template (#8287)

### What this PR does / why we need it?
Correcting the document and uploading the model deployment template

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
herizhen
2026-04-15 16:03:11 +08:00
committed by GitHub
parent 147b589f62
commit 95726d20eb
31 changed files with 536 additions and 308 deletions

View File

@@ -0,0 +1,233 @@
# Deployment Tutorial Template Based on the XXX Model
This template is based on deployment tutorials for models such as DeepSeek-V3.2 and Qwen-VL-Dense, and is intended to serve as a reference for technical documentation writing. Users can systematically construct relevant technical documentation by following the guidelines provided in this template.
## 1 Introduction
**Content Writing Requirements:**
- Provide a one-sentence description of the model's basic architecture, core features, and primary application scenarios.
- Provide a one-sentence description of the document's purpose and the objectives to be achieved.
- Specify the version of vLLM-Ascend used in the document and the version support status of the model.
**Example 1: Model Introduction**
DeepSeek-V3.2 is a sparse attention model. Its core architecture is similar to that of DeepSeek-V3.1, but it employs a sparse attention mechanism, aiming to explore and validate optimization solutions for training and inference efficiency in long-context scenarios.
**Example 2: Document Purpose**
This document will demonstrate the primary validation steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, as well as accuracy and performance evaluation.
**Example 3: Version Information**
This document is validated and written based on **vLLM-Ascend v0.13.0**. The model (XXX) is fully supported in this version and runs stably on all **v0.13.0 and later versions**. To use the latest features (e.g., PD separation, MTP), it is recommended to use v0.13.0 or a later version.
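For instance, a quick way to confirm which versions are installed in the current environment (a minimal sketch assuming a pip-based installation; package names may differ in your image):
```bash
# Check the installed vLLM and vLLM-Ascend versions (pip-based install assumed)
pip show vllm vllm-ascend | grep -E "^(Name|Version)"
```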
## 2 Feature Matrix
This section introduces the features supported by the model, including supported hardware, quantization methods, data parallelism, long-sequence features, etc.
**Content Writing Requirements:**
- Present the support status of models and features in a table format.
- Alternatively, provide references with hyperlinks.
**Example 1: Feature Support List**
| Model Name | Support Status | Remarks | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Caching | LoRA | Speculative Decoding | Asynchronous Scheduling | Tensor Parallelism | Pipeline Parallelism | Expert Parallelism | Data Parallelism | Prefill-Decode Separation | Segmented ACL Graph Execution | Full ACL Graph Execution | Max Model Length | MLP Weight Prefetch | Documentation |
| ------ | ---------- | ------ | ------ | ---------- | ------ | ------------ | -------------- | ------ | ---------- | ---------- | ---------- | ------------ | ---------- | ---------- | ------------------- | ----------- | ----------- | ------------- | ------------- | ---------- |
| DeepSeek V3/3.1 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 240k | | [DeepSeek-V3.1](../../tutorials/models/DeepSeek-V3.1.md) |
| DeepSeek V3.2 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 160k | ✅ | [DeepSeek-V3.2](../../tutorials/models/DeepSeek-V3.2.md) |
| DeepSeek R1 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 128k | | [DeepSeek R1](../../tutorials/models/DeepSeek-R1.md) |
| Qwen3 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | | | ✅ | ✅ | | | ✅ | | ✅ | ✅ | 128k | ✅ | [Qwen3](../../tutorials/models/Qwen3-Dense.md) |
**Note**: This is a simplified example. Please refer to the complete feature matrix for the full table.
**Example 2: Reference Citation**
Please refer to the [Supported Features List](../user_guide/support_matrix/supported_models.md) for the model support matrix.
Please refer to the [Feature Guide](../user_guide/feature_guide/index.md) for feature configuration information.
## 3 Environment Preparation
### 3.1 Model Weight
**Content Writing Requirements:** Describe the hardware resources, software environment, and model files required for deployment.
**Example:**
| Model Version | Hardware Requirements | Download Link |
| ---------- | ---------- | ---------- |
| DeepSeek-V3.2-Exp (BF16) | 2×Atlas 800 A3 (64G×16)<br>4×Atlas 800 A2 (64G×8) | [Model Weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16) |
| DeepSeek-V3.2-Exp-w8a8 (Quantized) | 1×Atlas 800 A3 (64G×16)<br>2×Atlas 800 A2 (64G×8) | [Model Weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8) |
| DeepSeek-V3.2-w8a8 (Quantized) | 1×Atlas 800 A3 (64G×16)<br>2×Atlas 800 A2 (64G×8) | [Model Weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/) |
### 3.2 Verify Multi-node Communication (Optional)
**Example:**
If multi-node deployment is required, please follow the [Verify Multi-node Communication Environment](../installation.md#verify-multi-node-communication) guide for communication verification.
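As a quick illustration only (assuming a standard Ascend driver installation that provides `hccn_tool`, and device IDs 0-7), the device-side NIC IPs can be listed and cross-node reachability checked roughly as follows:
```bash
# List the device NIC IP of each NPU (device IDs 0-7 assumed)
for i in {0..7}; do
    hccn_tool -i $i -ip -g
done
# Then ping a peer node's device NIC IP from a local device, for example:
# hccn_tool -i 0 -ping -g address <peer_device_ip>
```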
## 4 Installation
**Content Writing Requirements:**
- Provide specific steps and startup commands, covering both single-node and multi-node configurations.
- Provide explanations for parameters, including meaning, value range, and units.
- Specify the basic environment variables and communication environment variables that need to be enabled, with explanations including meaning, value range, and units.
### 4.1 Docker Image Installation
**Example:** Omitted
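Although the example is omitted in this template, a container launch typically looks roughly like the sketch below; the image tag, device IDs, and mount paths are placeholders and must match your environment:
```bash
# Illustrative sketch only: adjust the image tag, devices, and mounts to your environment
export IMAGE=quay.io/ascend/vllm-ascend:<tag>
docker run --rm -it --name vllm-ascend \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    $IMAGE bash
```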
### 4.2 Source Code Installation
**Example:** Omitted
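As a minimal sketch of a source build (branch and tag names are placeholders; CANN, the Ascend driver, and torch-npu are assumed to be installed already):
```bash
# Illustrative source install; pin the vLLM and vLLM-Ascend versions you actually need
git clone --depth 1 --branch <vllm_tag> https://github.com/vllm-project/vllm.git
cd vllm && VLLM_TARGET_DEVICE=empty pip install -e . && cd ..

git clone --depth 1 --branch <vllm_ascend_tag> https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend && pip install -e . && cd ..
```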
## 5 Online Service Deployment
### 5.1 Single-Node Online Deployment
**Content Writing Requirements:**
- Describe the architectural characteristics and applicable scenarios of single-node deployment.
- Provide startup command templates and key parameter descriptions.
- Provide service verification methods.
**Example:**
Single-node deployment completes both Prefill and Decode within the same node, suitable for XXX scenarios.
Startup Command:
```bash
# Omitted
```
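For reference, a single-node startup command usually has the following shape; the model path, parallel sizes, and port below are placeholders, not recommendations for any specific model:
```bash
# Illustrative single-node startup; replace the model path and sizes with your own
vllm serve /path/to/model \
    --served-model-name xxx \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.9 \
    --port 8000
```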
Service Verification:
```bash
# Omitted
```
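A simple check is to query the model list once the server reports that it is ready (host and port are placeholders):
```bash
# Verify that the OpenAI-compatible endpoint is up and the model is registered
curl http://<node_ip>:<port>/v1/models
```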
### 5.2 Multi-Node PD Separation Deployment
**Content Writing Requirements:**
- Describe the principles of PD separation architecture and applicable scenarios.
- List prerequisites (network, storage, permissions).
- Provide script frameworks and key configuration item descriptions.
- Specify node role division and startup procedures.
- Indicate performance metrics.
**Example:** Omitted
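Although the full example is omitted here, the deployment usually consists of prefill instances started in a KV-producer role, decode instances started in a KV-consumer role, and a proxy that routes requests between them. The snippet below is an illustrative sketch only; the connector name and JSON fields must follow the KV-transfer configuration actually supported by your vLLM/vLLM-Ascend version:
```bash
# Illustrative sketch only: prefill (producer) and decode (consumer) roles with a KV connector
# Prefill node
vllm serve /path/to/model --port 8100 \
    --kv-transfer-config '{"kv_connector": "<connector_name>", "kv_role": "kv_producer"}'

# Decode node
vllm serve /path/to/model --port 8200 \
    --kv-transfer-config '{"kv_connector": "<connector_name>", "kv_role": "kv_consumer"}'

# A proxy (see the examples directory in the vllm-ascend repository) then exposes a single
# endpoint and forwards requests to the prefill and decode instances.
```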
### 5.3 Special Deployment Modes (Optional)
**Content Writing Requirements:**
- If the model features non-standard deployment modes (e.g., offline batch processing for embedding models, low-latency online serving for reranker models), the corresponding deployment solutions must be explicitly documented.
- Section 5 "Online Service Deployment" provides examples for single-node online service deployment and multi-node PD-separated deployment, which can be referenced and extended.
## 6 Functional Verification
**Content Writing Requirements:** Guide users on how to test the basic functionality of the model through simple interface calls after the service is started.
**Example:**
After the service is started, the model can be invoked by sending a prompt:
```shell
curl http://<node0_ip>:<port>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek_v3.2",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```
## 7 Accuracy Evaluation
**Content Writing Requirements:** Introduce standardized methods and tools for evaluating model output quality (accuracy). Two accuracy evaluation methods are provided below as examples; alternatively, provide direct links to existing documentation.
### Using AISBench
For details, please refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md).
### Using Language Model Evaluation Harness
Using the `gsm8k` dataset as an example test dataset, run the accuracy evaluation for `DeepSeek-V3.2-W8A8` in online mode.
1. For `lm_eval` installation, please refer to [Using lm_eval](../developer_guide/evaluation/using_lm_eval.md).
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
## 8 Performance
Omitted. Requirements are the same as for Accuracy Evaluation.
## 9 Best Practices
**Content Writing Requirements:**
Provide, for each model, the recommended configuration that achieves optimal performance in three scenarios (long sequence, low latency, high throughput), but do not include specific performance data.
## 10 Performance Tuning (Optional)
**Content Writing Requirements:**
- Summarize key optimization techniques and parameter tuning experiences for the model to help users achieve optimal performance in specific scenarios. Include optimization technique descriptions, enablement methods, parameter tuning recommendations, and typical configuration examples.
- Hyperlinks to the features guide may be used to allow users to view detailed descriptions of specific features.
### 10.1 Key Optimization Points
In this section, we introduce the key optimization points that can significantly improve the performance of the XXX model. These techniques aim to improve throughput and efficiency in various scenarios.
#### 10.1.1 Basic Optimizations
**Example:**
The following optimizations are enabled by default and require no additional configuration:
| Optimization Technique | Technical Principle | Performance Benefit |
| --------- | --------- | --------- |
| Rope Optimization | The cos_sin_cache and indexing operations of positional encoding are executed only in the first layer, and subsequent layers reuse them directly | Reduces redundant computation during the decoding phase, accelerating inference |
| AddRMSNormQuant Fusion | Fuses the residual add, RMSNorm, and quantization operations into a single operator | Optimizes memory access patterns, improving computational efficiency |
| Zero-like Elimination | Removes unnecessary zero-tensor operations in Attention forward pass | Reduces memory footprint, improves matrix operation efficiency |
| FullGraph Optimization | Captures and replays the entire decoding graph at once using `compilation_config={"cudagraph_mode":"FULL_DECODE_ONLY"}` | Significantly reduces scheduling latency, stabilizes multi-device performance |
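As a reference for the last row, full-graph capture is selected through the compilation config at startup; a minimal sketch (the model path is a placeholder):
```bash
# Capture and replay the full decode graph (illustrative)
vllm serve /path/to/model \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```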
#### 10.1.2 Advanced Optimizations (Require Explicit Enablement)
**Example:**
| Optimization Technique | Technical Principle | Enablement Method | Applicable Scenarios | Precautions |
| --------- | --------- | --------- | --------- | --------- |
| FlashComm_v1 | Decomposes traditional Allreduce into Reduce-Scatter and All-Gather, reducing RMSNorm computation dimensions | `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` | High-concurrency, Tensor Parallelism (TP) scenarios | Threshold protection: Only takes effect when the actual number of tokens exceeds the threshold to avoid performance degradation in low-concurrency scenarios |
| Matmul-ReduceScatter Fusion | Fuses matrix multiplication and Reduce-Scatter operations to achieve pipelined parallel processing | Automatically enabled after enabling FlashComm_v1 | Large-scale distributed environments | Same as FlashComm_v1, has threshold protection |
| Weight Prefetch | Utilizes vector computation time to prefetch MLP weights into L2 cache in advance | `export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` | MLP-intensive scenarios (Dense models) | Requires coordination with prefetch buffer size adjustment |
| Asynchronous Scheduling | Non-blocking task scheduling to improve concurrent processing capability | `--async-scheduling` | Large-scale models, high-concurrency scenarios | Should be used in coordination with FullGraph optimization |
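The environment variables and flags listed above are typically combined in the launch script, for example (illustrative only; enable each item only in the scenarios noted in the table):
```bash
# Enable FlashComm_v1 (Matmul-ReduceScatter fusion follows automatically)
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
# Prefetch MLP weights into L2 cache
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1

# Asynchronous scheduling is passed as a server flag (model path is a placeholder)
vllm serve /path/to/model --async-scheduling
```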
### 10.2 Optimization Highlights
**Content Writing Requirements:**
Summarize the most noteworthy optimization points during the actual tuning process, distill core experiences, and provide readers with tuning ideas for getting started quickly.
**Example:**
During the actual tuning process, the following points are most critical for performance improvement (see the sketch below for how they appear on the command line):
- The prefetch buffer size needs to be determined through empirical measurement to find the optimal overlap between computation and prefetching.
- The setting of `max-num-batched-tokens` needs to balance throughput against NPU memory, avoiding both excessive chunking and OOM risk.
- `cudagraph_capture_sizes` must be specified manually and must cover the target concurrency; when FlashComm_v1 is enabled, these values must also be multiples of the TP size.
- `pa_shape_list` is a temporary tuning parameter that only takes effect for specific batch sizes, so pay attention to version evolution and adjust it in time.
The coordinated configuration of the above parameters and environment variables is key to achieving peak performance.
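As a concrete illustration of how these knobs appear on the command line (the values are placeholders chosen for the example, not tuned recommendations):
```bash
# Illustrative tuning sketch: batched-token budget plus explicit graph capture sizes
# (capture sizes here are multiples of the TP size, as required when FlashComm_v1 is enabled)
vllm serve /path/to/model \
    --tensor-parallel-size 8 \
    --max-num-batched-tokens 16384 \
    --compilation-config '{"cudagraph_capture_sizes": [8, 16, 32, 64]}'
```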
## 11 FAQ
**Content Writing Requirements:**
Provide solutions to common problems, including but not limited to problem phenomenon description, cause analysis, and solution measures.

View File

@@ -2,7 +2,7 @@
## Committers
-| Name | Github ID | Date |
+| Name | GitHub ID | Date |
|:-----------:|:-----:|:-----:|
| Xiyuan Wang | [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
| Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |

View File

@@ -171,7 +171,7 @@ Notes:
## Software dependency management
-- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu)
+- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPI](https://pypi.org/project/torch-npu)
every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
-The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in
+The PyPI stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in
vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in vLLM Ascend any version or branch.

View File

@@ -35,7 +35,7 @@ The workflow of obtaining inputs:
At last, these `Token IDs` are required to be fed into a model, and `positions` should also be sent into the model to create `Rope` (Rotary positional embedding). Both of them are the inputs of the model.
-**Note**: The `Token IDs` are the inputs of a model, so we also call them `Inputs IDs`.
+**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
### 2. Build inputs attention metadata

View File

@@ -60,7 +60,7 @@ Before writing a patch, following the principle above, we should patch the least
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
-# How
+# How:
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>

View File

@@ -54,7 +54,7 @@ Based on the above content, we present a brief description of the adaptation pro
- **Step 2: Registration**. Use the `@register_scheme` decorator in `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class.
```python
-from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme
+from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme, AscendMoEScheme
@register_scheme("W4A8_DYNAMIC", "linear")
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):

View File

@@ -332,7 +332,6 @@ An L0 `dump.json` contains forward I/O for modules together with parameters. Usi
"data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
}
}
-},
}
}
}
@@ -389,7 +388,6 @@ An L1 `dump.json` records forward I/O for APIs. Using PyTorch's `relu` function
"data_name": "Functional.relu.0.forward.output.0.pt"
}
]
-},
}
}
}

View File

@@ -111,7 +111,7 @@ sudo apt update
sudo apt install libjemalloc2
# Configure jemalloc
-export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
+export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
```
#### 2.2. Tcmalloc

View File

@@ -97,7 +97,8 @@ For local `dataset-path`, please set `hf-name` to its Hugging Face ID like
First start serving your model:
```bash
-VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
+export VLLM_USE_MODELSCOPE=True
+vllm serve Qwen/Qwen3-8B
```
Then run the benchmarking script:
@@ -158,7 +159,7 @@ vllm bench throughput \
If successful, you will see the following output
```shell
-Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
+Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 toks/s
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens: 1280
Total num output tokens: 1280

View File

@@ -259,7 +259,7 @@ The performance of `torch_npu.npu_fused_infer_attention_score` in small batch sc
```bash
bash tools/install_flash_infer_attention_score_ops_a2.sh
-## change to run the following instruction if you're using A3 machine
+# change to run the following instruction if you're using A3 machine
# bash tools/install_flash_infer_attention_score_ops_a3.sh
```

View File

@@ -128,7 +128,7 @@ sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
-# Config pip mirror
+# Config pip mirror. Only versions 0.11.0 and earlier are supported; if using a version later than 0.11.0, do not execute this command
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```

View File

@@ -327,8 +327,6 @@ The parameters are explained as follows:
## Accuracy Evaluation
-Here are two accuracy evaluation methods.
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.

View File

@@ -135,8 +135,6 @@ The parameters are explained as follows:
## Accuracy Evaluation
-Here are two accuracy evaluation methods.
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.

View File

@@ -240,12 +240,12 @@ If you occasionally see `zmq.error.ZMQError: Address already in use` during star
### launch_online_dp.py
Use `launch_online_dp.py` to launch external dp vllm servers.
-[launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)
+[launch_online_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)
### run_dp_template.sh
Modify `run_dp_template.sh` on each node.
-[run\_dp\_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)
+[run_dp_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)
#### Layerwise

View File

@@ -1,10 +1,10 @@
# Prefill-Decode Disaggregation (Qwen2.5-VL)
-## Getting Start
+## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide takes one-by-one steps to verify these features with constrained resources.
-Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1.
+Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1.
## Verify Communication Environment

View File

@@ -133,7 +133,7 @@ models = [
```bash
# Example command to test gsm8k dataset performance using the first 100 prompts. Commands for other datasets are similar.
-ais_bench --models vllm_api_stream_chat \
+ais_bench --models vllm-api-stream-chat \
--datasets gsm8k_gen_0_shot_cot_str_perf \
--debug --summarizer default_perf --mode perf --num-prompts 100
```

View File

@@ -809,7 +809,7 @@ In this chapter, we recommend best practices for three scenarios:
- **Q: Startup fails with HCCL port conflicts (address already bound). What should I do?**
-A: Clean up old processes and restart: `pkill -f VLLM*`.
+A: Clean up old processes and restart: `pkill -f vLLM*`.
- **Q: How to handle OOM or unstable startup?**

View File

@@ -69,7 +69,7 @@ docker run --rm \
#### Single NPU (Qwen2.5-Omni-7B)
:::{note}
-The **environment variable** `LOCAL_MEDIA_PATH` which **allows** API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk. Should only be enabled in trusted environments.
+The environment variable `LOCAL_MEDIA_PATH` which allows API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk. Should only be enabled in trusted environments.
:::
@@ -128,7 +128,7 @@ Not supported yet.
## Functional Verification
-If your service **starts** successfully, you can see the info shown below:
+If your service starts successfully, you can see the info shown below:
```bash
INFO: Started server process [2736]

View File

@@ -283,7 +283,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```bash
-VLLM_USE_MODELSCOPE=True
+export VLLM_USE_MODELSCOPE=True
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-request --load-format dummy

View File

@@ -87,7 +87,6 @@ If you want to deploy multi-node environment, you need to set up environment on
### Single-node Deployment
-`Qwen3.5-397B-A17B` can be deployed on 2 Atlas 800 A3(64G*16) or 4 Atlas 800 A2(64G*8).
`Qwen3.5-397B-A17B-w8a8` can be deployed on 1 Atlas 800 A3(64G*16) or 2 Atlas 800 A2(64G*8), need to start with parameter `--quantization ascend`.
Run the following script to execute online 128k inference On 1 Atlas 800 A3(64G*16).
@@ -152,7 +151,7 @@ The parameters are explained as follows:
### Multi-node Deployment with MP (Recommended)
-Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5-397B-A17B` model across multiple nodes.
+Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5-397B-A17B-w8a8-mtp` model across multiple nodes.
Node 0
@@ -350,7 +349,7 @@ vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
}'
```
-3. Decode Node 0 `run_d0.sh` script
+2. Decode Node 0 `run_d0.sh` script
```shell
unset ftp_proxy
@@ -430,7 +429,7 @@ vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
}'
```
-5. Decode Node 1 `run_d1.sh` script
+3. Decode Node 1 `run_d1.sh` script
```shell
unset ftp_proxy
@@ -517,7 +516,7 @@ The parameters are explained as follows:
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously.
- `no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)
-7. Run the `proxy.sh` script on the prefill master node
+4. Run the `proxy.sh` script on the prefill master node
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

View File

@@ -100,7 +100,7 @@ model_name = "Qwen/Qwen3-Reranker-8B"
# It needs to computing 151669 tokens logits, making this method extremely
# inefficient, not to mention incompatible with the vllm score API.
# A method for converting the original model into a sequence classification
-# model was proposed. Seehttps://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
+# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline using this method can not only be more efficient
# and support the vllm score API, but also make the init parameters more
# concise, for example.

View File

@@ -1,4 +1,4 @@
-# Fine-Grained Tensor Parallelism (Finegrained TP)
+# Fine-Grained Tensor Parallelism (Fine-grained TP)
## Overview
@@ -8,7 +8,7 @@ This capability supports heterogeneous parallelism strategies within a single mo
---
-## Benefits of Finegrained TP
+## Benefits of Fine-grained TP
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
@@ -53,7 +53,7 @@ The Fine-Grained TP size for any component must:
---
-## How to Use Finegrained TP
+## How to Use Fine-grained TP
### Configuration Format

View File

@@ -72,6 +72,7 @@ For best results, if you run inside a docker container, which `systemctl` is lik
- **Stop `irqbalance` service**:
For example, on Ubuntu system, you can run the following command to stop irqbalance:
```bash
sudo systemctl stop irqbalance
```

View File

@@ -56,12 +56,12 @@ All related code is under `vllm/distributed/ec_transfer`.
* *Scheduler role* checks cache existence and schedules loads.
* *Worker role* loads the embeddings into memory.
-* **EPD Load Balance Proxy** -
+* **EPD Load Balancing Proxy** -
* *Multi-Path Scheduling Strategy* - dynamically diverts the multimodal request or text requests to the corresponding inference path
* *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
-[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
+[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
For the PD disaggregation part, when using MooncakeLayerwiseConnector: The request first enters the Decoder instance,the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes KV Cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation.
`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.

View File

@@ -4,7 +4,7 @@ For larger-scale deployments especially, it can make sense to handle the orchest
In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
-## Getting Start
+## Getting Started
The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) is already natively supported by vLLM. In vllm-ascend we provide two enhanced functionalities:

View File

@@ -163,7 +163,7 @@ export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
-#Minimum retransmission timeout of the RDMAequals 4.096 μs * 2 ^ timeout.
+#Minimum retransmission timeout of the RDMA, equals 4.096 μs * 2 ^ timeout.
#Needs to satisfy the equation: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retry for RDMA transfer.
#HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully.
export HCCL_RDMA_TIMEOUT=17

View File

@@ -1,6 +1,6 @@
# Distributed DP Server With Large-Scale Expert Parallelism
-## Getting Start
+## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of PD nodes, thereby enabling more flexible model deployment. \
Taking the DeepSeek model as an example, using 8 Atlas 800T A3 servers to deploy the model. Assume the IP of the servers starts from 192.0.0.1 and ends by 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. And the prefiller nodes are deployed as master nodes independently, while the decoder nodes use the 192.0.0.5 node as the master node.

View File

@@ -68,7 +68,7 @@ This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow th
- [Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. [#6806](https://github.com/vllm-project/vllm-ascend/pull/6806)
- Qwen3-Next now supports FlashComm1. [#6830](https://github.com/vllm-project/vllm-ascend/pull/6830)
- NPUWorker Profiler now supports profile_prefix for better profiling experience. [#6968](https://github.com/vllm-project/vllm-ascend/pull/6968)
-- EPLB profiling now displays expert hotness comparison and time required for eplb adjustment. [#6877](https://github.com/vllm-project/vllm-ascend/pull/6877) [#7001](https://github.com/vllm-project/vllm-ascend/pull/7001)]
+- EPLB profiling now displays expert hotness comparison and time required for eplb adjustment. [#6877](https://github.com/vllm-project/vllm-ascend/pull/6877) [#7001](https://github.com/vllm-project/vllm-ascend/pull/7001)
- Xlite Qwen3 MoE now supports Data Parallel. [#6715](https://github.com/vllm-project/vllm-ascend/pull/6715)
- Mooncake Layerwise Connector now supports kv_pool. [#7032](https://github.com/vllm-project/vllm-ascend/pull/7032)
- Eagle3 now supports QuaRot quantization without embedding. [#7038](https://github.com/vllm-project/vllm-ascend/pull/7038)
@@ -143,7 +143,7 @@ This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow th
- Currently, for DeepSeek v3.2, PCP & DCP do not yet work with FlashComm1 feature, which may cause serve errors or other unknown errors.
- In 4-node A3 PD disaggregation deployment with DeepSeek V3.2, the P-Node may hang when benchmarking in high concurrency scenario, e.g., 2K/2K tokens with 512 concurrent requests.
-- MTP with large EP configurations may cause graph capture buffer overflow. This is a bug need to fix in vLLM, now there is a workaround to avoid it: explicitly set `--compilation-config '{"max_cudagraph_capture_size": N}'` where `N = max_concurrency × (1 + num_speculative_tokens)`.
+- MTP with large EP configurations may cause graph capture buffer overflow. This is a bug need to fix in vLLM, now there is a workaround to avoid it: explicitly set `--compilation-config '{"max_cudagraph_capture_size": N}'` where `N = max_concurrency * (1 + num_speculative_tokens)`.
## v0.15.0rc1 - 2026.02.27
@@ -575,7 +575,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/Design_Documents/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] A new graph mode `xlite` is introduced. It performs good with some models. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it.
- LLMdatadist kv connector is removed. Please use mooncake connector instead.
-- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}` doesn't work anymore.
+- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}}` doesn't work anymore.
- Torchair graph mode is removed. `--additional-config {"torchair_graph_config": {"enabled": true}}` doesn't work anymore. Please use aclgraph instead.
- `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough. We enable it by default now.
- speculative decode method `Ngram` is back now.
@@ -585,7 +585,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
### Upgrade Note
- vLLM Ascend self maintained modeling file has been removed. The related python entrypoint is removed as well. So please uninstall the old version of vLLM Ascend in your env before upgrade.
-- CANN is upgraded to 8.3.RC2, Pytorch and torch-npu are upgraded to 2.8.0. Don't forget to install them.
+- CANN is upgraded to 8.3.RC2, PyTorch and torch-npu are upgraded to 2.8.0. Don't forget to install them.
- Python 3.9 support is dropped to keep the same with vLLM v0.12.0
### Known Issues
@@ -683,7 +683,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
- For long sequence input case, there is no response sometimes and the kv cache usage is become higher. This is a bug for scheduler. We are working on it.
- Qwen2-audio doesn't work by default, we're fixing it. Temporary solution is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
- When running Qwen3-Next with expert parallel enabled, please set `HCCL_BUFFSIZE` environment variable to a suitable value, such as 1024.
-- The accuracy of DeepSeek3.2 with aclgraph is not correct. Temporary solution is to set `cudagraph_capture_sizes` to a suitable value depending on the batch size for the input.
+- The accuracy of DeepSeek3.2 with aclgraph is not correct. Temporary solution is to set `agraph_capture_sizes` to a suitable value depending on the batch size for the input.
## v0.11.0rc0 - 2025.09.30
@@ -699,7 +699,7 @@ This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow
- DeepSeek works with aclgraph now. [#2707](https://github.com/vllm-project/vllm-ascend/pull/2707)
- MTP works with aclgraph now. [#2932](https://github.com/vllm-project/vllm-ascend/pull/2932)
- EPLB is supported now. [#2956](https://github.com/vllm-project/vllm-ascend/pull/2956)
-- Mooncacke store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
+- Mooncake store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
- CPU offload connector is supported now. [#1659](https://github.com/vllm-project/vllm-ascend/pull/1659)
### Others
@@ -828,7 +828,7 @@ Please note that this release note will list all the important changes from last
The following notes are especially for reference when upgrading from last final release (v0.7.3):
- V0 Engine is not supported from this release. Please always set `VLLM_USE_V1=1` to use V1 engine with vLLM Ascend.
-- Mindie Turbo is not needed with this release. And the old version of Mindie Turbo is not compatible. Please do not install it. Currently all the function and enhancement is included in vLLM Ascend already. We'll consider to add it back in the future in needed.
+- Mindie Turbo is not needed with this release. And the old version of Mindie Turbo is not compatible. Please do not install it. Currently all the function and enhancement is included in vLLM Ascend already. We'll consider to add it back in the future if needed.
- Torch-npu is upgraded to 2.5.1.post1. CANN is upgraded to 8.2.RC1. Don't forget to upgrade them.
### Core
@@ -893,7 +893,7 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the
### Core
-- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN hase been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Dont forget to update them in your environment or using the latest images.
+- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN has been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Dont forget to update them in your environment or using the latest images.
- vLLM Ascend works on Atlas 800I A3 now, and the image on A3 will be released from this version on. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
- Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 is supported in vLLM Ascend, please following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.10.0rc1/docs/source/tutorials/multi_node_kimi.md) to have a try. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
- Pipeline Parallelism is supported in V1 now. [#1800](https://github.com/vllm-project/vllm-ascend/pull/1800)
@@ -1055,7 +1055,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
## v0.9.2rc1 - 2025.07.11
-This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, V1 engine will be enabled by default, there is no need to set `VLLM_USE_V1=1` any more. And this release is the last version to support V0 engine, V0 code will be clean up in the future.
+This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, V1 engine will be enabled by default, there is no need to set `VLLM_USE_V1=1` any more. And this release is the last version to support V0 engine, V0 code will be cleaned up in the future.
### Highlights
@@ -1074,7 +1074,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
### Others
- Official doc has been updated for better read experience. For example, more deployment tutorials are added, user/developer docs are updated. More guide will coming soon.
-- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
+- Fix accuracy problem for Deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
- A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335)
- A new env variable `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` has been added to improve the performance of topk-topp sampling. The default value is 0, we'll consider to enable it by default in the future[#1732](https://github.com/vllm-project/vllm-ascend/pull/1732)
- A batch of bugs have been fixed for Data Parallelism case [#1273](https://github.com/vllm-project/vllm-ascend/pull/1273) [#1322](https://github.com/vllm-project/vllm-ascend/pull/1322) [#1275](https://github.com/vllm-project/vllm-ascend/pull/1275) [#1478](https://github.com/vllm-project/vllm-ascend/pull/1478)
@@ -1233,7 +1233,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
### Highlights
-- This release includes all features landed in the previous release candidates ([v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2)). And all the features are fully tested and verified. Visit the official doc the get the detail [feature](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/suppoted_features.html) and [model](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/supported_models.html) support matrix.
+- This release includes all features landed in the previous release candidates ([v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2)). And all the features are fully tested and verified. Visit the official doc to get the detail [feature](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/suppoted_features.html) and [model](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/supported_models.html) support matrix.
- Upgrade CANN to 8.1.RC1 to enable chunked prefill and automatic prefix caching features. You can now enable them now.
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu now. Now users don't need to install the torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. [#662](https://github.com/vllm-project/vllm-ascend/pull/662)
- Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. [#708](https://github.com/vllm-project/vllm-ascend/pull/708)
@@ -1256,7 +1256,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
## v0.8.5rc1 - 2025.05.06
-This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.5rc1) to start the journey. Now you can enable V1 egnine by setting the environment variable `VLLM_USE_V1=1`, see the feature support status of vLLM Ascend in [supported_features](https://github.com/vllm-project/vllm-ascend/blob/v0.8.5rc1/docs/source/user_guide/suppoted_features.md).
+This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.5rc1) to start the journey. Now you can enable V1 engine by setting the environment variable `VLLM_USE_V1=1`, see the feature support status of vLLM Ascend in [supported_features](https://github.com/vllm-project/vllm-ascend/blob/v0.8.5rc1/docs/source/user_guide/suppoted_features.md).
### Highlights