[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
This commit is contained in:
herizhen
2026-04-09 15:37:57 +08:00
committed by GitHub
parent c40a387f63
commit 0d1424d81a
71 changed files with 1295 additions and 1296 deletions

View File

@@ -13,7 +13,7 @@ This capability supports heterogeneous parallelism strategies within a single mo
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
- **Reduced Per-Device Memory Footprint**:
Fine-grained TP shards large weight matrices( LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
- **Faster Memory Access in GEMMs**:
In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
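To make the memory-footprint point concrete, here is a back-of-the-envelope sketch; the vocabulary size, hidden size, dtype, and device count are illustrative assumptions rather than measured values:
```bash
# Illustrative only: an assumed 152064 x 5120 bf16 LM Head weight, 8-way fine-grained TP.
FULL_MIB=$(( 152064 * 5120 * 2 / 1024 / 1024 ))   # ~1485 MiB per device when fully replicated
SHARDED_MIB=$(( FULL_MIB / 8 ))                   # ~185 MiB per device when sharded 8 ways
echo "replicated: ${FULL_MIB} MiB, sharded: ${SHARDED_MIB} MiB"
```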

View File

@@ -18,9 +18,8 @@ Batch invariance is crucial for several use cases:
## Hardware Requirements
Batch invariance currently requires Ascend NPUs for 910B,
because only 910B supports batch invariance with HCCL communication for now,
we will support other NPUs in the future.
Batch invariance currently requires Ascend 910B NPUs, because only the 910B supports batch invariance with HCCL communication for now.
We will support other NPUs in the future.
## Software Requirements

View File

@@ -11,7 +11,7 @@ Context parallel mainly solves the problem of serving long context requests. As
- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batch size (and hence the throughput).
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/Design_Documents/context_parallel.md).
## Supported Scenarios

View File

@@ -11,15 +11,16 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites
1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should be first downloaded from [A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
2. `Pandas` is needed to load the lookup table. In case pandas is not installed, run:
```bash
```bash
pip install pandas
```
### Tuning Parameters
`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature; larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
```python

View File

@@ -6,30 +6,28 @@ A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in
1. **Independent, fine-grained scaling**
* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.
* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.
2. **Lower time-to-first-token (TTFT)**
* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
3. **Cross-process reuse and caching of encoder outputs**
* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
Design doc: <
<https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
>
Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
---
## Usage
The current reference pathway is **ExampleConnector**.
Below ready-to-run scripts shows the workflow:
The ready-to-run scripts below show the workflow:
1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`
@@ -45,7 +43,7 @@ Below ready-to-run scripts shows the workflow:
Disaggregated encoding is implemented by running two parts:
* **Encoder instance** a vLLM instance to performs vision encoding.
* **Encoder instance** a vLLM instance that performs vision encoding.
* **Prefill/Decode (PD) instance(s)** runs language pre-fill and decode.
* PD can run either as a single normal instance (E + PD) or as disaggregated instances (E + P + D).
@@ -62,11 +60,11 @@ All related code is under `vllm/distributed/ec_transfer`.
* *Multi-Path Scheduling Strategy* - dynamically routes multimodal or text-only requests to the corresponding inference path
* *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and referred to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
For the PD disaggregation part, when using MooncakeLayerwiseConnector, the request first enters the Decoder instance, and the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes the KV cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation.
`docs/source/developer_guide/feature_guide/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
## Limitations

View File

@@ -83,7 +83,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
3. Model Compatibility:
- Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE models) are compatible.
- Verify model architecture supports dynamic expert routing through --enable-expert-parallel.
- Verify model architecture supports dynamic expert routing through `--enable-expert-parallel`.
4. Monitoring & Validation:
- Track metrics: `expert_load_balance_ratio`, `ttft_p99`, `tpot_avg`, and `gpu_utilization`.
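As a point of reference, a minimal launch sketch with expert parallelism enabled might look like the following; the model name and tensor-parallel size are illustrative, not a recommended configuration:
```bash
# Sketch: enable dynamic expert routing for a MoE model (values are illustrative).
vllm serve Qwen/Qwen3-30B-A3B \
    --tensor-parallel-size 4 \
    --enable-expert-parallel
```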

View File

@@ -65,7 +65,7 @@ Online example:
vllm serve path/to/Qwen3-32B --tensor-parallel-size 8 --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'
```
You can find more details about Xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
You can find more details about [Xlite](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md).
## Fallback to the Eager Mode

View File

@@ -19,41 +19,41 @@ Taking the DeepSeek model as an example, using 8 Atlas 800T A3 servers to deploy
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
```bash
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
3. Get superpodid and SDID
```bash
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```
```bash
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
::::
@@ -61,35 +61,35 @@ for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
3. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
::::
@@ -426,59 +426,59 @@ In the PD separation scenario, we provide an optimized configuration.
2. Add the '--enforce-eager' flag to the 'vllm serve' command
3. Set '--kv-transfer-config' as follows:
```shell
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
```shell
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
4. Set '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
```
```shell
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
```
- **decoder node**
1. Set HCCL_BUFFSIZE=1024
2. Set '--kv-transfer-config' as follows:
```shell
--kv-transfer-config
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
```shell
--kv-transfer-config
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
3. Set '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true}'
```
```shell
--additional-config '{"enable_weight_nz_layout":true}'
```
### Parameters Description
1. '--additional-config' Parameter Introduction:
- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
<br>
- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
<br>
2. Enable MTP
Add the following option to your configuration.
```shell
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
```
```shell
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
```
### Recommended Configuration Example

View File

@@ -56,16 +56,16 @@ Assuming your working directory is ```/workspace``` and vllm/vllm-ascend have al
1. Install LMCache Repo
```bash
NO_CUDA_EXT=1 pip install lmcache==0.3.12
```
```bash
NO_CUDA_EXT=1 pip install lmcache==0.3.12
```
2. Install LMCache-Ascend Repo
```bash
cd /workspace/LMCache-Ascend
python3 -m pip install -v --no-build-isolation -e .
```
```bash
cd /workspace/LMCache-Ascend
python3 -m pip install -v --no-build-isolation -e .
```
### Usage

View File

@@ -32,4 +32,4 @@ vllm serve Qwen/Qwen2-7B-Instruct
--additional-config '{"ascend_compilation_config":{"enable_npugraph_ex":true, "enable_static_kernel":false}}'
```
You can find more details about npugraph_ex [here](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)
You can find more details about [npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html).

View File

@@ -14,11 +14,11 @@ Currently, vllm-ascend has implemented Sequence Parallelism for VL-class models
```bash
vllm serve Qwen/Qwen3-VL-2B-Instruct \
--tensor-parallel-size 2 \
--compilation-config '{"pass_config": {"enable_sp": true, , "sp_min_token_num": 1000}}'
--compilation-config '{"pass_config": {"enable_sp": true , "sp_min_token_num": 1000}}'
```
- `"enable_sp"`: This is the switch for SP. Since SP relies on graph mode, it is not supported in eager mode.
- `sp_min_token_num` (from upstream vllm's `pass_config`): Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually bring negative benefits. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. SP will only take effect when `num_tokens >= sp_min_token_num`. **The default value is 1000 on Ascend, which generally does not need to be modified.** To customize, use `--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'`. The value will be appended into `compile_ranges_split_points`, which splits the graph compilation range and checks whether the pass is applicable per range.
- `sp_min_token_num` (from upstream vllm's `pass_config`): Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually have a negative impact. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. SP will only take effect when `num_tokens >= sp_min_token_num`. **The default value is 1000 on Ascend, which generally does not need to be modified.** To customize, use `--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'`. The value will be appended to `compile_ranges_split_points`, which splits the graph compilation range and checks whether the pass is applicable per range.
Without modifying `sp_min_token_num`, the simplest and recommended way to enable SP is:
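For example, a sketch along these lines should suffice (the model name and tensor-parallel size are illustrative; only `enable_sp` is set and `sp_min_token_num` keeps its default):
```bash
vllm serve Qwen/Qwen3-VL-2B-Instruct \
    --tensor-parallel-size 2 \
    --compilation-config '{"pass_config": {"enable_sp": true}}'
```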

View File

@@ -91,7 +91,7 @@ A few important things to consider when using the EAGLE based draft models:
## Speculating using MTP speculators
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html)
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/Multi_Token_Prediction.html)
- Online inference
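For example, a minimal online-serving sketch might look like the following; the model name and tensor-parallel size are illustrative, and the `--speculative-config` values mirror the MTP example shown earlier in this document:
```bash
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 16 \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}'
```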

View File

@@ -2,7 +2,7 @@
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution. Linear layers sometimes exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as quantize, MoE gating top_k, RMSNorm and SwiGlu. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the linear layer computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
Since we use vector computations to hide the weight prefetching pipeline, it has effect on computation, if you prioritize low latency over high throughput, then it is best not to enable prefetching.
Since we use vector computations to hide the weight prefetching pipeline, this has an effect on computation. If you prioritize low latency over high throughput, it is best not to enable prefetching.
## Quick Start
@@ -35,39 +35,39 @@ Notes:
1) For MoE model:
```shell
--additional-config \
'{
"weight_prefetch_config": {
"enabled": true,
"prefetch_ratio": {
"attn": {
"qkv": 1.0,
"o": 1.0
},
"moe": {
"gate_up": 0.8
```shell
--additional-config \
'{
"weight_prefetch_config": {
"enabled": true,
"prefetch_ratio": {
"attn": {
"qkv": 1.0,
"o": 1.0
},
"moe": {
"gate_up": 0.8
}
}
}
}
}'
```
}'
```
2) For dense model:
Following is the default configuration that can get a good performance for `--max-num-seqs` is 72 for Qwen3-32B-W8A8
The following is the default configuration, which gives good performance with `--max-num-seqs` set to 72 for Qwen3-32B-W8A8:
```shell
--additional-config \
'{
"weight_prefetch_config": {
"enabled": true,
"prefetch_ratio": {
"mlp": {
"gate_up": 1.0,
"down": 1.0
```shell
--additional-config \
'{
"weight_prefetch_config": {
"enabled": true,
"prefetch_ratio": {
"mlp": {
"gate_up": 1.0,
"down": 1.0
}
}
}
}
}'
```
}'
```

View File

@@ -49,9 +49,9 @@ This is the first release candidate of v0.17.0 for vLLM Ascend. Please follow th
### Known Issue
- GLM5 requires transformers==5.2.0, and this will resolved by [vllm-project/vllm#30566](https://github.com/vllm-project/vllm/pull/30566), will not included in v0.17.0.
- GLM5 requires transformers==5.2.0; this will be resolved by [vllm-project/vllm#30566](https://github.com/vllm-project/vllm/pull/30566) and will not be included in v0.17.0.
- There is a precision issue with Qwen3-Next due to the changed tp weight split method. This will be fixed in the next release.
- The minimum number of tokens of prefix cache hit in hybrid model is large now. The exact number is related to tp size, e.g., with tp 2, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached.
- In hybrid models, the minimum token count required for a prefix cache hit is currently large. The exact number is related to tp size, e.g., with tp 2, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached.
- GLM5 has an issue in the 2-node PD mixed deployment scenario where inference may hang when concurrency exceeds 8 (fixed in PR [#7235](https://github.com/vllm-project/vllm-ascend/pull/7235) [#7290](https://github.com/vllm-project/vllm-ascend/pull/7290)).
## v0.16.0rc1 - 2026.03.09
@@ -96,7 +96,7 @@ This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow th
### Deprecation & Breaking Changes
- `enable_flash_comm_v1` config option has been renamed back to `enable_sp`. [#6883](https://github.com/vllm-project/vllm-ascend/pull/6883)
- The auto-detect quantization format from model files is reverted, in v0.16.0rc1, we still need to add `--quantization ascend` to serve a model quantinized by modelslim. It will be added back in the next version after the bug with the remote model id is fixed. [#6873](https://github.com/vllm-project/vllm-ascend/pull/6873)
- The auto-detection of the quantization format from model files is reverted; in v0.16.0rc1, we still need to add `--quantization ascend` to serve a model quantized by modelslim. It will be added back in the next version after the bug with the remote model id is fixed. [#6873](https://github.com/vllm-project/vllm-ascend/pull/6873)
### Documentation
@@ -448,7 +448,7 @@ We mainly focus on quality and performance improvement in this release. The spec
- `CPUOffloadingConnector` is deprecated. We'll remove it in the next release. It'll be replaced by CPUOffload feature from vLLM in the future.
- eplb config options are moved to `eplb_config` in [additional config](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/configuration/additional_config.html). The old ones will be removed in the next release.
- `ProfileExecuteDuration` [feature](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/profile_execute_duration.html) is deprecated. It's replaced by `ObservabilityConfig` from vLLM.
- `ProfileExecuteDuration` [feature](https://github.com/vllm-project/vllm-ascend/blob/v0.13.0rc2/docs/source/developer_guide/performance_and_debug/profile_execute_duration.md) is deprecated. It's replaced by `ObservabilityConfig` from vLLM.
- The value of the `VLLM_ASCEND_ENABLE_MLAPO` env will be set to True by default in the next release. It'll be enabled on the decode node by default. Please note that this feature will consume more memory. If you are memory-sensitive, please set it to False.
## v0.13.0rc1 - 2025.12.27
@@ -457,8 +457,8 @@ This is the first release candidate of v0.13.0 for vLLM Ascend. We landed lots o
### Highlights
- Improved the performance of DeepSeek V3.2, please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html)
- Qwen3-Next MTP with chunked prefill is supported now [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770), please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen3-Next.html)
- Improved the performance of DeepSeek V3.2, please refer to [tutorials](https://github.com/vllm-project/vllm-ascend/blob/v0.13.0rc1/docs/source/tutorials/DeepSeek-V3.2.md)
- Qwen3-Next MTP with chunked prefill is supported now [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770), please refer to [tutorials](https://github.com/vllm-project/vllm-ascend/blob/v0.13.0rc1/docs/source/tutorials/Qwen3-Next.md)
- [Experimental] Prefill Context Parallel and Decode Context Parallel are supported. Note that this is an experimental feature for now, and any feedback is welcome. Please refer to the [context parallel feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/context_parallel.html)
### Features
@@ -552,7 +552,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
### Highlights
- DeepSeek 3.2 is stable and performance is improved. In this release, you don't need to install any other packages now. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html) to start using it.
- DeepSeek 3.2 is stable and its performance is improved. In this release, you don't need to install any other packages. Follow the [official tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.12.0rc1/docs/source/tutorials/DeepSeek-V3.2.md) to start using it.
- Async scheduler is more stable and ready to enable now. Please set `--async-scheduling` to enable it.
- More new models, such as Qwen3-omni, DeepSeek OCR, PaddleOCR, and OpenCUA, are supported now.
@@ -572,7 +572,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
### Other
- [Experimental] Mooncake layerwise connector is supported now.
- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/Design_Documents/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] A new graph mode `xlite` is introduced. It performs well with some models. Follow the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it.
- LLMdatadist kv connector is removed. Please use mooncake connector instead.
- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}}` doesn't work anymore.
@@ -580,7 +580,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
- `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough. We enable it by default now.
- Speculative decode method `Ngram` is back now.
- msprobe tool is added to help users check the model accuracy. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/msprobe_guide.html) to get started.
- msserviceprofiler tool is added to help user to profile the model performance. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/service_profiling_guide.html) to get started.
- msserviceprofiler tool is added to help users profile the model performance. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/blob/v0.12.0rc1/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md) to get started.
### Upgrade Note
@@ -658,12 +658,12 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
- CANN is upgraded to 8.3.RC1. Torch-npu is upgraded to 2.7.1. [#3945](https://github.com/vllm-project/vllm-ascend/pull/3945) [#3896](https://github.com/vllm-project/vllm-ascend/pull/3896)
- PrefixCache and Chunked Prefill are enabled by default. [#3967](https://github.com/vllm-project/vllm-ascend/pull/3967)
- W4A4 quantization is supported now. [#3427](https://github.com/vllm-project/vllm-ascend/pull/3427) Official tutorial is available at [here](<https://docs.vllm.ai/projects/ascend/en/latest/tutorials/single_npu_qwen3_w4a4.html>.
- W4A4 quantization is supported now. [#3427](https://github.com/vllm-project/vllm-ascend/pull/3427) The official tutorial is available at [single_npu_qwen3_w4a4](https://github.com/vllm-project/vllm-ascend/pull/4076/changes).
### Core
- Performance of Qwen3 and Deepseek V3 series models is improved.
- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html).
- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [pd_disaggregation_mooncake_multi_node](https://github.com/vllm-project/vllm-ascend/blob/v0.11.0rc1/docs/source/tutorials/multi_node_pd_disaggregation_mooncake.md).
- MTP > 1 is supported now. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708)
- [Experimental] Graph mode `FULL_DECODE_ONLY` is supported now! And `FULL` will be landing in the next few weeks. [#2128](https://github.com/vllm-project/vllm-ascend/pull/2128)
- Pooling models, such as bge-m3, are supported now. [#3171](https://github.com/vllm-project/vllm-ascend/pull/3171)
@@ -687,7 +687,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
## v0.11.0rc0 - 2025.09.30
This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.11.0rc0) to get started.
### Highlights
@@ -711,11 +711,11 @@ This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow
## v0.10.2rc1 - 2025.09.16
This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.10.2rc1) to get started.
### Highlights
- Added support for Qwen3-Next. Please note that the expert parallel and MTP features do not work with this release. We will be adding support for them soon. Follow the [official guide](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_npu_qwen3_next.html) to get started. [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Added support for Qwen3-Next. Please note that the expert parallel and MTP features do not work with this release. We will be adding support for them soon. Follow the [official guide](https://github.com/vllm-project/vllm-ascend/blob/v0.10.2rc1/docs/source/tutorials/multi_npu_qwen3_next.md) to get started. [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Added quantization support for aclgraph [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)
### Core
@@ -754,7 +754,7 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
## v0.10.1rc1 - 2025.09.04
This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.10.1rc1) to get started.
### Highlights
@@ -808,15 +808,15 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
We are excited to announce the newest official release of vLLM Ascend. This release includes many feature additions, performance improvements and bug fixes. We recommend users upgrade from 0.7.3 to this version. Please always set `VLLM_USE_V1=1` to use the V1 engine.
In this release, we added many enhancements for large scale expert parallel case. It's recommended to follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html).
In this release, we added many enhancements for the large-scale expert parallel case. It's recommended to follow the [official guide](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1/docs/source/tutorials/large_scale_ep.md).
Please note that this release note lists all the important changes since the last official release (v0.7.3).
### Highlights
- DeepSeek V3/R1 is supported with high quality and performance. MTP can work with DeepSeek as well. Please refer to [muliti node tutorials](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/multi_node.html) and [Large Scale Expert Parallelism](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html).
- DeepSeek V3/R1 is supported with high quality and performance. MTP can work with DeepSeek as well. Please refer to [multi node tutorials](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/multi_node.html) and [Large Scale Expert Parallelism](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1/docs/source/tutorials/large_scale_ep.md).
- Qwen series models work with graph mode now. It works by default with V1 Engine. Please refer to [Qwen tutorials](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/index.html).
- Disaggregated Prefilling support for V1 Engine. Please refer to [Large Scale Expert Parallelism](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html) tutorials.
- Disaggregated Prefilling support for V1 Engine. Please refer to [Large Scale Expert Parallelism](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1/docs/source/tutorials/large_scale_ep.md) tutorials.
- Automatic prefix caching and chunked prefill features are supported.
- Speculative decoding feature works with the Ngram and MTP methods.
- MoE and dense w4a8 quantization are supported now. Please refer to the [quantization guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/user_guide/feature_guide/quantization.html).
@@ -884,18 +884,18 @@ This is the 3rd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
## v0.10.0rc1 - 2025.08.07
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started. V0 is completely removed from this version.
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.10.0rc1) to get started. V0 is completely removed from this version.
### Highlights
- Disaggregate prefill works with V1 engine now. You can take a try with DeepSeek model [#950](https://github.com/vllm-project/vllm-ascend/pull/950), following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md).
- Disaggregated prefill works with the V1 engine now. You can try it with the DeepSeek model [#950](https://github.com/vllm-project/vllm-ascend/pull/950), following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.10.0rc1/examples/disaggregated_prefill_v1/README.md).
- W4A8 quantization method is supported for dense and MoE model now. [#2060](https://github.com/vllm-project/vllm-ascend/pull/2060) [#2172](https://github.com/vllm-project/vllm-ascend/pull/2172)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN has been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Don't forget to update them in your environment or use the latest images.
- vLLM Ascend works on Atlas 800I A3 now, and the image on A3 will be released from this version on. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
- Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 is supported in vLLM Ascend, please following this [tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node_kimi.md.html) to have a try. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
- Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 are supported in vLLM Ascend; please follow this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.10.0rc1/docs/source/tutorials/multi_node_kimi.md) to try them out. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
- Pipeline Parallelism is supported in V1 now. [#1800](https://github.com/vllm-project/vllm-ascend/pull/1800)
- Prefix cache feature now works with the Ascend Scheduler. [#1446](https://github.com/vllm-project/vllm-ascend/pull/1446)
- Torchair graph mode works with tp > 4 now. [#1508](https://github.com/vllm-project/vllm-ascend/issues/1508)
@@ -1055,7 +1055,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
## v0.9.2rc1 - 2025.07.11
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started. From this release, V1 engine will be enabled by default, there is no need to set `VLLM_USE_V1=1` any more. And this release is the last version to support V0 engine, V0 code will be clean up in the future.
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` anymore. This release is the last version to support the V0 engine; V0 code will be cleaned up in the future.
### Highlights
@@ -1106,7 +1106,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
## v0.9.1rc1 - 2025.06.22
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.1rc1) to get started.
### Experimental
@@ -1167,7 +1167,7 @@ This release contains some quick fixes for v0.9.0rc1. Please use this release in
## v0.9.0rc1 - 2025.06.09
This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. From this release, V1 Engine is recommended to use. The code of V0 Engine is frozen and will not be maintained any more. Please set environment `VLLM_USE_V1=1` to enable V1 Engine.
This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.0rc1) to start the journey. From this release, the V1 Engine is recommended. The code of the V0 Engine is frozen and will not be maintained any more. Please set the environment variable `VLLM_USE_V1=1` to enable the V1 Engine.
### Highlights
@@ -1256,7 +1256,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
## v0.8.5rc1 - 2025.05.06
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. Now you can enable V1 egnine by setting the environment variable `VLLM_USE_V1=1`, see the feature support status of vLLM Ascend in [here](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html).
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.5rc1) to start the journey. Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend in [supported_features](https://github.com/vllm-project/vllm-ascend/blob/v0.8.5rc1/docs/source/user_guide/suppoted_features.md).
### Highlights
@@ -1282,12 +1282,12 @@ This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [
## v0.8.4rc2 - 2025.04.29
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.4rc2) to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.
### Highlights
- Qwen3 and Qwen3MOE is supported now. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/single_npu.html) to run the quick demo. [#709](https://github.com/vllm-project/vllm-ascend/pull/709)
- Ascend W8A8 quantization method is supported now. Please take the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_npu_quantization.html) for example. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/619) is welcome. [#580](https://github.com/vllm-project/vllm-ascend/pull/580)
- Qwen3 and Qwen3MOE are supported now. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/blob/v0.8.4rc2/docs/source/tutorials/single_npu.md) to run the quick demo. [#709](https://github.com/vllm-project/vllm-ascend/pull/709)
- Ascend W8A8 quantization method is supported now. Please see the [official doc](https://github.com/vllm-project/vllm-ascend/blob/v0.8.4rc2/docs/source/tutorials/multi_npu_quantization.md) for an example. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/619) is welcome. [#580](https://github.com/vllm-project/vllm-ascend/pull/580)
- DeepSeek V3/R1 works with DP, TP and MTP now. Please note that it's still in experimental status. Let us know if you hit any problem. [#429](https://github.com/vllm-project/vllm-ascend/pull/429) [#585](https://github.com/vllm-project/vllm-ascend/pull/585) [#626](https://github.com/vllm-project/vllm-ascend/pull/626) [#636](https://github.com/vllm-project/vllm-ascend/pull/636) [#671](https://github.com/vllm-project/vllm-ascend/pull/671)
### Core
@@ -1305,18 +1305,18 @@ This is the second release candidate of v0.8.4 for vllm-ascend. Please follow th
## v0.8.4rc1 - 2025.04.18
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the detail from the [official documentation](https://docs.vllm.ai/projects/ascend/en/latest/community/versioning_policy.html#release-window).
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.4rc1) to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the detail from the [official documentation](https://docs.vllm.ai/projects/ascend/en/latest/community/versioning_policy.html#release-window).
### Highlights
- vLLM V1 engine experimental support is included in this version. You can visit [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more detail. By default, vLLM will fallback to V0 if V1 doesn't work, please set `VLLM_USE_V1=1` environment if you want to use V1 forcibly.
- LoRAMulti-LoRA And Dynamic Serving is supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/latest/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
- vLLM V1 engine experimental support is included in this version. You can visit the [official guide](https://docs.vllm.ai/en/v0.8.4/getting_started/v1_user_guide.html) for more detail. By default, vLLM will fall back to V0 if V1 doesn't work; please set the `VLLM_USE_V1=1` environment variable if you want to force V1.
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/v0.8.4/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
- Sleep Mode feature is supported. Currently it only works on V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)
### Core
- The Ascend scheduler is added for the V1 engine. This scheduler has better affinity with Ascend hardware. More scheduler policies will be added in the future. [#543](https://github.com/vllm-project/vllm-ascend/pull/543)
- Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by vllm team. vllm-ascend will support it once it's ready from vLLM. Follow the [official guide](https://docs.vllm.ai/en/latest/features/disagg_prefill.html) to use. [#432](https://github.com/vllm-project/vllm-ascend/pull/432)
- Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by vllm team. vllm-ascend will support it once it's ready from vLLM. Follow the [official guide](https://docs.vllm.ai/en/v0.8.4/features/disagg_prefill.html) to use. [#432](https://github.com/vllm-project/vllm-ascend/pull/432)
- Spec decode feature works now. Currently it only works on V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500)
- Structured output feature works now on the V1 Engine. Currently it only supports the xgrammar backend; using the guidance backend may produce errors. [#555](https://github.com/vllm-project/vllm-ascend/pull/555)
@@ -1336,7 +1336,7 @@ This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [offi
### Highlights
- Add Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, `rotary_embedding`, is added. More tutorials will come soon. The Custom Ops compilation is disabled by default when installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371](https://github.com/vllm-project/vllm-ascend/pull/371)
- V1 engine is basic supported in this release. The full support will be done in 0.8.X release. If you hit any issue or have any requirement of V1 engine. Please tell us [here](https://github.com/vllm-project/vllm-ascend/issues/414). [#376](https://github.com/vllm-project/vllm-ascend/pull/376)
- V1 engine has basic support in this release. The full support will be done in the 0.8.X release. If you hit any issue or have any requirements for the V1 engine, please tell us in [this issue](https://github.com/vllm-project/vllm-ascend/issues/414). [#376](https://github.com/vllm-project/vllm-ascend/pull/376)
- Prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it. [#282](https://github.com/vllm-project/vllm-ascend/pull/282)
### Core
@@ -1394,13 +1394,13 @@ We are excited to announce the first release candidate of v0.7.1 for vllm-ascend
vLLM Ascend Plugin (vllm-ascend) is a community-maintained hardware plugin for running vLLM on the Ascend NPU. With this release, users can now enjoy the latest features and improvements of vLLM on the Ascend NPU.
Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.1) to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions [here](https://github.com/vllm-project/vllm-ascend/issues/19)
Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.1) to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions in [this issue](https://github.com/vllm-project/vllm-ascend/issues/19).
### Highlights
- Initial support for Ascend NPU on vLLM. [#3](https://github.com/vllm-project/vllm-ascend/pull/3)
- DeepSeek is now supported. [#88](https://github.com/vllm-project/vllm-ascend/pull/88) [#68](https://github.com/vllm-project/vllm-ascend/pull/68)
- Qwen, Llama series and other popular models are also supported, you can see more details in [here](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/supported_models.html).
- Qwen, Llama series and other popular models are also supported, you can see more details in [supported_models](https://github.com/vllm-project/vllm-ascend/blob/v0.7.1rc1/docs/source/user_guide/supported_models.md).
### Core
@@ -1415,6 +1415,6 @@ Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.1)
### Known Issues
- This release relies on an unreleased torch_npu version. It has been installed within official container image already. Please [install](https://docs.vllm.ai/projects/ascend/en/v0.7.1rc1/installation.html) it manually if you are using non-container environment.
- This release relies on an unreleased torch_npu version. It has already been installed within the official container image. Please [install](https://github.com/vllm-project/vllm-ascend/blob/v0.7.1rc1/docs/source/installation.md) it manually if you are using a non-container environment.
- There are logs like `No platform detected, vLLM is running on UnspecifiedPlatform` or `Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")` shown when running vllm-ascend. It actually doesn't affect any functionality or performance. You can just ignore it. And it has been fixed in this [PR](https://github.com/vllm-project/vllm/pull/12432) which will be included in v0.7.3 soon.
- There are logs like `# CPU blocks: 35064, # CPU blocks: 2730` shown when running vllm-ascend which should be `# NPU blocks:`. It actually doesn't affect any functionality or performance. You can just ignore it. And it has been fixed in this [PR](https://github.com/vllm-project/vllm/pull/13378) which will be included in v0.7.3 soon.

View File

@@ -1,6 +1,6 @@
# Supported Features
The feature support principle of vLLM Ascend is: **aligned with the vLLM**. We are also actively collaborating with the community to accelerate support.
The feature support principle of vLLM Ascend is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.
Function calling: <https://docs.vllm.ai/en/latest/features/tool_calling/>
@@ -37,7 +37,7 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th
- 🔴 NO plan/Deprecated: No plan or deprecated by vLLM.
[v1_user_guide]: https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html
[multimodal]: https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen-VL-Dense.html
[multimodal]: https://docs.vllm.ai/projects/ascend/en/latest/tutorials/models/Qwen-VL-Dense.html
[guided_decoding]: https://github.com/vllm-project/vllm-ascend/issues/177
[LoRA]: https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/lora.html
[graph_mode]: https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html