@@ -8,12 +8,12 @@ This capability supports heterogeneous parallelism strategies within a single mo
---
## Benefits of Finegrained TP
## Benefits of Fine-grained TP
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
- **Reduced Per-Device Memory Footprint**:
Fine-grained TP shards large weight matrices(e.g., LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
Fine-grained TP shards large weight matrices(e.g., LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
- **Faster Memory Access in GEMMs**:
In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
@@ -53,7 +53,7 @@ The Fine-Grained TP size for any component must:
@@ -56,12 +56,12 @@ All related code is under `vllm/distributed/ec_transfer`.
* *Scheduler role* – checks cache existence and schedules loads.
* *Worker role* – loads the embeddings into memory.
* **EPD Load Balance Proxy** -
* **EPD Load Balancing Proxy** -
* *Multi-Path Scheduling Strategy* - dynamically diverts the multimodal request or text requests to the corresponding inference path
* *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
For the PD disaggregation part, when using MooncakeLayerwiseConnector: The request first enters the Decoder instance,the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes KV Cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation.
`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
@@ -4,7 +4,7 @@ For larger-scale deployments especially, it can make sense to handle the orchest
In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
## Getting Start
## Getting Started
The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) is already natively supported by vLLM. In vllm-ascend we provide two enhanced functionalities:
# Distributed DP Server With Large-Scale Expert Parallelism
## Getting Start
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of PD nodes, thereby enabling more flexible model deployment. \
Taking the DeepSeek model as an example, using 8 Atlas 800T A3 servers to deploy the model. Assume the IP of the servers starts from 192.0.0.1 and ends by 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. And the prefiller nodes are deployed as master nodes independently, while the decoder nodes use the 192.0.0.5 node as the master node.
@@ -432,7 +432,7 @@ We mainly focus on quality and performance improvement in this release. The spec
- Fix weight transpose in RL scenarios [#5567](https://github.com/vllm-project/vllm-ascend/pull/5567)
- Adapted SP to eagle3 [#5562](https://github.com/vllm-project/vllm-ascend/pull/5562)
- Context Parallel(PCP&DCP) support mlapo [#5672](https://github.com/vllm-project/vllm-ascend/pull/5672)
- GLM4.6 support mtp with fullgraph [#5460](https://github.com/vllm-project/vllm-ascend/pull/5460)
- GLM4.6 support mtp with fullgraph [#5460](https://github.com/vllm-project/vllm-ascend/pull/5460)
- Flashcomm2 now works with oshard generalized feature [#4723](https://github.com/vllm-project/vllm-ascend/pull/4723)
- Support setting tp=1 for the Eagle draft model [#5804](https://github.com/vllm-project/vllm-ascend/pull/5804)
- Flashcomm1 feature now works with qwen3-vl [#5848](https://github.com/vllm-project/vllm-ascend/pull/5848)
@@ -575,7 +575,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/Design_Documents/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] A new graph mode `xlite` is introduced. It performs good with some models. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it.
- LLMdatadist kv connector is removed. Please use mooncake connector instead.
- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}` doesn't work anymore.
- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}}` doesn't work anymore.
- Torchair graph mode is removed. `--additional-config {"torchair_graph_config": {"enabled": true}}` doesn't work anymore. Please use aclgraph instead.
-`VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough. We enable it by default now.
- speculative decode method `Ngram` is back now.
@@ -585,7 +585,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
### Upgrade Note
- vLLM Ascend self maintained modeling file has been removed. The related python entrypoint is removed as well. So please uninstall the old version of vLLM Ascend in your env before upgrade.
- CANN is upgraded to 8.3.RC2, Pytorch and torch-npu are upgraded to 2.8.0. Don't forget to install them.
- CANN is upgraded to 8.3.RC2, PyTorch and torch-npu are upgraded to 2.8.0. Don't forget to install them.
- Python 3.9 support is dropped to keep the same with vLLM v0.12.0
### Known Issues
@@ -683,7 +683,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
- For long sequence input case, there is no response sometimes and the kv cache usage is become higher. This is a bug for scheduler. We are working on it.
- Qwen2-audio doesn't work by default, we're fixing it. Temporary solution is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
- When running Qwen3-Next with expert parallel enabled, please set `HCCL_BUFFSIZE` environment variable to a suitable value, such as 1024.
- The accuracy of DeepSeek3.2 with aclgraph is not correct. Temporary solution is to set `cudagraph_capture_sizes` to a suitable value depending on the batch size for the input.
- The accuracy of DeepSeek3.2 with aclgraph is not correct. Temporary solution is to set `agraph_capture_sizes` to a suitable value depending on the batch size for the input.
## v0.11.0rc0 - 2025.09.30
@@ -699,7 +699,7 @@ This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow
- DeepSeek works with aclgraph now. [#2707](https://github.com/vllm-project/vllm-ascend/pull/2707)
- MTP works with aclgraph now. [#2932](https://github.com/vllm-project/vllm-ascend/pull/2932)
- EPLB is supported now. [#2956](https://github.com/vllm-project/vllm-ascend/pull/2956)
- Mooncacke store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
- Mooncake store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
- CPU offload connector is supported now. [#1659](https://github.com/vllm-project/vllm-ascend/pull/1659)
### Others
@@ -828,7 +828,7 @@ Please note that this release note will list all the important changes from last
The following notes are especially for reference when upgrading from last final release (v0.7.3):
- V0 Engine is not supported from this release. Please always set `VLLM_USE_V1=1` to use V1 engine with vLLM Ascend.
- Mindie Turbo is not needed with this release. And the old version of Mindie Turbo is not compatible. Please do not install it. Currently all the function and enhancement is included in vLLM Ascend already. We'll consider to add it back in the future in needed.
- Mindie Turbo is not needed with this release. And the old version of Mindie Turbo is not compatible. Please do not install it. Currently all the function and enhancement is included in vLLM Ascend already. We'll consider to add it back in the future if needed.
- Torch-npu is upgraded to 2.5.1.post1. CANN is upgraded to 8.2.RC1. Don't forget to upgrade them.
### Core
@@ -893,7 +893,7 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN hase been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Don’t forget to update them in your environment or using the latest images.
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN has been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Don’t forget to update them in your environment or using the latest images.
- vLLM Ascend works on Atlas 800I A3 now, and the image on A3 will be released from this version on. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
- Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 is supported in vLLM Ascend, please following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.10.0rc1/docs/source/tutorials/multi_node_kimi.md) to have a try. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
- Pipeline Parallelism is supported in V1 now. [#1800](https://github.com/vllm-project/vllm-ascend/pull/1800)
@@ -1055,7 +1055,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
## v0.9.2rc1 - 2025.07.11
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, V1 engine will be enabled by default, there is no need to set `VLLM_USE_V1=1` any more. And this release is the last version to support V0 engine, V0 code will be clean up in the future.
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, V1 engine will be enabled by default, there is no need to set `VLLM_USE_V1=1` any more. And this release is the last version to support V0 engine, V0 code will be cleaned up in the future.
### Highlights
@@ -1074,7 +1074,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
### Others
- Official doc has been updated for better read experience. For example, more deployment tutorials are added, user/developer docs are updated. More guide will coming soon.
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
- Fix accuracy problem for Deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
- A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335)
- A new env variable `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` has been added to improve the performance of topk-topp sampling. The default value is 0, we'll consider to enable it by default in the future[#1732](https://github.com/vllm-project/vllm-ascend/pull/1732)
- A batch of bugs have been fixed for Data Parallelism case [#1273](https://github.com/vllm-project/vllm-ascend/pull/1273) [#1322](https://github.com/vllm-project/vllm-ascend/pull/1322) [#1275](https://github.com/vllm-project/vllm-ascend/pull/1275) [#1478](https://github.com/vllm-project/vllm-ascend/pull/1478)
@@ -1233,7 +1233,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
### Highlights
-Thisreleaseincludesallfeatureslandedinthepreviousreleasecandidates([v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2)).Andallthefeaturesarefullytestedandverified.Visittheofficialdocthegetthedetail [feature](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/suppoted_features.html) and [model](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/supported_models.html) supportmatrix.
-Thisreleaseincludesallfeatureslandedinthepreviousreleasecandidates([v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2)).Andallthefeaturesarefullytestedandverified.Visittheofficialdoctogetthedetail [feature](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/suppoted_features.html) and [model](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/supported_models.html) supportmatrix.
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.