[ReleaseNote] Release note of v0.10.0rc1 (#2225)
### What this PR does / why we need it?
Release note of v0.10.0rc1
- vLLM version: v0.10.0
- vLLM main: 8e8e0b6af1
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
# Release note
## v0.10.0rc1 - 2025.08.07
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started. The V0 engine is completely removed in this version.
### Highlights
* Disaggregated prefill now works with the V1 engine. You can try it with DeepSeek models by following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). [#950](https://github.com/vllm-project/vllm-ascend/pull/950)
* The W4A8 quantization method is now supported for dense and MoE models. [#2060](https://github.com/vllm-project/vllm-ascend/pull/2060) [#2172](https://github.com/vllm-project/vllm-ascend/pull/2172)
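As a minimal sketch of serving such a checkpoint (the model path below is a placeholder, and it assumes the weights were already quantized to W4A8 offline, e.g. with the Ascend quantization toolchain), the model is loaded like any other while selecting the Ascend quantization backend:

```python
from vllm import LLM, SamplingParams

# Hypothetical path to a checkpoint that was quantized to W4A8 offline;
# quantization="ascend" selects the vLLM Ascend quantization backend.
llm = LLM(model="/path/to/model-w4a8", quantization="ascend")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```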
### Core
* The Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724` [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562), and CANN has been upgraded to `8.2.RC1` [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653). Don't forget to update them in your environment, or use the latest images.
* vLLM Ascend now works on Atlas 800I A3, and A3 images will be released starting from this version. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
* Kimi-K2 with W8A8 quantization, Qwen3-Coder, and GLM-4.5 are now supported in vLLM Ascend. Please follow this [tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_kimi.md.html) to try them out. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
* Pipeline parallelism is now supported in V1 (see the sketch after this list). [#1800](https://github.com/vllm-project/vllm-ascend/pull/1800)
* The prefix cache feature now works with the Ascend Scheduler (see the configuration sketch after this list). [#1446](https://github.com/vllm-project/vllm-ascend/pull/1446)
* Torchair graph mode now works with TP > 4. [#1508](https://github.com/vllm-project/vllm-ascend/issues/1508)
* MTP now supports Torchair graph mode. [#2145](https://github.com/vllm-project/vllm-ascend/pull/2145)
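For the pipeline parallelism item above, a minimal sketch (the model name and parallel sizes are placeholders, not taken from the PR) is to pass the standard vLLM parallelism arguments; nothing Ascend-specific is required:

```python
from vllm import LLM

# Hypothetical 2 x 2 layout: 2 pipeline stages, each stage sharded across 2 NPUs.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",   # placeholder model
    tensor_parallel_size=2,
    pipeline_parallel_size=2,     # pipeline parallelism now works with the V1 engine
)
```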
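Likewise, for the Ascend Scheduler prefix-caching and Torchair graph items above, both features are toggled through `additional_config`. The snippet below is an assumed configuration sketch (model and TP size are placeholders), not a copy of the linked PRs:

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder model
    tensor_parallel_size=8,            # TP > 4 now works with Torchair graph mode
    enable_prefix_caching=True,
    additional_config={
        "ascend_scheduler_config": {"enabled": True},  # Ascend Scheduler, now prefix-cache aware
        "torchair_graph_config": {"enabled": True},    # Torchair graph mode
    },
)
```
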
### Other
* Bug fixes:
    * Fix the functional problem of multi-modality models like Qwen2-Audio with Aclgraph. [#1803](https://github.com/vllm-project/vllm-ascend/pull/1803)
    * Fix the process group creation error in the external launch scenario. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
    * Fix the functional problem with guided decoding. [#2022](https://github.com/vllm-project/vllm-ascend/pull/2022)
    * Fix the accuracy issue with common MoE models in the DP scenario. [#1856](https://github.com/vllm-project/vllm-ascend/pull/1856)
* Performance improvements through a lot of PRs:
    * Cache sin/cos instead of calculating them in every layer. [#1890](https://github.com/vllm-project/vllm-ascend/pull/1890)
    * Improve shared expert multi-stream parallelism. [#1891](https://github.com/vllm-project/vllm-ascend/pull/1891)
    * Implement the fusion of allreduce and matmul in the prefill phase when TP is enabled. Enable this feature by setting `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to `1`. [#1926](https://github.com/vllm-project/vllm-ascend/pull/1926)
    * Optimize quantized MoE performance by reducing All2All communication. [#2195](https://github.com/vllm-project/vllm-ascend/pull/2195)
    * Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance. [#1806](https://github.com/vllm-project/vllm-ascend/pull/1806)
    * Use multicast to avoid padding decode requests to the prefill size. [#1555](https://github.com/vllm-project/vllm-ascend/pull/1555)
    * Improve LoRA performance. [#1884](https://github.com/vllm-project/vllm-ascend/pull/1884)
* A batch of refactoring PRs to enhance the code architecture:
    * Refactor the Torchair model runner. [#2205](https://github.com/vllm-project/vllm-ascend/pull/2205)
    * Refactor forward_context and model_runner_v1. [#1979](https://github.com/vllm-project/vllm-ascend/pull/1979)
    * Refactor AscendMetadata comments. [#1967](https://github.com/vllm-project/vllm-ascend/pull/1967)
    * Refactor Torchair utils. [#1892](https://github.com/vllm-project/vllm-ascend/pull/1892)
    * Refactor the Torchair worker. [#1885](https://github.com/vllm-project/vllm-ascend/pull/1885)
    * Register the activation CustomOp instead of overwriting forward_oot. [#1841](https://github.com/vllm-project/vllm-ascend/pull/1841)
* Parameter changes (see the environment-variable sketch at the end of this section):
    * `expert_tensor_parallel_size` in `additional_config` has been removed; EP and TP are now aligned with vLLM. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
    * Added the `VLLM_ASCEND_MLA_PA` environment variable; use it to enable the MLA paged attention operator for DeepSeek MLA decode.
    * Added the `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` environment variable to enable the `MatmulAllReduce` fusion kernel when tensor parallelism is enabled. This feature is supported on A2, and eager mode gets better performance from it.
    * Added the `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` environment variable to enable MoE all2all seq, which provides a basic framework on top of alltoall for easy expansion.
* UT coverage reached 76.34% after a batch of PRs following this RFC: [#1298](https://github.com/vllm-project/vllm-ascend/issues/1298)
* Sequence parallelism now works for Qwen3 MoE. [#2209](https://github.com/vllm-project/vllm-ascend/issues/2209)
* Chinese online documentation is now available. [#1870](https://github.com/vllm-project/vllm-ascend/issues/1870)
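For the environment variables listed under the parameter changes above, a rough illustration of typical usage is shown below; the model name and parallel sizes are placeholders, and the variables are simply opt-in switches set before the engine starts:

```python
import os

# Assumed usage: set the opt-in switches before the engine is created.
os.environ["VLLM_ASCEND_MLA_PA"] = "1"                   # MLA paged attention for DeepSeek MLA decode
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"  # MatmulAllReduce fusion when TP is enabled
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"   # MoE all2all seq framework

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder MoE model
    tensor_parallel_size=2,                # TP > 1 so the MatmulAllReduce path can apply
    enable_expert_parallel=True,           # EP/TP now follow vLLM's own arguments
)
```
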
### Known Issues
* Aclgraph currently does not work with DP + EP; the main gap is that the number of NPU streams Aclgraph needs to capture the graph is not sufficient. [#2229](https://github.com/vllm-project/vllm-ascend/issues/2229)
* There is an accuracy issue with W8A8 dynamically quantized DeepSeek when multistream is enabled. This will be fixed in the next release. [#2232](https://github.com/vllm-project/vllm-ascend/issues/2232)
* In Qwen3 MoE, SP cannot be incorporated into Aclgraph. [#2246](https://github.com/vllm-project/vllm-ascend/issues/2246)
* MTP does not support the V1 scheduler currently; this will be fixed in Q3. [#2254](https://github.com/vllm-project/vllm-ascend/issues/2254)
* When running MTP with DP > 1, the metrics logger needs to be disabled due to an issue in vLLM. [#2254](https://github.com/vllm-project/vllm-ascend/issues/2254)

## v0.9.1rc2 - 2025.08.04
This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/) to get started.

You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is the feature support matrix:

| Feature | Status | Next Step |
|---------|--------|-----------|
| Chunked Prefill | 🟢 Functional | Functional, see detail note: [Chunked Prefill][cp] |
| Automatic Prefix Caching | 🟢 Functional | Functional, see detail note: [vllm-ascend#732][apc] |
| LoRA | 🟢 Functional | [vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora] |
| Prompt adapter | 🔴 No plan | This feature has been deprecated by vLLM. |
| Speculative decoding | 🟢 Functional | Basic support |
| Pooling | 🟢 Functional | CI needed and adapting more models; V1 support relies on vLLM support. |
| Enc-dec | 🟡 Planned | vLLM should support this feature first. |
| LogProbs | 🟢 Functional | CI needed |
| Prompt logProbs | 🟢 Functional | CI needed |
| Async output | 🟢 Functional | CI needed |
| Multi step scheduler | 🔴 Deprecated | [vllm#8779][v1_rfc], replaced by [vLLM V1 Scheduler][v1_scheduler] |
| Best of | 🔴 Deprecated | [vllm#13361][best_of] |
| Beam search | 🟢 Functional | CI needed |
| Guided Decoding | 🟢 Functional | [vllm-ascend#177][guided_decoding] |
| Tensor Parallel | 🟢 Functional | Make TP > 4 work with graph mode |
| Pipeline Parallel | 🟢 Functional | Write official guide and tutorial. |
| Expert Parallel | 🟢 Functional | Dynamic EPLB support. |
| Data Parallel | 🟢 Functional | Data Parallel support for Qwen3 MoE. |
| Prefill Decode Disaggregation | 🟢 Functional | Functional, xPyD is supported. |
| Quantization | 🟢 Functional | W8A8 available; working on more quantization method support (W4A8, etc.) |
| Graph Mode | 🔵 Experimental | Experimental, see detail note: [vllm-ascend#767][graph_mode] |
| Sleep Mode | 🟢 Functional | |
[v1_user_guide]: https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html
[multimodal]: https://vllm-ascend.readthedocs.io/en/latest/tutorials/single_npu_multimodal.html
[best_of]: https://github.com/vllm-project/vllm/issues/13361
[guided_decoding]: https://github.com/vllm-project/vllm-ascend/issues/177
[v1_scheduler]: https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py
[v1_rfc]: https://github.com/vllm-project/vllm/issues/8779
[multilora]: https://github.com/vllm-project/vllm-ascend/issues/396
[v1 multilora]: https://github.com/vllm-project/vllm-ascend/pull/893
[graph_mode]: https://github.com/vllm-project/vllm-ascend/issues/767