
Release note

v0.8.5rc1

This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the official doc to start the journey. You can now enable the V1 engine by setting the environment variable VLLM_USE_V1=1; see the feature support status of vLLM Ascend here.
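
As a minimal sketch (the model name below is a placeholder, not part of this release note), opting in to the V1 engine for a serving run might look like:

```shell
# Opt in to the vLLM V1 engine before starting the server.
export VLLM_USE_V1=1

# Start an OpenAI-compatible server; replace the model with the one you use.
vllm serve Qwen/Qwen2.5-7B-Instruct
```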

Highlights

  • Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (--enable_prefix_caching) when V1 is enabled #747
  • Optimize Qwen2 VL and Qwen 2.5 VL #701
  • Improve Deepseek V3 eager mode and graph mode performance, now you can use --additional_config={'enable_graph_mode': True} to enable graph mode. #598 #719
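
A hedged sketch of enabling graph mode for DeepSeek, using the flag named in the bullet above (the exact quoting of the config value may vary by shell and vLLM version):

```shell
# Serve DeepSeek V3 with graph mode enabled via the additional-config flag
# mentioned in this release note.
vllm serve deepseek-ai/DeepSeek-V3 \
  --additional-config '{"enable_graph_mode": true}'
```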

Core

  • Upgrade vLLM to 0.8.5.post1 #715
  • Fix early return in CustomDeepseekV2MoE.forward during profile_run #682
  • Adapt to new quantized models generated by modelslim #719
  • Initial support on P2P Disaggregated Prefill based on llm_datadist #694
  • Use /vllm-workspace as the code path and include .git in the container image to fix issues when starting vLLM under /workspace #726
  • Optimize NPU memory usage to make DeepSeek R1 W8A8 work with a 32K model length. #728
  • Fix PYTHON_INCLUDE_PATH typo in setup.py #762

Other

  • Add Qwen3-0.6B test #717
  • Add nightly CI #668
  • Add accuracy test report #542

v0.8.4rc2

This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.

Highlights

  • Qwen3 and Qwen3MOE are supported now. Please follow the official doc to run the quick demo. #709
  • The Ascend W8A8 quantization method is supported now. Please refer to the official doc for an example. Any feedback is welcome. #580
  • DeepSeek V3/R1 works with DP, TP and MTP now. Please note that it's still in experimental status. Let us know if you hit any problem. #429 #585 #626 #636 #671
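
As a hedged sketch of the parallelism support above (showing only tensor parallelism; the model name and NPU count are placeholders you should tune for your cluster):

```shell
# Serve DeepSeek R1 with tensor parallelism across 8 NPUs.
# --tensor-parallel-size is a standard vLLM engine argument.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --trust-remote-code
```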

Core

  • The ACLGraph feature is supported with the V1 engine now. It's disabled by default because it relies on the CANN 8.1 release. We'll make it available by default in the next release #426
  • Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu, so users don't need to install torch-npu by hand; the 2.5.1 version of torch-npu will be installed automatically. #661
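
A minimal install sketch illustrating the point above (assuming a PyPI install; see the official doc for the authoritative installation steps for your environment):

```shell
# Installing vllm-ascend now resolves torch-npu 2.5.1 as a regular
# dependency; no manual torch-npu install step is needed.
pip install vllm-ascend
```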

Other

  • MiniCPM model works now. #645
  • The openEuler container image is supported with the v0.8.4-openeuler tag, and the custom ops build is enabled by default for openEuler OS. #689
  • Fix ModuleNotFoundError bug to make Lora work #600
  • Add "Using EvalScope evaluation" doc #611
  • Add a VLLM_VERSION environment variable to make the vLLM version configurable, helping developers set the correct vLLM version when the vLLM code has been modified locally. #651
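
A hedged sketch of using that variable during a local development install (the version string and editable-install command are illustrative assumptions):

```shell
# Tell vllm-ascend which vLLM version your locally modified checkout
# corresponds to, then install vllm-ascend against it.
export VLLM_VERSION=0.8.4
pip install -e .
```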

v0.8.4rc1

This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the official documentation.

Highlights

  • vLLM V1 engine experimental support is included in this version. You can visit the official guide for more detail. By default, vLLM will fall back to V0 if V1 doesn't work; set the VLLM_USE_V1=1 environment variable if you want to force V1.
  • LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #521
  • The Sleep Mode feature is supported. Currently it only works on the V0 engine; V1 engine support will come soon. #513

Core

  • The Ascend scheduler is added for the V1 engine. This scheduler is better tailored to Ascend hardware. More scheduling policies will be added in the future. #543
  • The Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by the vllm team; vllm-ascend will support it once it's ready in vLLM. Follow the official guide to use it. #432
  • The spec decode feature works now. Currently it only works on the V0 engine; V1 engine support will come soon. #500
  • The structured output feature works now on the V1 engine. Currently it only supports the xgrammar backend; using the guidance backend may produce errors. #555
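
A hedged sketch of selecting the validated backend at startup (the model name is a placeholder; the flag is a standard vLLM engine argument in this version line):

```shell
# Pin structured output to the xgrammar backend, the one validated
# in this release.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --guided-decoding-backend xgrammar
```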

Other

  • A new communicator, pyhccl, is added. It calls the CANN HCCL library directly instead of going through torch.distributed. More usage will be added in the next release #503
  • The custom ops build is enabled by default. You should install packages like gcc and cmake first to build vllm-ascend from source. Set the COMPILE_CUSTOM_KERNELS=0 environment variable to disable the compilation if you don't need it. #466
  • The custom op rotary_embedding is enabled by default now to improve performance. #555
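
The source-build workflow described above can be sketched as follows (repository URL per the vllm-project organization; adapt to your environment):

```shell
# Build vllm-ascend from source; gcc and cmake must be present because
# custom kernels are compiled by default in this release.
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .

# Or, to skip compiling the custom kernels entirely:
COMPILE_CUSTOM_KERNELS=0 pip install -e .
```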

v0.7.3rc2

This is the 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.

Highlights

  • Add the Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, rotary_embedding, is added. More tutorials will come soon. Custom ops compilation is disabled by default when installing vllm-ascend; set COMPILE_CUSTOM_KERNELS=1 to enable it. #371
  • The V1 engine is basically supported in this release. Full support will come in the 0.8.X releases. If you hit any issue or have any requirements for the V1 engine, please tell us here. #376
  • The prefix cache feature works now. You can set enable_prefix_caching=True to enable it. #282
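
A hedged sketch of turning the feature on for a server (the model name is a placeholder; the equivalent CLI form of the engine argument above uses dashes):

```shell
# Enable automatic prefix caching when starting the server.
vllm serve Qwen/Qwen2.5-7B-Instruct --enable-prefix-caching
```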

Core

  • Bump the torch_npu version to dev20250320.3 to improve accuracy and fix the !!! output problem. #406

Model

  • The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). #398

Other

  • Fixed a bug to make sure the multi step scheduler feature works. #349
  • Fixed a bug to make the prefix cache feature work with correct accuracy. #424

v0.7.3rc1

🎉 Hello, World! This is the first release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.

Highlights

  • DeepSeek V3/R1 works well now. Read the official guide to start! #242
  • Speculative decoding feature is supported. #252
  • Multi step scheduler feature is supported. #300

Core

  • Bump torch_npu version to dev20250308.3 to improve _exponential accuracy
  • Added initial support for pooling models. BERT-based models, such as BAAI/bge-base-en-v1.5 and BAAI/bge-reranker-v2-m3, work now. #229
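
A hedged sketch of serving one of the models named above in pooling mode (the task flag follows recent vLLM conventions; check the flag name against your installed vLLM version):

```shell
# Serve an embedding model; --task embed selects the pooling runner
# instead of text generation.
vllm serve BAAI/bge-base-en-v1.5 --task embed
```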

Model

  • The performance of Qwen2-VL is improved. #241
  • MiniCPM is now supported #164

Other

Known issues

  • In some cases, especially when the input/output is very long, the output may be inaccurate. We are working on it; it'll be fixed in the next release.
  • Improved and reduced the garbled code in model output. If you still hit the issue, try changing generation config values, such as temperature, and try again. There is also a known issue shown below. Any feedback is welcome. #277
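
The temperature workaround above can be applied per request through the OpenAI-compatible API; a hedged sketch (model name, port, and values are placeholders):

```shell
# If outputs look garbled, try a lower sampling temperature per request.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "Hello",
    "temperature": 0.2,
    "max_tokens": 64
  }'
```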

v0.7.1rc1

🎉 Hello, World!

We are excited to announce the first release candidate of v0.7.1 for vllm-ascend.

vLLM Ascend Plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU. With this release, users can now enjoy the latest features and improvements of vLLM on the Ascend NPU.

Please follow the official doc to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions here.

Highlights

  • Initial support for Ascend NPU on vLLM. #3
  • DeepSeek is now supported. #88 #68
  • Qwen, Llama series and other popular models are also supported; you can see more details here.

Core

  • Added the Ascend quantization config option; the implementation will come soon. #7 #73
  • Add silu_and_mul and rope ops and add mix ops into attention layer. #18

Other

  • [CI] Enable Ascend CI to actively monitor and improve quality for vLLM on Ascend. #3
  • [Docker] Add vllm-ascend container image #64
  • [Docs] Add a live doc #55

Known issues

  • This release relies on an unreleased torch_npu version. It has already been installed in the official container image. Please install it manually if you are using a non-container environment.
  • There are logs like No platform detected, vLLM is running on UnspecifiedPlatform or Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") shown when running vllm-ascend. They don't affect any functionality or performance; you can safely ignore them. This has been fixed in this PR, which will be included in v0.7.3 soon.
  • There are logs like # CPU blocks: 35064, # CPU blocks: 2730 shown when running vllm-ascend, where the label should read # NPU blocks:. This doesn't affect any functionality or performance; you can safely ignore it. It has been fixed in this PR, which will be included in v0.7.3 soon.