diff --git a/README.md b/README.md
index bf02466a..7c81773d 100644
--- a/README.md
+++ b/README.md
@@ -63,7 +63,7 @@ Please use the following recommended versions to get started quickly:
 
 | Version | Release type | Doc |
 |------------|--------------|--------------------------------------|
-| v0.14.0rc1 | Latest release candidate | See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details |
+| v0.16.0rc1 | Latest release candidate | See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details |
 | v0.13.0 | Latest stable version | See [QuickStart](https://docs.vllm.ai/projects/ascend/en/v0.13.0/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/v0.13.0/installation.html) for more details |
 
 ## Contributing
@@ -86,7 +86,7 @@ Below are the maintained branches:
 
 | Branch | Status | Note |
 |------------|--------------|--------------------------------------|
-| main | Maintained | CI commitment for vLLM main branch and vLLM v0.13.0 tag |
+| main | Maintained | CI commitment for vLLM main branch and vLLM v0.16.0 tag |
 | v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
 | v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version, only bug fixes are allowed, and no new release tags anymore. |
 | v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
diff --git a/README.zh.md b/README.zh.md
index 930410da..99e31342 100644
--- a/README.zh.md
+++ b/README.zh.md
@@ -57,7 +57,7 @@ vLLM Ascend plugin (`vllm-ascend`) is a community-maintained plugin that enables vLLM to run on Ascend NP
 
 | Version | Release type | Doc |
 |------------|--------------|--------------------------------------|
-|v0.14.0rc1| Latest RC version |See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details|
+|v0.16.0rc1| Latest RC version |See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details|
 |v0.13.0| Latest official/stable version |See [QuickStart](https://docs.vllm.ai/projects/ascend/en/v0.13.0/quick_start.html) and [Installation Guide](https://docs.vllm.ai/projects/ascend/en/v0.13.0/installation.html) for more details|
 
 ## Contributing
@@ -80,7 +80,7 @@ vllm-ascend has a main branch and dev branches.
 
 | Branch | Status | Note |
 |------------|------------|---------------------|
-| main | Maintained | CI coverage for the vLLM main branch and the latest vLLM version (v0.13.0) |
+| main | Maintained | CI coverage for the vLLM main branch and the latest vLLM version (v0.16.0) |
 | v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
 | v0.7.3-dev | Maintained | CI coverage for vLLM v0.7.3; only bug fixes are allowed, and no new release tags anymore |
 | v0.9.1-dev | Maintained | CI coverage for vLLM v0.9.1 |
diff --git a/docs/source/community/versioning_policy.md b/docs/source/community/versioning_policy.md
index 7c0f5f30..ea52fe98 100644
--- a/docs/source/community/versioning_policy.md
+++ b/docs/source/community/versioning_policy.md
@@ -23,6 +23,7 @@ The table below is the release compatibility matrix for vLLM Ascend release.
 
 | vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | Triton Ascend |
 |-------------|-------------------|-----------------|-------------|---------------------------------|---------------|
+| v0.16.0rc1 | v0.16.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 |
 | v0.15.0rc1 | v0.15.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 |
 | v0.14.0rc1 | v0.14.1 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 |
 | v0.13.0 | v0.13.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.8.0.post2 | 3.2.0 |
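As a quick illustration of the new matrix row, here is a minimal install sketch. It assumes both packages are published under exactly these versions (the `pip_vllm_version` / `pip_vllm_ascend_version` values set in `docs/source/conf.py` below) and that the CANN/torch_npu environment from the Installation guide is already in place:

```bash
# Minimal sketch: pin vLLM and vLLM Ascend to the matched pair from the
# v0.16.0rc1 matrix row above. Assumes CANN and torch_npu are already set up.
pip install vllm==0.16.0 vllm-ascend==0.16.0rc1
```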
@@ -65,6 +66,7 @@ For main branch of vLLM Ascend, we usually make it compatible with the latest vL
 
 | Date | Event |
 |------------|-------------------------------------------|
+| 2026.03.10 | Release candidates, v0.16.0rc1 |
 | 2026.02.27 | Release candidates, v0.15.0rc1 |
 | 2026.02.06 | v0.13.0 Final release, v0.13.0 |
 | 2026.01.26 | Release candidates, v0.14.0rc1 |
@@ -122,7 +124,7 @@ Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend v
 
 | Branch | State | Note |
 | ---------- | ------------ | -------------------------------------------------------- |
-| main | Maintained | CI commitment for vLLM main branch and vLLM 0.13.0 tag |
+| main | Maintained | CI commitment for vLLM main branch and vLLM 0.16.0 tag |
 | releases/v0.13.0 | Maintained | CI commitment for vLLM 0.13.0 version |
 | v0.11.0-dev| Maintained | CI commitment for vLLM 0.11.0 version |
 | v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 39f883a1..b6b823ee 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -65,15 +65,15 @@ myst_substitutions = {
     # the branch of vllm, used in vllm clone
     # - main branch: 'main'
     # - vX.Y.Z branch: 'vX.Y.Z'
-    "vllm_version": "v0.15.0",
+    "vllm_version": "v0.16.0",
     # the branch of vllm-ascend, used in vllm-ascend clone and image tag
     # - main branch: 'main'
     # - vX.Y.Z branch: latest vllm-ascend release tag
-    "vllm_ascend_version": "v0.15.0rc1",
+    "vllm_ascend_version": "v0.16.0rc1",
     # the newest release version of vllm-ascend and matched vLLM, used in pip install.
     # This value should be updated when cutting a release.
-    "pip_vllm_ascend_version": "0.15.0rc1",
-    "pip_vllm_version": "0.15.0",
+    "pip_vllm_ascend_version": "0.16.0rc1",
+    "pip_vllm_version": "0.16.0",
     # CANN image tag
     "cann_image_tag": "8.5.0-910b-ubuntu22.04-py3.11",
     # vllm version in ci
diff --git a/docs/source/faqs.md b/docs/source/faqs.md
index 9ae99afb..cf4b77f5 100644
--- a/docs/source/faqs.md
+++ b/docs/source/faqs.md
@@ -2,6 +2,7 @@
 
 ## Version Specific FAQs
 
+- [[v0.16.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6969)
 - [[v0.15.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6838)
 - [[v0.13.0] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6583)
diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md
index 6b401cc0..8d6ed8e5 100644
--- a/docs/source/user_guide/release_notes.md
+++ b/docs/source/user_guide/release_notes.md
@@ -1,5 +1,96 @@
 # Release Notes
 
+## v0.16.0rc1 - 2026.03.09
+
+This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.
+
+### Highlights
+
+- Qwen3-Omni quantization adaptation and optimization are now available. [#6828](https://github.com/vllm-project/vllm-ascend/pull/6828)
+- GLM5-W8A8 quantization is now supported by parameterizing hardcoded MLA dimensions. [#6902](https://github.com/vllm-project/vllm-ascend/pull/6902)
+
+### Features
+
+- [Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. [#6806](https://github.com/vllm-project/vllm-ascend/pull/6806)
+- Qwen3-Next now supports FlashComm1. [#6830](https://github.com/vllm-project/vllm-ascend/pull/6830)
+- The NPUWorker Profiler now supports profile_prefix for a better profiling experience. [#6968](https://github.com/vllm-project/vllm-ascend/pull/6968)
+- EPLB profiling now displays an expert hotness comparison and the time required for EPLB adjustment. [#6877](https://github.com/vllm-project/vllm-ascend/pull/6877) [#7001](https://github.com/vllm-project/vllm-ascend/pull/7001)
+- Xlite Qwen3 MoE now supports Data Parallel. [#6715](https://github.com/vllm-project/vllm-ascend/pull/6715)
+- The Mooncake Layerwise Connector now supports kv_pool. [#7032](https://github.com/vllm-project/vllm-ascend/pull/7032)
+- Eagle3 now supports QuaRot quantization without embedding. [#7038](https://github.com/vllm-project/vllm-ascend/pull/7038)
+
+### Hardware and Operator Support
+
+- 310P now supports the w8a8sc quantization method. [#7075](https://github.com/vllm-project/vllm-ascend/pull/7075)
+- Added the AscendC causal_conv1d_fn operator for Qwen3-Next. [#6661](https://github.com/vllm-project/vllm-ascend/pull/6661)
+- Added the Ascend Ops recurrent_gated_delta_rule operator. [#6725](https://github.com/vllm-project/vllm-ascend/pull/6725)
+- Added a GMM custom operator for MoE models. [#7010](https://github.com/vllm-project/vllm-ascend/pull/7010)
+
+### Performance
+
+- Faster convolution computation improves TTFT by 0.95% and throughput by 0.59% for Qwen3-VL models. [#7017](https://github.com/vllm-project/vllm-ascend/pull/7017)
+- Optimize the split_qkv_rmsnorm_rope operator. [#6827](https://github.com/vllm-project/vllm-ascend/pull/6827)
+- Implement global CPU slicing and improve IRQ binding for Ascend NPUs, ensuring non-overlapping CPU partitions and better resource management. [#6945](https://github.com/vllm-project/vllm-ascend/pull/6945)
+- Optimize MTP execution by reordering the state update operation. [#6844](https://github.com/vllm-project/vllm-ascend/pull/6844)
+- Avoid a CPU sync in the mrope_positions copy by using a full tensor copy. [#7014](https://github.com/vllm-project/vllm-ascend/pull/7014)
+- Remove H2D synchronization for expert_map in MoE models. [#7000](https://github.com/vllm-project/vllm-ascend/pull/7000)
+
+### Dependencies
+
+- CANN is upgraded to 8.5.1. If you are not using the official image, please remember to upgrade it manually. [#6897](https://github.com/vllm-project/vllm-ascend/pull/6897)
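For the CANN note above, the path of least resistance is the official image, since its bundled runtime already matches the release. A minimal sketch follows; the registry path and the release-tag convention are assumptions carried over from earlier vllm-ascend releases (per the image-tag comment in `docs/source/conf.py`), not something this changelog states:

```bash
# Run the release image so the bundled CANN runtime matches the release;
# otherwise upgrade CANN on the host by hand.
# Registry and tag are assumed from prior releases.
docker pull quay.io/ascend/vllm-ascend:v0.16.0rc1
```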
+### Deprecation & Breaking Changes
+
+- The `enable_flash_comm_v1` config option has been renamed back to `enable_sp`. [#6883](https://github.com/vllm-project/vllm-ascend/pull/6883)
+- Auto-detection of the quantization format from model files has been reverted. In v0.16.0rc1, you still need to pass `--quantization ascend` to serve a model quantized by ModelSlim. Auto-detection will be added back in the next version, once the bug with remote model IDs is fixed. [#6873](https://github.com/vllm-project/vllm-ascend/pull/6873)
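To make the revert above concrete, here is a minimal serving sketch. The model path is a placeholder; only the `--quantization ascend` flag itself comes from this changelog:

```bash
# v0.16.0rc1 does not auto-detect the quantization format, so the flag must
# be passed explicitly for ModelSlim-quantized weights.
# /path/to/modelslim-quantized-model is a placeholder.
vllm serve /path/to/modelslim-quantized-model --quantization ascend
```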
+### Documentation
+
+- Added a user/developer guide for CPU binding. [#7045](https://github.com/vllm-project/vllm-ascend/pull/7045)
+- Added metrics usage documentation and an example. [#6962](https://github.com/vllm-project/vllm-ascend/pull/6962)
+- Added llms.txt for LLM discovery. [#6886](https://github.com/vllm-project/vllm-ascend/pull/6886)
+- Added a GLM4.x multi-node deployment tutorial. [#6872](https://github.com/vllm-project/vllm-ascend/pull/6872)
+- Added an explanation of the 310P-specific parameter max-model-len. [#7065](https://github.com/vllm-project/vllm-ascend/pull/7065)
+
+### Others
+
+- Fix openEuler Dockerfile error. [#6871](https://github.com/vllm-project/vllm-ascend/pull/6871)
+- Many bug fixes, including:
+  - Fix Eagle speculative decoding with Context Parallel enabled. [#6981](https://github.com/vllm-project/vllm-ascend/pull/6981) [#7079](https://github.com/vllm-project/vllm-ascend/pull/7079)
+  - Fix a LoRA accuracy issue introduced by upstream vLLM changes. [#6958](https://github.com/vllm-project/vllm-ascend/pull/6958)
+  - Fix the streaming content-type in the load balance proxy server. [#6985](https://github.com/vllm-project/vllm-ascend/pull/6985)
+  - Fix a metadata execution error: integer modulo by zero. [#6521](https://github.com/vllm-project/vllm-ascend/pull/6521)
+  - Fix a Triton rope_siso implementation bug. [#7082](https://github.com/vllm-project/vllm-ascend/pull/7082)
+  - Fix an incorrect layer count for MTP models in update_aclgraph_sizes. [#7064](https://github.com/vllm-project/vllm-ascend/pull/7064)
+  - Fix compilation errors for CANN versions later than b020. [#7059](https://github.com/vllm-project/vllm-ascend/pull/7059)
+  - Fix quant config support in GLM4.6V. [#7062](https://github.com/vllm-project/vllm-ascend/pull/7062)
+  - Fix a parameter ordering bug in _merge_multimodal_embeddings. [#7068](https://github.com/vllm-project/vllm-ascend/pull/7068)
+  - Fix a fused mc2 bug in EPLB. [#6794](https://github.com/vllm-project/vllm-ascend/pull/6794)
+  - Fix the kernel block size used when computing the slot mapping. [#7019](https://github.com/vllm-project/vllm-ascend/pull/7019)
+  - Fix a layerwise stacking MTP error in P/D disaggregation. [#7036](https://github.com/vllm-project/vllm-ascend/pull/7036)
+  - Fix the RoPE dimension for npu_rotary_embedding. [#6880](https://github.com/vllm-project/vllm-ascend/pull/6880)
+  - Fix Qwen-Omni quantization bugs. [#7042](https://github.com/vllm-project/vllm-ascend/pull/7042) [#7007](https://github.com/vllm-project/vllm-ascend/pull/7007)
+  - Fix GDN layer accuracy in graph mode. [#6822](https://github.com/vllm-project/vllm-ascend/pull/6822)
+  - Fix precision bugs for PCP/DCP in P/D disaggregation. [#6876](https://github.com/vllm-project/vllm-ascend/pull/6876)
+  - Fix MTP in P/D disaggregation with fullgraph support for all D-nodes. [#6948](https://github.com/vllm-project/vllm-ascend/pull/6948)
+  - Fix a GQA model error when enabling both DP and DCP. [#7012](https://github.com/vllm-project/vllm-ascend/pull/7012)
+  - Fix an edge case where an MTP prefill was misclassified as decode. [#6835](https://github.com/vllm-project/vllm-ascend/pull/6835)
+  - Fix the Eagle3 acceptance rate for QuaRot-quantized models. [#6914](https://github.com/vllm-project/vllm-ascend/pull/6914)
+  - Fix a RoPE shape mismatch for MTP models with FlashComm V1 enabled. [#6939](https://github.com/vllm-project/vllm-ascend/pull/6939)
+  - Fix a Qwen2.5VL accuracy issue. [#6975](https://github.com/vllm-project/vllm-ascend/pull/6975)
+  - Fix a MoE forward error with the static kernel enabled. [#6964](https://github.com/vllm-project/vllm-ascend/pull/6964)
+  - Fix muls_add fusion for GLM5 models. [#6928](https://github.com/vllm-project/vllm-ascend/pull/6928)
+  - Fix GDN layer detection for multimodal models. [#6941](https://github.com/vllm-project/vllm-ascend/pull/6941)
+  - Fix a 300I unquantized model weight nd2nz error. [#6851](https://github.com/vllm-project/vllm-ascend/pull/6851)
+  - Fix CPU binding logic. [#6889](https://github.com/vllm-project/vllm-ascend/pull/6889)
+  - Fix Eagle fullgraph shape capture. [#6846](https://github.com/vllm-project/vllm-ascend/pull/6846)
+
+### Known Issues
+
+- Currently, for DeepSeek V3.2, PCP & DCP do not yet work with the FlashComm1 feature, which may cause serving errors or other unknown errors.
+- In a 4-node A3 P/D disaggregation deployment with DeepSeek V3.2, the P-node may hang when benchmarking under high concurrency, e.g., 2K/2K tokens with 512 concurrent requests.
+- MTP with large EP configurations may cause a graph capture buffer overflow. This is a bug that needs to be fixed in vLLM; for now there is a workaround: explicitly set `--compilation-config '{"max_cudagraph_capture_size": N}'`, where `N = max_concurrency × (1 + num_speculative_tokens)`.
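As a worked instance of the workaround above, with illustrative numbers that are not from this changelog: at 512 concurrent requests with one speculative token, N = 512 × (1 + 1) = 1024, and the server would be launched as:

```bash
# Size the graph capture buffer for MTP with large EP:
# N = max_concurrency * (1 + num_speculative_tokens) = 512 * (1 + 1) = 1024.
# The model path is a placeholder.
vllm serve /path/to/mtp-model \
  --compilation-config '{"max_cudagraph_capture_size": 1024}'
```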
+
 ## v0.15.0rc1 - 2026.02.27
 
 This is the first release candidate of v0.15.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.