### What this PR does / why we need it?
Upgrade vllm commit to 0106
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Update the BiSheng version to 20260105.
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Remove the incorrectly depicted DCP all_gather operation from the prefill-stage PCP-for-GQA diagram.
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Fixes #3386
- Update Qwen3-30B-A3B.md to use A3-specific image tag
- Update Qwen3-Dense.md to provide both A2 and A3 image options
- Update Qwen3-Next.md to use A3-specific image for Atlas A3
environments
Previously, documentation only mentioned A2 images (vllm-ascend:version)
but Atlas A3 machines require A3-specific images
(vllm-ascend:version-a3). This change ensures users select the correct
image for their hardware.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: hu-qi <huqi1024@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Fixes #2727
- Add NNAL to the software requirements table with version information
- Add note explaining that prebuilt Docker images include NNAL
- Add warning message for manual installation when encountering
libatb.so errors
- Improve visibility of NNAL installation instructions to prevent
runtime errors
This addresses the issue where users encounter 'libatb.so not found'
errors due to missing NNAL installation in their environment.
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: hu-qi <huqi1024@gmail.com>
Co-authored-by: zhangyiming <34808445+menogrey@users.noreply.github.com>
### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.
The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.
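A minimal Python sketch of how such a colocated instance might be configured from the vLLM side, assuming vLLM's `KVTransferConfig`; the connector name, role string, and model are placeholders/assumptions rather than the tutorial's exact values:

```python
# Sketch only: the connector name and role are assumptions; consult the
# vllm-ascend + Mooncake tutorial for the exact configuration.
from vllm import LLM
from vllm.config import KVTransferConfig

# Each colocated instance both produces and consumes KV cache, with Mooncake
# pooling DRAM across nodes over RoCE so remote prefill results can be reused.
kv_config = KVTransferConfig(
    kv_connector="MooncakeStoreConnector",  # assumed connector name
    kv_role="kv_both",                      # PD colocated: producer + consumer
)

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",       # placeholder model
    kv_transfer_config=kv_config,
)
```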
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: release/v0.13.0
- vLLM main:
0bfd7484fd
---------
Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1. Refactor the eagle and mtp functions `load_model` and `generate_token_ids`.
2. Remove redundant code in the mtp and eagle files.
3. Refactor the unit tests for these files.
This is part 2/N of the refactor that merges mtp and eagle.
Relational RFC: https://github.com/vllm-project/vllm-ascend/issues/5467
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Unit tests and existing test cases.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Because the BiSheng installation path changed in the new version, the corresponding source path in the environment variables needs to be updated.
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This PR makes the following modifications:
1. Delete `user_guide/feature_guide/quantization-llm-compressor.md` and merge it into `user_guide/feature_guide/quantization.md`.
2. Update the content of `user_guide/feature_guide/quantization.md`.
3. Add guidance in `developer_guide/feature_guide/quantization.md` on adapting quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: IncSec <1790766300@qq.com>
Signed-off-by: InSec <1790766300@qq.com>
### What this PR does / why we need it?
Enable the KV pool decode node to save KV cache. Currently only MLA is supported.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
Fixed a typo in the environment variable name.
`ASCEBD_RT_VISIBLE_DEVICES` -> `ASCEND_RT_VISIBLE_DEVICES`
Fixes #5580
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in the disaggregated-prefill scenario.
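For illustration, a rough sketch of such a conversion on the receiving (decode) side, assuming torch_npu's `npu_format_cast` and the FRACTAL_NZ format id (29); the actual code path in this PR may differ:

```python
# Sketch only: assumes torch_npu.npu_format_cast and ACL_FORMAT_FRACTAL_NZ == 29.
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # NZ (fractal) layout id used by CANN

def received_kv_to_nz(kv_block: torch.Tensor) -> torch.Tensor:
    """Convert a KV cache block received in ND format to NZ format on the NPU."""
    return torch_npu.npu_format_cast(kv_block.npu(), ACL_FORMAT_FRACTAL_NZ)
```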
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
### What this PR does / why we need it?
Upgrade vllm commit to 1230
Affected by https://github.com/vllm-project/vllm/pull/27614 (and the
core PR https://github.com/vllm-project/vllm/pull/26866), we have to
make the following changes:
1. Modify `tests/e2e/multicard/test_aclgraph_capture_replay.py` to stay compatible with both vLLM `v0.13.0` and the latest main commit, now that vLLM enables async scheduling by default
2. Skip `test_guided_decoding.py` due to xgrammar errors
(https://github.com/vllm-project/vllm-ascend/issues/5524)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
---------
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Update new contributors.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: menogrey <1299267905@qq.com>
Fixes #3376
- Remove `--task embed` from the vllm serve command in Qwen3_embedding.md
- Remove the `task='embed'` parameter from the LLM constructor in the Python example
The `--task` parameter has been deprecated in recent vLLM versions in favor of automatic model type detection.
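For reference, a minimal sketch of the updated Python usage without the deprecated parameter (the model name is a placeholder; see Qwen3_embedding.md for the actual example):

```python
# Sketch: task='embed' is no longer passed; vLLM detects the model type itself.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-0.6B")  # placeholder model name
outputs = llm.embed(["Hello, world!"])
print(outputs[0].outputs.embedding[:8])
```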
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: hu-qi <huqi1024@gmail.com>
### What this PR does / why we need it?
Update the triton-ascend version to 1229 and the BiSheng version to 1225.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
- Fixes vLLM break:
1. [[BugFix] register quant scale tensors as buffer #31395](https://github.com/vllm-project/vllm/pull/31395)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add release note for v0.13.0rc1
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Modify the PCP tutorial doc. Because some optimization points have been submitted as PRs that are not yet merged, the performance data is updated now and will be refreshed again after those PRs are merged.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
This PR updates the documentation for DeepSeek-V3.1 and DeepSeek-R1 in the prefill-decode disaggregation scenario. It updates some PD-disaggregation-related settings and optimal configurations. The script has been verified.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
Modify the DeepSeek-R1/V3.1 documentation, mainly updating the MTP size and some other configs.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
This PR updates the Qwen3-235B doc to give a simple recipe for reproducing our latest performance on Atlas A3 servers.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Currently, `HCCL_OP_EXPANSION_MODE="AIV"` is causing some freezing issues on A2, so we have temporarily removed it from the documentation.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Add the long_sequence feature user guide.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
Modify the PCP tutorials: update the PCP performance statistics and add a note that the context parallel feature is currently only supported on Atlas A3 devices and will be supported on Atlas A2 in the future.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
1. Resolve errors with the KV pool used for KV transfer in PD disaggregation scenarios.
2. Update the KV pool documentation.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Provide sample guidance for running long-sequence DeepSeek across multiple nodes. A practical example is provided to guide users in using the context parallel feature.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Update vllm pin to 12.26
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Roll back the causal_conv1d_fn op from the Triton version to the torch version to fix hanging issues; meanwhile, update the Qwen3-Next doc.
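For context, a minimal sketch of what a torch-based causal 1D convolution computes (illustrative only; the actual rolled-back op in vllm-ascend may handle conv states and activations differently):

```python
# Illustrative causal depthwise conv1d in torch: left-pad so each position
# only sees the past, then run a grouped (per-channel) convolution.
import torch
import torch.nn.functional as F

def causal_conv1d_torch(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """x: (batch, channels, seqlen); weight: (channels, kernel_size), depthwise."""
    channels, kernel_size = weight.shape
    x = F.pad(x, (kernel_size - 1, 0))           # pad only on the left (causal)
    return F.conv1d(x, weight.unsqueeze(1), groups=channels)
```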
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This PR updates the DeepSeek-R1/V3.1 doc to give a simple recipe for reproducing our latest performance on Atlas A3/A2 servers.
### Does this PR introduce _any_ user-facing change?
No.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Add a developer guide for PCP & DCP.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
1. Refresh the additional config doc.
2. Move the KV config logic to the platform.
3. Improve the `dump_config` init logic and rename it to `dump_config_path`. This change is user-impacting: the value changes from a dict to a string (see the sketch after this list).
4. Correct the `enable_async_exponential` type.
5. Remove the useless `chunked_prefill_for_mla`.
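A minimal sketch of the user-impacting part of item 3, assuming `additional_config` is still passed through the engine args as before (the model and path are placeholders):

```python
# Sketch: dump_config (dict) is replaced by dump_config_path (string).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",                 # placeholder model
    additional_config={
        # old (no longer accepted): "dump_config": {...}
        "dump_config_path": "/tmp/vllm_ascend_dump",  # new: a string path
    },
)
```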
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Fix vLLM break in the PR:
1. [Drop v0.14 deprecations](https://github.com/vllm-project/vllm/pull/31285)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
### What this PR does / why we need it?
Update the configuration for optimal performance of DeepSeek V3.2 in the usage tutorial.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
`VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with `VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which is entirely useless. This PR removes it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add a `pa_shape_list` description to the Qwen dense tutorial.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
Fix vLLM break:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](https://github.com/vllm-project/vllm/pull/29558)
Fix solution: add the now-required `all2all_backend` parameter. Its only impact on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` when `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic.
2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](https://github.com/vllm-project/vllm/pull/30684)
Fix solution: the GPU does not need to convert qkv to 3D because its flash_attention operator accepts both the 4D (b s h d) and 3D (s b (h d)) layouts, but the NPU's flash_attention_unpad operator only supports the 3D (s b (h d)) layout. Therefore, we introduce a reshape_qkv_to_3d operation (see the sketch after this list).
3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the following issue with the upgraded vLLM code:
https://github.com/vllm-project/vllm-ascend/issues/5297
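A rough sketch of the reshape described in item 2, assuming a (b, s, h, d) source layout (the actual `reshape_qkv_to_3d` helper may operate on fused qkv or a different layout):

```python
# Sketch: collapse heads so the NPU flash_attention_unpad operator, which only
# accepts the 3D (s b (h d)) layout, can consume q/k/v.
import torch

def reshape_qkv_to_3d(x: torch.Tensor) -> torch.Tensor:
    """Reshape a (b, s, h, d) tensor into the 3D (s, b, h*d) layout."""
    b, s, h, d = x.shape
    return x.permute(1, 0, 2, 3).reshape(s, b, h * d)

# q, k, and v would each pass through this before the NPU attention call.
```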
### How was this patch tested?
Co-authored-by: zxwang <1476209578@qq.com>
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
### What this PR does / why we need it?
[Kthena](https://github.com/volcano-sh/kthena) is a Kubernetes-native
LLM inference platform that transforms how organizations deploy and
manage Large Language Models in production. Built with declarative model
lifecycle management and intelligent request routing, it provides high
performance and enterprise-grade scalability for LLM inference
workloads.
The platform extends Kubernetes with purpose-built Custom Resource
Definitions (CRDs) for managing LLM workloads, supporting multiple
inference engines (vLLM, SGLang, Triton) and advanced serving patterns
like prefill-decode disaggregation.
This PR adds an example of deploying an LLM on Ascend Kubernetes clusters.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
### What this PR does / why we need it?
[Doc] Add new contributors and related scripts.
Usage of scripts:
- `export GITHUB_TOKEN=<your github token>`
- `bash tools/collect_user_first_contribution.sh vllm-project/vllm-ascend <base_sha> <head_sha>` and save the result to a temporary file such as `contributors.txt`
- `python tools/format_contributors.py contributors.txt --start <start
index now>`
- Use the output to update the `contributors.md`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>