Commit Graph

621 Commits

Author SHA1 Message Date
Mengqing Cao
1b5d5abf86 [ReleaseNote] Add release note for v0.13.0rc1 (#5334)
### What this PR does / why we need it?
Add release note for v0.13.0rc1

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-27 18:46:57 +08:00
weiguihua2
c30c3dc831 [Doc]modify pcp tutorial doc (#5440)
### What this PR does / why we need it?
modify pcp tutorial doc

Because some optimization points have been submitted as PRs and haven't
been merged yet, I'll update the performance data now and refresh it
again after the PRs are merged.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-27 17:47:09 +08:00
MengLong Chen
b8b5521f5b [Doc] Update DeepSeek V3.1/R1 2P1D doc (#5387)
### What this PR does / why we need it?
The PR updates the documentation for DeepSeek-V3.1 and DeepSeek-R1 in
the scenario of prefill-decode disaggregation.

Updated some PD separation-related setting parameters and optimal
configurations. This script has been verified.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-12-27 17:28:43 +08:00
cookieyyds
843751768e [DOC]Fix model weight download links (#5436)
Updated download links for DeepSeek-V3.2 model weights.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
2025-12-27 17:14:31 +08:00
Zhu Yi Lin
04104031d0 [Doc] Modify DeepSeek-R1/V3.1 documentation (#5426)
### What this PR does / why we need it?
Modify DeepSeek-R1/V3.1 documentation. Mainly update the mtp size and some other configs.

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-27 17:13:58 +08:00
Angazenn
eab306b09c [doc] Update Qwen3-235B doc for reproducing latest performance (#5323)
### What this PR does / why we need it?
This PR updates the Qwen3-235B doc to give a simple recipe for reproducing
our latest performance on Atlas A3 servers.

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: Angazenn <supperccell@163.com>
2025-12-27 15:55:58 +08:00
Zhu Yi Lin
be2a947521 [Doc] delete environment variable HCCL_OP_EXPANSION_MODE in DeepSeekV3.1/R1 (#5419)
### What this PR does / why we need it?
Currently, HCCL_OP_EXPANSION_MODE="AIV" is causing some freezing issues
on A2, so we have temporarily removed it from the documentation.

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-27 12:44:50 +08:00
LookAround0301
ca31d6823e [Doc] add long_sequence feature user guide (#5343)
### What this PR does / why we need it?
add long_sequence feature user guide

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-12-27 10:44:43 +08:00
weiguihua2
69f96950e1 [Doc] modify pcp tutorials (#5411)
### What this PR does / why we need it?
modify pcp tutorials

modify pcp perf statistics and add a note: the context parallel feature is
currently only supported on the Atlas A3 device, and will be supported on
Atlas A2 in the future.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-27 10:36:10 +08:00
fems14
2ef4d1979e [bugfix][main]KV Pool for KV Transfer in PD Disaggregation Scenarios (#5398)
### What this PR does / why we need it?
1. Fix errors in the KV Pool for KV transfer in PD disaggregation scenarios
2. Update the KV Pool documentation

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-12-27 09:53:57 +08:00
weiguihua2
ce52e17bf3 [Doc]add long sequence tutorials (#5364)
### What this PR does / why we need it?
Provide sample guidance for running long-sequence DeepSeek across
multiple nodes

To guide users on using the context parallel feature, a practical
example is provided.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-27 09:52:11 +08:00
ZT-AIA
1d8aa892bf Update vllm pin to 12.26 (#5378)
### What this PR does / why we need it?
Update vllm pin to 12.26
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-26 23:44:48 +08:00
LeeWenquan
7685d0c239 rollback causal_conv1d_fn to torch ops & update qwen3Next doc (#5391)
### What this PR does / why we need it?
Roll back the causal_conv1d_fn op from the Triton version to the torch
version to fix hanging issues; meanwhile, update the Qwen3Next doc.

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-12-26 19:57:38 +08:00
Zhu Yi Lin
06732dbf5b [Doc] update R1/V3.1 doc (#5383)
### What this PR does / why we need it?
This PR updates the DeepSeek-R1/V3.1 doc to give a simple recipe for
reproducing our latest performance on Atlas A3/A2 servers.
### Does this PR introduce any user-facing change?
No.

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-26 17:09:22 +08:00
zhangsicheng5
8ed87dfa84 [doc] Add context parallel user guide (#5358)
1. Add context parallel user guide
2. Add context parallel related message in supported features/models
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-12-26 17:03:47 +08:00
Qiu
da0b113cf5 [doc]<PCP&DCP> add developer guide for PCP&DCP (#5372)
### What this PR does / why we need it?
add developer guide for PCP&DCP
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2025-12-26 16:17:38 +08:00
wangxiyuan
29d2fe653d cleanup ascend config (#5296)
1. refresh the additional config doc
2. move the KV config logic to platform.
3. improve the `dump_config` init logic and rename it to `dump_config_path`.
This change is user-impacting: dump_config is changed from a dict to a
string.
4. correct the `enable_async_exponential` type
5. remove the useless `chunked_prefill_for_mla`
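A minimal before/after sketch of point 3 above (the nested layout of the old dict value is a hypothetical illustration; only the rename and the dict-to-string change come from the list):

```python
# Old style (assumed shape): dump_config took a dict.
old_additional_config = {"dump_config": {"path": "/tmp/vllm_dump"}}
# New style: dump_config_path takes a plain string path.
new_additional_config = {"dump_config_path": "/tmp/vllm_dump"}
```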

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-26 14:07:37 +08:00
ZT-AIA
adaa89a7a5 Update vllm pin to 12.25 (#5342)
### What this PR does / why we need it?
- Fix vllm break in the PR:
1. [Drop v0.14 deprecations](https://github.com/vllm-project/vllm/pull/31285)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
2025-12-26 14:05:40 +08:00
cookieyyds
2da8038dd2 [doc] update using command (#5373)
### What this PR does / why we need it?
Update the configuration for optimal performance of deepseek v3.2 in the usage tutorial.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-25 22:28:35 +08:00
wangxiyuan
2ae0bad96d Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272)
`VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with
`VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which is entirely useless. This PR
removes it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-25 11:09:56 +08:00
Nengjun Ma
42c989a437 Update vllm pin to 12.24 (#5307)
### What this PR does / why we need it?
Fix vllm break in the PR:
1. [Add MiMo-V2-Flash support](https://github.com/vllm-project/vllm/pull/30836)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
2025-12-24 17:24:31 +08:00
ZYang6263
a3f65b938f [Doc] Add pa_shape_list description to qwen dense tutorial (#5225)
### What this PR does / why we need it?
Add pa_shape_list description to qwen dense tutorial.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
2025-12-24 14:40:20 +08:00
Nengjun Ma
3b59f20a28 update to vllm 12-19 (#5223)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Fix vllm break:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4%
TTFT improvement](https://github.com/vllm-project/vllm/pull/29558)
Fix solution: add the now-necessary `all2all_backend` parameter. The only
impact of this parameter on the original `set_splitting_ops_for_v1`
implementation is that graph mode is disabled in `vllm` if
`deepep_high_throughput` is enabled; it has no effect on the
`vllm-ascend` logic.

2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention
interface](https://github.com/vllm-project/vllm/pull/30684)
Fix solution: the GPU does not need to convert qkv to 3D because its
flash_attention operator is compatible with both the 4D layout (b s h d)
and the 3D layout (s b (h d)), but the NPU's flash_attention_unpad
operator only supports the 3D layout (s b (h d)). Therefore, we need to
introduce the reshape_qkv_to_3d operation.

3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it has the following
issue after the vllm code upgrade:
https://github.com/vllm-project/vllm-ascend/issues/5297
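The qkv reshape described above can be sketched as follows. NumPy stands in for the actual tensor library, and the exact 3D target layout `(s, b, h*d)` is inferred from the description, so treat both as assumptions rather than the real operator:

```python
import numpy as np

def reshape_qkv_to_3d(qkv: np.ndarray) -> np.ndarray:
    # Collapse a 4D (b, s, h, d) tensor into the 3D (s, b, h*d) layout
    # that an unpadded attention kernel expects (layout details assumed).
    b, s, h, d = qkv.shape
    return qkv.transpose(1, 0, 2, 3).reshape(s, b, h * d)

q = np.zeros((2, 16, 8, 64))  # batch=2, seq=16, heads=8, head_dim=64
assert reshape_qkv_to_3d(q).shape == (16, 2, 512)
```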

### How was this patch tested?


Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
2025-12-23 23:52:11 +08:00
Tiger Xu / Zhonghu Xu
cb963c53a5 [Doc] Added deploying on k8s with kthena (#4674)
### What this PR does / why we need it?
[Kthena](https://github.com/volcano-sh/kthena) is a Kubernetes-native
LLM inference platform that transforms how organizations deploy and
manage Large Language Models in production. Built with declarative model
lifecycle management and intelligent request routing, it provides high
performance and enterprise-grade scalability for LLM inference
workloads.

The platform extends Kubernetes with purpose-built Custom Resource
Definitions (CRDs) for managing LLM workloads, supporting multiple
inference engines (vLLM, SGLang, Triton) and advanced serving patterns
like prefill-decode disaggregation.

This PR adds an example of deploying an LLM on Ascend Kubernetes clusters.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
2025-12-23 17:46:04 +08:00
rongfu.leng
c9b5881bcd [Doc] fix docs set rope_theta value is 10e6 in qwen3-235b model (#5258)
### What this PR does / why we need it?

Fixes https://github.com/vllm-project/vllm-ascend/issues/5201

### Does this PR introduce _any_ user-facing change?
No, doc only

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: rongfu.leng <lenronfu@gmail.com>
2025-12-23 10:21:46 +08:00
zhangyiming
35dbdbb398 [Doc] Add new contributors and relative scripts. (#5070)
### What this PR does / why we need it?
[Doc] Add new contributors and related scripts.
Usage of scripts:
- `export GITHUB_TOKEN=<your github token>`
- `bash tools/collect_user_first_contribution.sh
vllm-project/vllm-ascend <base_sha> <head_sha>` and save the result to
one temporary file such as `contributors.txt`
- `python tools/format_contributors.py contributors.txt --start <start
index now>`
- Use the output to update the `contributors.md`


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-23 10:01:45 +08:00
zhangyiming
f883a2edb9 [Doc] Update the weight download URL. (#5238)
### What this PR does / why we need it?
Update the weight download URL. Because the model was renamed.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-23 08:53:30 +08:00
lvjunqi
55beac9c91 [Feat]Xlite Qwen3-vl Support (#5228)
### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following
link: https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.

### How was this patch tested?
vLLM version: v0.12.0 

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
2025-12-22 16:30:52 +08:00
zhangyiming
dc047489c7 [Doc] Fix DeepSeek-V3.2 tutorial. (#5190)
### What this PR does / why we need it?
Fix DeepSeek-V3.2 tutorial.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-22 11:30:17 +08:00
YuhanBai
5d02eed16f [Performance] Add async exponential while model executing (#4501)
### What this PR does / why we need it?
Add a control to enable the exponential distribution operator
overlapping with model executing (default is OFF due to this feature
might not perform well on MOE models, i.e. For Qwen3-30B).
Enable async exponential overlapping will provides performance
improvement.
Also, overlapping the exponential operator with module execution can
cover the performance drop introduced by AICPU-version's exponential
operator.

**UPDATE** (12/12):
Our overlap now uses the same stream that was introduced in PR #4908.
We moved `do_async_exponential` from `model_runner_v1.py` to
`sampler.py`.
Async exponential is now enabled via `additional_config`:
add `"enable_async_exponential": 1` to `additional_config`.
We now **ONLY** support the default exponential/AI-CPU exponential; the
old `"enable_async_exponential": 2` option has been dropped for
consistency.

### Does this PR introduce _any_ user-facing change?
**YES**: added a new `additional_config` option,
`"enable_async_exponential": 1`.
When `enable_async_exponential` is set to 1, we enable the async
exponential and overlap it with the model runner.
When it is set to 0 (the default), we disable the async exponential, but
the exponential still runs on a different stream, using the stream
introduced in #4908.
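A hedged example of turning the option on when serving; the model name is a placeholder, and passing the JSON via `--additional-config` is an assumption based on how other `additional_config` options are usually supplied:

```shell
# Illustrative only: enable async exponential overlap (1 = on, 0 = off/default)
vllm serve Qwen/Qwen3-8B \
  --additional-config '{"enable_async_exponential": 1}'
```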

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>
2025-12-20 21:23:21 +08:00
wangxiyuan
758d81dcb1 Drop 0.12.0 support (#5146)
We decided to release v0.13.0 soon. So no need to support 0.12.0 now.
Let's drop it.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-20 09:38:53 +08:00
zzhxxx
17f2eead99 [Doc]Add the user_guide doc file regarding fine-grained TP. (#5084)
### What this PR does / why we need it?
Add user guide for **Fine-Grained Tensor Parallelism** feature.  
Documents usage, supported components (`embedding`, `lm_head`, `o_proj`,
`mlp`/`dense_ffn`), model compatibility, and deployment guidelines.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-19 16:37:25 +08:00
luluxiu520
bc05a81bf2 Add Qwen3-VL-235B-A22B-Instruct tutorials (#5167)
### What this PR does / why we need it?

This PR provides an introduction to the Qwen3-VL-235B-A22B-Instruct
model, details on the features supported by the model in the current
version, the model deployment process, as well as methods for
performance testing and accuracy testing.

With this document, the deployment and testing of the
Qwen3-VL-235B-A22B-Instruct model can be implemented more easily.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: luluxiu520 <l2625793@outlook.com>
2025-12-19 14:56:17 +08:00
Li Wang
5ab6d124e5 [Doc] Add a perf tune section (#5127)
### What this PR does / why we need it?
The purpose of this patch is to:
1. add an OS-level section to the perf tuning doc
2. set some default envs in the image for performance

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 14:52:52 +08:00
1092626063
f952de93df [Doc] Deepseekv3.1/R1 doc enhancement (#4827)
### What this PR does / why we need it?

DeepSeek-V3.1 and DeepSeek-R1 doc enhancement

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-19 10:52:33 +08:00
weichen
ca6f631cba [2/N][Pangu][MoE] Remove Pangu Related Code (#5130)
### What this PR does / why we need it?
Remove Pangu Related Code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-19 09:00:07 +08:00
zxr2333
073a3a6e6c [Doc][P/D] Fix MooncakeConnector's name (#5172)
### What this PR does / why we need it?
The vLLM community has integrated its own MooncakeConnector. The original
scripts will now find that MooncakeConnector instead of the one from
vLLM-Ascend, so all scripts that use the MooncakeConnector need to be
updated to a new name.

### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector.

### How was this patch tested?
By CI.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-12-18 22:29:19 +08:00
Li Wang
7d32371b7e [Doc] Refact benchmark doc (#5173)
### What this PR does / why we need it?
Refactor some outdated doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-18 22:26:13 +08:00
wangxiyuan
0f571c347b Nominate new maintainers @zzzzwwjj @realliujiaxu @LCAIZJ (#5152)
I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend
committer team.

@zzzzwwjj
---
- Review Quality:
He has completed 80+ reviews since April 2025, including high-quality
reviews such as
https://github.com/vllm-project/vllm-ascend/pull/3232#issuecomment-3506110786,
https://github.com/vllm-project/vllm-ascend/pull/4822#discussion_r2601661204,
and
https://github.com/vllm-project/vllm-ascend/pull/4768#issuecomment-3644795995.

- Sustained Contributions:
15+ valuable bug fixes and refactors:
https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved
Continuous optimization of the code architecture:
https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged

- Quality Contribution‌:
https://github.com/vllm-project/vllm-ascend/pull/1229
https://github.com/vllm-project/vllm-ascend/pull/1979
https://github.com/vllm-project/vllm-ascend/pull/4359
https://github.com/vllm-project/vllm-ascend/pull/4878

- Community Involvement‌: 
He led https://github.com/vllm-project/vllm-ascend/issues/1147 to
refactor AscendFusedMoE for the first time.
He shared topics about large-scale distributed inference and
reinforcement learning on vLLM-Ascend meetup on August 2nd.

@realliujiaxu
---
- Review Quality‌:
He has completed about [40+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+)
since September, including
https://github.com/vllm-project/vllm-ascend/pull/4868#discussion_r2605549015,
https://github.com/vllm-project/vllm-ascend/pull/2275#discussion_r2268455665.

- Sustained Contributions:
He has completed [17
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged),
continuously optimizing the performance of the MoE model.

- Quality Contribution‌:

Contributed the Flash Comm1 feature to the community, supporting both
eager and aclgraph execution modes, while compatible with multiple MoE
models including DeepSeek and GLM4.5.
  - https://github.com/vllm-project/vllm-ascend/pull/3334
  - https://github.com/vllm-project/vllm-ascend/pull/3420
  - https://github.com/vllm-project/vllm-ascend/pull/3015
  
  co-author:
  - https://github.com/vllm-project/vllm-ascend/pull/3495
  - https://github.com/vllm-project/vllm-ascend/pull/4868

- Community Involvement‌: 
1. Completed two major refactors, enabling vllm-ascend to evolve more
rapidly and robustly: [Linear
module](https://github.com/vllm-project/vllm-ascend/pull/2867) and
[rejection
sampler](https://github.com/vllm-project/vllm-ascend/pull/4975)
2. [fixed 8
bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+)
in graph mode, spec decoding and async scheduling.

@LCAIZJ
---
- Review Quality‌: He's been the go-to reviewer for virtually all PD
disaggregation and KV Pool related PRs, having completed [30+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+)
since May 2025. Notable examples include
[discussion_r2553887360](https://github.com/vllm-project/vllm-ascend/pull/4345#discussion_r2553887360),
[issuecomment-3540994801](https://github.com/vllm-project/vllm-ascend/pull/4161#issuecomment-3540994801),
and
[discussion_r2492593988](https://github.com/vllm-project/vllm-ascend/pull/3981#discussion_r2492593988),
all demonstrating thorough and insightful feedback.
- Sustained and Quality Contributions: His contributions reflect a
strong grasp of both ‌vLLM‌ and ‌vLLM Ascend‌ codebases, particularly in
prefill-decode disaggregation and KV pool areas ([7 PRs
merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)).
Prefill-Decode Disaggregation: Delivered KV transfer functionality using
Mooncake TransferEngine and enabled layerwise KV transfer
https://github.com/vllm-project/vllm-ascend/pull/1568
https://github.com/vllm-project/vllm-ascend/pull/2602
KV Pool: Developed the foundational KV Pool infrastructure and migrated
it to the latest ADXL stack
https://github.com/vllm-project/vllm-ascend/pull/2913
https://github.com/vllm-project/vllm-ascend/pull/3350
- Quality Contribution‌:
https://github.com/vllm-project/vllm-ascend/pull/1568
https://github.com/vllm-project/vllm-ascend/pull/2602
https://github.com/vllm-project/vllm-ascend/pull/2913
https://github.com/vllm-project/vllm-ascend/pull/3350
- Community Involvement‌: 
He actively responds to [community
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ),
continuously monitors functionality and accuracy issues related to PD
disaggregation and KV Pool, and proactively delivers [bug
fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix).
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 18:49:07 +08:00
Ronald
b69b04d3a9 implement model runner v2 basic framework (#5051)
### What this PR does / why we need it?
This PR aims to implement the model runner v2 basic framework in
vllm-ascend; the e2e function is not guaranteed by this PR.
 
### Does this PR introduce _any_ user-facing change?
Use `envs.VLLM_USE_V2_MODEL_RUNNER` to decide whether to choose `model_runner_v2`.
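Assuming the flag is read as an ordinary boolean environment variable (its accepted values are not stated above), opting into the v2 runner might look like:

```shell
# Hypothetical: select model_runner_v2 via the env switch named above
export VLLM_USE_V2_MODEL_RUNNER=1
```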

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-12-18 15:51:54 +08:00
ming1212
9268ad11e3 Qwen3-Next:Update the gpu-memory-utilization parameter to 0.7 (#5129)
### What this PR does / why we need it?
Update the gpu-memory-utilization parameter to 0.7
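As a sketch, the updated setting from this change (the model name is a placeholder; only `--gpu-memory-utilization 0.7` comes from the description above):

```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --gpu-memory-utilization 0.7
```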

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-18 15:16:33 +08:00
TingW09
879ec2d1c4 [Doc] add qwen3 reranker (#5086)
### What this PR does / why we need it?
add qwen3 reranker tutorials
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0

---------

Signed-off-by: TingW09 <944713709@qq.com>
2025-12-18 10:54:07 +08:00
lilinsiman
3f7a2fba70 [main][doc] Instructions for using permissions added to docker (#5092)
### What this PR does / why we need it?
Instructions for using permissions added to docker

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-17 15:26:09 +08:00
ZixuanWang
b1a853b0f6 Upgrade vllm commit hash to 1216 (#5053)
### What this PR does / why we need it?
Upstream vLLM PR [#30212](https://github.com/vllm-project/vllm/pull/30212)
refactored the attention backend selection interface. This PR adapts
vllm-ascend's get_attn_backend_cls to align with the new upstream
standard, ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-author: [leo-pony](mailto:nengjunma@outlook.com)
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-12-17 08:48:36 +08:00
liziyu
190ae55e9f Add a Mooncake installation tutorial for kv pool and update Mooncake installation tutorial (#5069)
### What this PR does / why we need it?
Add a Mooncake installation tutorial for kv pool and update Mooncake
installation tutorial

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-16 19:53:23 +08:00
wangxiyuan
d11b74a571 Add release note for v0.11.0 (#4918)
Add release note for v0.11.0. We'll release soon.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-16 17:31:45 +08:00
zhaomingyu13
039cc65e58 [Doc] Add user guide of speculative decoding (#5074)
### What this PR does / why we need it?
Add user guide of speculative decoding that includes n-grams, EAGLE,
MTP, and suffix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2025-12-16 17:01:44 +08:00
Li Wang
a63ef031af [Doc] Upgrade some outdated doc (#5062)
### What this PR does / why we need it?
Upgrade some outdated docs to make them run happily

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 11:48:19 +08:00
UnifiedCacheManager
195eac665b [Core][Worker] Add UCMConnector for KV Cache Offloading (#4411)
### What this PR does / why we need it?

This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.

Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends (DRAM /
NFS / local disk), depending on UCM configuration.
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.

**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.

---

### Does this PR introduce _any_ user-facing change?

Yes, but limited:

- A new `kv_connector=UCMConnector` option becomes available through the
configuration interface.
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.

This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.
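A hedged sketch of selecting the connector: the `UCMConnector` name comes from the description above, while the `--kv-transfer-config` shape, the `kv_role` value, and the model name are assumptions based on vLLM's usual KV-connector configuration:

```shell
# Illustrative only: explicitly opt into the UCM connector when serving
vllm serve deepseek-ai/DeepSeek-V3 \
  --kv-transfer-config '{"kv_connector": "UCMConnector", "kv_role": "kv_both"}'
```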

---

### How was this patch tested?

---

### Prefix Caching Benchmark

We provide preliminary measurements for TTFT (ms) under VLLM benchmark.
Tests run on 2 * Ascend 910B3, vllm-ascend 0.11.0, Tensor Parallel size
2, with UCM (Localdisk) enabled.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2025-12-16 10:53:30 +08:00
Li Wang
6063853ead [Misc] Upgrade vllm commit hash to 1215 (#5029)
### What this PR does / why we need it?
Upgrade vllm commit hash to `4429d934de3c5cc327b0d7aec8e473aeba38db90`

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 09:23:02 +08:00