587 Commits

Author SHA1 Message Date
lvjunqi
55beac9c91 [Feat]Xlite Qwen3-vl Support (#5228)
### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.

### How was this patch tested?
vLLM version: v0.12.0 

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
2025-12-22 16:30:52 +08:00
zhangyiming
dc047489c7 [Doc] Fix DeepSeek-V3.2 tutorial. (#5190)
### What this PR does / why we need it?
Fix DeepSeek-V3.2 tutorial.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-22 11:30:17 +08:00
YuhanBai
5d02eed16f [Performance] Add async exponential while model executing (#4501)
### What this PR does / why we need it?
Add a control that lets the exponential distribution operator overlap
with model execution. It is OFF by default because this feature might
not perform well on MoE models, e.g. Qwen3-30B.
Enabling async exponential overlapping provides a performance
improvement.
Overlapping the exponential operator with model execution can also
hide the performance drop introduced by the AI CPU version of the
exponential operator.
**UPDATE**: (12/12)
The overlap now uses the same stream that was introduced in PR #4908.
We moved `do_async_exponential` from `model_runner_v1.py` to
`sampler.py`.
Async exponential is now enabled through `additional_config`:
add `"enable_async_exponential": 1` to `additional_config`.
We now support **ONLY** the default exponential and the AI CPU
exponential; the old `"enable_async_exponential": 2` option has been
removed for consistency.

### Does this PR introduce _any_ user-facing change?
**YES**, this adds a new `additional_config` option:
`"enable_async_exponential": 1`.
When `enable_async_exponential` is set to 1, the async exponential is
enabled and overlaps with the model runner.
When `enable_async_exponential` is set to 0 (the default), the async
exponential is disabled, but the exponential operator still runs on a
separate stream, the one introduced in #4908.
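
As a rough illustration of the setting described above, the sketch below builds the `additional_config` payload in Python. Only the `enable_async_exponential` key and its 0/1 values come from this PR; the helper name `build_additional_config` and the JSON serialization step are assumptions made for the example.

```python
import json


def build_additional_config(enable_async_exponential: bool) -> dict:
    """Return an additional_config dict with the async exponential switch.

    Hypothetical helper for illustration; only the key name and the
    0/1 values are taken from the PR description.
    """
    return {"enable_async_exponential": 1 if enable_async_exponential else 0}


# Default behaviour per the PR: the feature is off (0).
default_cfg = build_additional_config(False)
# Opt in to overlapping the exponential operator with model execution.
enabled_cfg = build_additional_config(True)

# JSON form, e.g. for passing the config as a string (assumed usage).
print(json.dumps(enabled_cfg))
```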

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>
2025-12-20 21:23:21 +08:00
wangxiyuan
758d81dcb1 Drop 0.12.0 support (#5146)
We have decided to release v0.13.0 soon, so there is no need to keep
supporting 0.12.0. Let's drop it.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-20 09:38:53 +08:00
zzhxxx
17f2eead99 [Doc]Add the user_guide doc file regarding fine-grained TP. (#5084)
### What this PR does / why we need it?
Add user guide for **Fine-Grained Tensor Parallelism** feature.  
Documents usage, supported components (`embedding`, `lm_head`, `o_proj`,
`mlp`/`dense_ffn`), model compatibility, and deployment guidelines.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-19 16:37:25 +08:00
luluxiu520
bc05a81bf2 Add Qwen3-VL-235B-A22B-Instruct tutorials (#5167)
### What this PR does / why we need it?

This PR provides an introduction to the Qwen3-VL-235B-A22B-Instruct
model, details on the features supported by the model in the current
version, the model deployment process, as well as methods for
performance testing and accuracy testing.

With this document, the deployment and testing of the
Qwen3-VL-235B-A22B-Instruct model can be implemented more easily.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: luluxiu520 <l2625793@outlook.com>
2025-12-19 14:56:17 +08:00
Li Wang
5ab6d124e5 [Doc] Add a perf tune section (#5127)
### What this PR does / why we need it?
This patch aims to:
1. Add a section on OS-level tuning to the performance tuning doc.
2. Set some default environment variables in the image for performance.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 14:52:52 +08:00
1092626063
f952de93df 【Doc】Deepseekv3.1/R1 doc enhancement (#4827)
### What this PR does / why we need it?

DeepSeek-V3.1 and DeepSeek-R1 doc enhancement

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-19 10:52:33 +08:00
weichen
ca6f631cba [2/N][Pangu][MoE] Remove Pangu Related Code (#5130)
### What this PR does / why we need it?
Remove Pangu Related Code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-19 09:00:07 +08:00
zxr2333
073a3a6e6c [Doc][P/D] Fix MooncakeConnector's name (#5172)
### What this PR does / why we need it?
vLLM community has integrated their MooncakeConnector. The original
scripts will now find this MooncakeConnector instead of the one from
vLLM-Ascend. All scripts that involve using the MooncakeConnector need
to be modified to another name.

### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector.

### How was this patch tested?
By CI.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-12-18 22:29:19 +08:00
Li Wang
7d32371b7e [Doc] Refact benchmark doc (#5173)
### What this PR does / why we need it?
Refactor some outdated doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-18 22:26:13 +08:00
wangxiyuan
0f571c347b Nominate new maintainers @zzzzwwjj @realliujiaxu @LCAIZJ (#5152)
I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend
committer team.

@zzzzwwjj
---
- Review Quality‌:
He has completed 80+ reviews since April 2025, including high-quality
reviews such as
https://github.com/vllm-project/vllm-ascend/pull/3232#issuecomment-3506110786,
https://github.com/vllm-project/vllm-ascend/pull/4822#discussion_r2601661204,
and
https://github.com/vllm-project/vllm-ascend/pull/4768#issuecomment-3644795995.

- Sustained Contributions
15+ valuable bug fixes and refactors.

https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved
Continuous optimization of code architecture

https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged

- Quality Contribution‌:
https://github.com/vllm-project/vllm-ascend/pull/1229
https://github.com/vllm-project/vllm-ascend/pull/1979
https://github.com/vllm-project/vllm-ascend/pull/4359
https://github.com/vllm-project/vllm-ascend/pull/4878

- Community Involvement‌: 
He led https://github.com/vllm-project/vllm-ascend/issues/1147, the
first refactor of AscendFusedMoE.
He shared topics about large-scale distributed inference and
reinforcement learning at the vLLM-Ascend meetup on August 2nd.

@realliujiaxu
---
- Review Quality‌:
He has completed [40+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+)
since September, including
https://github.com/vllm-project/vllm-ascend/pull/4868#discussion_r2605549015
and
https://github.com/vllm-project/vllm-ascend/pull/2275#discussion_r2268455665.

- Sustained Contributions
He has completed [17
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged),
continuously optimizing the performance of the MoE model.

- Quality Contribution‌:

Contributed the Flash Comm1 feature to the community, supporting both
eager and aclgraph execution modes while remaining compatible with
multiple MoE models, including DeepSeek and GLM4.5.
  - https://github.com/vllm-project/vllm-ascend/pull/3334
  - https://github.com/vllm-project/vllm-ascend/pull/3420
  - https://github.com/vllm-project/vllm-ascend/pull/3015
  
  co-author:
  - https://github.com/vllm-project/vllm-ascend/pull/3495
  - https://github.com/vllm-project/vllm-ascend/pull/4868

- Community Involvement‌: 
1. Completed two major refactors, enabling vllm-ascend to evolve more
rapidly and robustly: [Linear
module](https://github.com/vllm-project/vllm-ascend/pull/2867) and
[rejection
sampler](https://github.com/vllm-project/vllm-ascend/pull/4975)
2. [fixed 8
bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+)
in graph mode, spec decoding and async scheduling.

@LCAIZJ
---
- Review Quality‌: He's been the go-to reviewer for virtually all PD
disaggregation and KV Pool related PRs, having completed [30+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+)
since May 2025. Notable examples include
[discussion_r2553887360](https://github.com/vllm-project/vllm-ascend/pull/4345#discussion_r2553887360),
[issuecomment-3540994801](https://github.com/vllm-project/vllm-ascend/pull/4161#issuecomment-3540994801),
and
[discussion_r2492593988](https://github.com/vllm-project/vllm-ascend/pull/3981#discussion_r2492593988),
all demonstrating thorough and insightful feedback.
- Sustained and Quality Contributions: His contributions reflect a
strong grasp of both ‌vLLM‌ and ‌vLLM Ascend‌ codebases, particularly in
prefill-decode disaggregation and KV pool areas ([7 PRs
merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)).
Prefill-Decode Disaggregation: Delivered KV transfer functionality using
Mooncake TransferEngine and enabled layerwise KV transfer
https://github.com/vllm-project/vllm-ascend/pull/1568
https://github.com/vllm-project/vllm-ascend/pull/2602
KV Pool: Developed the foundational KV Pool infrastructure and migrated
it to the latest ADXL stack
https://github.com/vllm-project/vllm-ascend/pull/2913
https://github.com/vllm-project/vllm-ascend/pull/3350
- Quality Contribution‌:
https://github.com/vllm-project/vllm-ascend/pull/1568
https://github.com/vllm-project/vllm-ascend/pull/2602
https://github.com/vllm-project/vllm-ascend/pull/2913
https://github.com/vllm-project/vllm-ascend/pull/3350
- Community Involvement‌: 
He actively responds to [community
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ),
continuously monitors functionality and accuracy issues related to PD
disaggregation and KV Pool, and proactively delivers [bug
fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix).
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 18:49:07 +08:00
Ronald
b69b04d3a9 implement model runner v2 basic framework (#5051)
### What this PR does / why we need it?
This PR aims to implement the basic framework of model runner v2 in
vllm-ascend; end-to-end functionality is not guaranteed by this PR.
 
### Does this PR introduce _any_ user-facing change?
Use `envs.VLLM_USE_V2_MODEL_RUNNER` to decide whether to use model_runner_v2.
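
The switch described above might be read along the following lines. This is a minimal sketch, not the code from this PR: the helper names and the parsing rule (treating `1` or `true` as enabled) are assumptions, and the real parsing lives in vLLM's `envs` module and may differ.

```python
import os


def use_v2_model_runner(environ=None) -> bool:
    """Hypothetical reader for VLLM_USE_V2_MODEL_RUNNER.

    Assumption: "1" or "true" (case-insensitive) enables the v2 runner;
    anything else, including an unset variable, keeps the v1 runner.
    """
    env = os.environ if environ is None else environ
    value = env.get("VLLM_USE_V2_MODEL_RUNNER", "0")
    return value.strip().lower() in ("1", "true")


def pick_runner_name(environ=None) -> str:
    # Runner names are illustrative labels, not import paths.
    if use_v2_model_runner(environ):
        return "model_runner_v2"
    return "model_runner_v1"


print(pick_runner_name({"VLLM_USE_V2_MODEL_RUNNER": "1"}))
```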

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-12-18 15:51:54 +08:00
ming1212
9268ad11e3 Qwen3-Next:Update the gpu-memory-utilization parameter to 0.7 (#5129)
### What this PR does / why we need it?
Update the gpu-memory-utilization parameter to 0.7

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-18 15:16:33 +08:00
TingW09
879ec2d1c4 [Doc] add qwen3 reranker (#5086)
### What this PR does / why we need it?
add qwen3 reranker tutorials
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0

---------

Signed-off-by: TingW09 <944713709@qq.com>
2025-12-18 10:54:07 +08:00
lilinsiman
3f7a2fba70 [main][doc] Instructions for using permissions added to docker (#5092)
### What this PR does / why we need it?
Instructions for using permissions added to docker

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-17 15:26:09 +08:00
ZixuanWang
b1a853b0f6 Upgrade vllm commit hash to 1216 (#5053)
### What this PR does / why we need it?
Upstream vLLM PR #30212 (https://github.com/vllm-project/vllm/pull/30212)
refactored the attention backend selection interface. This PR adapts
vllm-ascend's `get_attn_backend_cls` to align with the new upstream
standard, ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

co-author: leo-pony <nengjunma@outlook.com>
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-12-17 08:48:36 +08:00
liziyu
190ae55e9f Add a Mooncake installation tutorial for kv pool and update Mooncake installation tutorial (#5069)
### What this PR does / why we need it?
Add a Mooncake installation tutorial for kv pool and update Mooncake
installation tutorial

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-16 19:53:23 +08:00
wangxiyuan
d11b74a571 Add release note for v0.11.0 (#4918)
Add release note for v0.11.0. We'll release soon.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-16 17:31:45 +08:00
zhaomingyu13
039cc65e58 [Doc] Add user guide of speculative decoding (#5074)
### What this PR does / why we need it?
Add user guide of speculative decoding that includes n-grams, EAGLE,
MTP, and suffix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2025-12-16 17:01:44 +08:00
Li Wang
a63ef031af [Doc] Upgrade some outdated doc (#5062)
### What this PR does / why we need it?
Upgrade some outdated docs so that their examples run correctly.

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 11:48:19 +08:00
UnifiedCacheManager
195eac665b [Core][Worker] Add UCMConnector for KV Cache Offloading (#4411)
### What this PR does / why we need it?

This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.

Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends (DRAM
/ NFS / Localdisk, depending on UCM configuration).
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.

**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.

---

### Does this PR introduce _any_ user-facing change?

Yes, but limited:

- A new `kv_connector=UCMConnector` option becomes available through the
configuration interface.
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.
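
As an illustration of the opt-in behaviour listed above, the sketch below serializes a KV-transfer configuration that selects the connector. The `kv_connector` key and the `UCMConnector` name come from this PR; the `kv_role` key, its value, and the helper name are assumptions, so check the connector documentation before relying on them.

```python
import json


def ucm_kv_transfer_config(kv_role: str = "kv_both") -> str:
    """Serialize a KV-transfer config that selects UCMConnector.

    "kv_connector" and the connector name follow the PR text; the
    "kv_role" key and its default value are assumptions for illustration.
    """
    config = {"kv_connector": "UCMConnector", "kv_role": kv_role}
    return json.dumps(config)


# Nothing changes unless a config like this is passed explicitly.
print(ucm_kv_transfer_config())
```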

This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.

---

### How was this patch tested?

---

### Prefix Caching Benchmark

We provide preliminary measurements for TTFT (ms) under the vLLM benchmark.
Tests run on 2 * Ascend 910B3, vllm-ascend 0.11.0, Tensor Parallel size
2, with UCM (Localdisk) enabled.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2025-12-16 10:53:30 +08:00
Li Wang
6063853ead [Misc] Upgrade vllm commit hash to 1215 (#5029)
### What this PR does / why we need it?
Upgrade vllm commit hash to `4429d934de3c5cc327b0d7aec8e473aeba38db90`

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 09:23:02 +08:00
InSec
a5cb8e40f5 [doc]Modify quantization tutorials (#5026)
### What this PR does / why we need it?
Modify the quantization tutorials Qwen3-32B-W4A4.md and
Qwen3-8B-W4A8.md to correct a few mistakes:
Qwen3-8B-W4A8: one idle NPU card needs to be set.
Qwen3-32B-W4A4: two idle NPU cards need to be set for the flatquant
training, and the calib_file path, which did not match the ModelSlim
version, is corrected.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: IncSec <1790766300@qq.com>
2025-12-15 20:12:06 +08:00
Li Wang
8d2998d0e4 [Misc] Upgrade vllm hash to 12_14 (#5000)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
1. fix https://github.com/vllm-project/vllm/pull/27938
2. fix https://github.com/vllm-project/vllm/pull/27145
pooling models now support chunked prefill and prefix caching.
3. fix https://github.com/vllm-project/vllm/pull/30181
define the CPU fields in the field config where they really belong.
4. fix https://github.com/vllm-project/vllm/pull/28168
define the CPU fields in the field config where they really belong.
5. fix https://github.com/vllm-project/vllm/pull/30201
rename some modules
6. fix https://github.com/vllm-project/vllm/pull/29067
fusedmoe module refactor
7. fix https://github.com/vllm-project/vllm/pull/29066
fusedmoe module refactor
8. fix https://github.com/vllm-project/vllm/pull/29624
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-15 19:54:23 +08:00
fluctlux
6de4bedd04 update release note for suffix decoding (#5009)
update release note for suffix decoding

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
2025-12-15 17:22:19 +08:00
Chao Lei
b75bfc58f6 [Doc ] Supplement kvpool user guide (#5013)
### What this PR does / why we need it?
Supplement detailed descriptions for `ASCEND_CONNECT_TIMEOUT` and
`ASCEND_TRANSFER_TIMEOUT` in kvpool.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-15 14:24:39 +08:00
ming1212
98b9e2e18e Add Qwen3-Next tutorials (#4607)
### What this PR does / why we need it?

This PR provides an introduction to the Qwen3-Next model, details on the
features supported by the model in the current version, the model
deployment process, as well as methods for performance testing and
accuracy testing.

With this document, the deployment and testing of the Qwen3-Next model
can be implemented more easily.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-15 11:48:22 +08:00
Li Wang
2497bbbaf6 [Misc] Update pooling example (#5002)
### What this PR does / why we need it?
Since the `task` parameter has been deprecated, we should use the latest
unified standard parameters for pooling models, which is clearer.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-15 08:36:19 +08:00
wangxiyuan
8090914d69 [CI] CI refactor (#4928)
1. rename workflow to better name
2. fix lint error
3. remove accuracy report doc and test

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-14 11:09:56 +08:00
wangxiyuan
42ceaf08a1 add release note for 0.12.0 (#4995)
Add release note for v0.12.0rc1
Update deepseek3.2 tutorial doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-13 22:09:59 +08:00
lilinsiman
31c94b7e7b [doc][main] Correct more doc mistakes (#4958)
### What this PR does / why we need it?
Correct more doc mistakes

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-13 18:36:58 +08:00
lilinsiman
fc818f1509 [doc][main] Correct mistakes in doc (#4945)
### What this PR does / why we need it?
Correct mistakes in doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-12 19:17:10 +08:00
liziyu
716c4dacfe update qwen2.5vl readme (#4938)
### What this PR does / why we need it?
Fix the qwen2.5vl README: delete the ranktable generation step and add Mooncake installation instructions.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-12-12 15:40:07 +08:00
Li Wang
4ae7588c52 [Doc] Upgrade outdated doc (#4957)
### What this PR does / why we need it?
Fixed some issues that made the sleep mode document unusable because of
changed or outdated environment variables.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-12 15:38:29 +08:00
1092626063
62a9fea7af 【doc】Add model feature matrix (#4950)
### What this PR does / why we need it?

Add a model feature matrix to the doc tutorials for:
DeepSeekR1
DeepSeekV3.1
Qwen3-Dense
Qwen3-Moe
Qwen3-Next
Qwen2.5
Qwen2.5-VL
Qwen3-VL

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-12 15:37:39 +08:00
lidenghui1110
d65fb194d9 [Feat] Add custom Embedding tensor model parallel (#2616)
Similar to #2309, this PR introduces Embedding tensor model parallelism
to reduce memory consumption. It supports both eager mode and graph
mode.

This PR also refactors the module tensor parallel configurations
supported in #2309, #2167, and #2120, merging all of them into
`finegrained_tp_config` in `additional_config`, including:
`lmhead_tensor_parallel_size`
`oproj_tensor_parallel_size`
`embedding_tensor_parallel_size`
`mlp_tensor_parallel_size`
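
The merged configuration could look roughly like the sketch below. The four key names and the nesting under `finegrained_tp_config` inside `additional_config` follow the description above; the sizes are placeholder values, and `validate_tp_sizes` is a hypothetical check added only for this example.

```python
import json

# Key names come from the PR text; the sizes are placeholders.
finegrained_tp_config = {
    "lmhead_tensor_parallel_size": 2,
    "oproj_tensor_parallel_size": 2,
    "embedding_tensor_parallel_size": 2,
    "mlp_tensor_parallel_size": 2,
}
additional_config = {"finegrained_tp_config": finegrained_tp_config}


def validate_tp_sizes(cfg: dict) -> None:
    """Reject non-integer or non-positive sizes (illustrative check only)."""
    for name, size in cfg.items():
        if not isinstance(size, int) or size < 1:
            raise ValueError(f"{name} must be a positive integer, got {size!r}")


validate_tp_sizes(finegrained_tp_config)
print(json.dumps(additional_config, indent=2))
```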

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-12 14:41:20 +08:00
wangxiyuan
e538fa6f9c [Doc] Update tutorial index (#4920)
Update tutorial index and remove useless doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 20:53:13 +08:00
Shanshan Shen
551069e53a [Doc] Update structured output doc with upstream link (#4015)
### What this PR does / why we need it?
Currently, the usage of structured output feature in vllm-ascend is
totally the same as that in vllm.

Thus, IMO, it's better to remove this doc directly. Otherwise, when the
upstream doc changes and we don't update ours in time, it can mislead
users.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-11 19:14:29 +08:00
yangxiaoman8
e1bb6f47ec [doc] Add Qwen2.5 tutorials (#4636)
### What this PR does / why we need it?
Add Qwen2.5 tutorial

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: yangshihao6 <yangshihao6@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 17:30:05 +08:00
wangxiyuan
bb76f7962c cleanup useless torchair logic (#4856)
This PR cleans up useless torchair logic in the model runner. The moge
doc is only for torchair, so it can be removed as well.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-11 11:21:13 +08:00
zhangyiming
c95c271538 [E2E] Optimize nightly testcase. (#4886)
### What this PR does / why we need it?
Optimize nightly testcase.
Changes:
- tests/e2e/nightly/multi_node/config/models/Qwen3-235B-A3B.yaml: Add
accuracy and performance benchmark
- tests/e2e/models/configs/Qwen3-8B-Base.yaml: Delete
- tests/e2e/models/configs/internlm-7b.yaml: Change to
internlm3-8b-instruct
- tests/e2e/nightly/models/test_deepseek_r1_w8a8_eplb.py: Change to
DeepSeek-R1-0528-W8A8 model

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-11 10:15:39 +08:00
zhangyiming
66b0781840 [E2E] Refactor the e2e testcases. (#4789)
### What this PR does / why we need it?
Refactor the e2e testcases.
- tests/e2e/multicard/test_weight_loader.py: Remove the unused code.
- tests/e2e/singlecard/multi-modal/test_internvl.py: Move to accuracy
test.
- tests/e2e/singlecard/test_aclgraph.py: Rename the file.
- tests/e2e/singlecard/test_embedding_aclgraph.py : Combine with
tests/e2e/singlecard/test_bge_model.py
- tests/e2e/singlecard/test_completion_with_prompt_embeds.py: Delete
eager mode and modify model to Qwen3-0.6B
- tests/e2e/singlecard/test_quantization.py: Modify model to
Qwen3-0.6B-W8A8
- tests/e2e/singlecard/test_vlm.py: Modify model to Qwen3-VL-8B

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-11 10:15:00 +08:00
Nengjun Ma
0eefbe75b6 [Doc] Add local running multi-node nightly test case guide (#4884)
### What this PR does / why we need it?
Add a guide for running multi-node nightly test cases locally, to help
developers run them in their own environment.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tested by running the multi-node test locally.
Following this document, the multi-node nightly e2e test can be started
successfully in a local environment.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-12-11 08:56:27 +08:00
SILONG ZENG
ff7d703192 [Doc]Add tutorial document for qwen-VL-Dense (#3516)
### What this PR does / why we need it?
This document employs the qwen3-vl-8b model and qwen2.5-vl-32b to
demonstrate the primary verification steps for the Qwen-VL series dense
models, including supported features, feature configuration, environment
preparation, NPU deployment, and accuracy and performance evaluation.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-12-11 08:55:23 +08:00
Leaf
89a8607b30 add DeepSeek-R1 tutorial. (#4666)
### What this PR does / why we need it?

This PR adds tutorials for the DeepSeek-R1 series models, including the
A2 and A3 series, and provides accuracy validation results.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Gongdayao <gongdayao@foxmail.com>
2025-12-11 08:52:27 +08:00
wangxiyuan
37db0844f5 Remove COMPILE_CUSTOM_KERNELS env (#4864)
With more and more custom ops merged, disabling `COMPILE_CUSTOM_KERNELS`
for vllm ascend seems useless now. Let's enable csrc compile by default.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-10 23:48:03 +08:00
wangxiyuan
c77dca54b2 [CI] fix lint (#4888)
Fix lint CI error

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-10 16:57:24 +08:00
wind-all
1a443f2772 add multi_npu_qwen3_dense tutorials (#4543)
### What this PR does / why we need it?

This PR adds tutorials for the Qwen3-Dense series models, including the
A2 and A3 series, and provides accuracy validation results.



- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wind-all <anyuting@h-partners.com>
2025-12-10 16:09:56 +08:00
Ruri
ce5872705e [Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516)
### What this PR does / why we need it?

Adds W4A16 quantization method for the Kimi-K2-Thinking model and
updates relevant modules to support the new quantization method.

- Implements complete W4A16 quantization method including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic and MoE method application.
- Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts
`with_quant` conditional logic to support W4A16 matrix multiplication.
- Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and
processing logic for `weight_packed` field.
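
To make the terms in the list above concrete, here is a generic, pure-Python sketch of 4-bit weight packing/unpacking with a per-group scale. It is not the implementation from this PR: it uses a simple symmetric scheme without offsets, and every function name and the nibble order are assumptions chosen for the example.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-group quantization to signed int4 in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_int4(values: List[int]) -> bytes:
    """Pack pairs of signed int4 values into bytes, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of int4 values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: recover the signed int4 values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map int4 values back to approximate floating-point weights."""
    return [v * scale for v in q]


group = [0.7, -0.35, 0.1, -0.7]  # one quantization group (toy data)
q, scale = quantize_group(group)
packed = pack_int4(q)            # 4 int4 values fit in 2 bytes
restored = dequantize(unpack_int4(packed), scale)
```

Each group keeps its own scale, so a single outlier only affects the precision of the weights in its group, which is the usual motivation for per-group parameters.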

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
2025-12-10 15:58:52 +08:00