Commit Graph

1753 Commits

Author SHA1 Message Date
Levi
df7e0fe916 [Bugfix] qwen3-vl-235b-w8a8 load weight ERROR when start service (#4292)
### What this PR does / why we need it?
fix qwen3-vl-w8a8 load weight ERROR when start service
0.12.0rc1 can start qwen3-vl-235b-w8a8 by adding this PR

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-15 16:39:58 +08:00
knight0528
e25c57b346 [Bugfix] Add support for PP intermediate value types in graph mode (#4902)
This PR adds support for handling intermediate value types in pipeline
parallelism when running in graph mode.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhangshushun <3265779424@qq.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 16:27:17 +08:00
zzhxxx
e16444f21f [Bugfix] Fix the bug in initializing the shared_weight communication domain in sfa-cp, and fix the mtp weight load in pp>1 situation (#4913)
### What this PR does / why we need it?
In PR #4188, a small bug was introduced that caused sfa-cp to be unable
to find the global_pp_size parameter during initialization, and this PR
fixed the issue.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 16:21:49 +08:00
SILONG ZENG
70606e0bb9 [Test]update accuracy test of models (#4911)
### What this PR does / why we need it?
Delete accuracy tests for models that are no longer retained:
- Meta-Llama-3.1-8B-Instruct
- llava-1.5-7b-hf
- InternVL2-8B.yaml
- InternVL2_5-8B.yaml
- InternVL3-8B.yaml

Add accuracy tests for the new models:
- Llama-3.2-3B-Instruct
- llava-onevision-qwen2-0.5b-ov-hf
- Qwen3-VL-30B-A3B-Instruct

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-12-15 15:04:20 +08:00
Chao Lei
b75bfc58f6 [Doc ] Supplement kvpool user guide (#5013)
### What this PR does / why we need it?
Supplement detailed descriptions for `ASCEND_CONNECT_TIMEOUT` and
`ASCEND_TRANSFER_TIMEOUT` in kvpool.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-15 14:24:39 +08:00
Chen Chen
aa02a85e4d [bugfix] Fix dummy-run and multi-node issues in MoE routing and MTP (#4947)
### What this PR does / why we need it?

- Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the
routing kernel completes correctly instead of exiting early in certain
paths.
- Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher
and prepare/finalize routines, aligning MoE communication with the MC2
algorithm optimized for Ascend devices.
- Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back
to `ALLTOALL` until the MoE communication type selection logic is fully
finalized, avoiding incorrect behavior in dummy-run flows.
- Simplify the MoE communication selection for Ascend 910-93 in
`NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`,
which fixes failures in multi-node / larger-EP configurations while
keeping MC2 routing under the configured token capacity.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: mojave2 <chenchen145@huawei.com>
2025-12-15 14:18:23 +08:00
dependabot[bot]
cc7b302020 Bump actions/upload-artifact from 5 to 6 (#5014)
Bumps
[actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 6.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-15 14:13:06 +08:00
drslark
8fb0ef5ffa [main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#4932)
### What this PR does / why we need it?
Fixes an accuracy bug of Qwen3-next-MTP when batched inferring.
It is descibed in
https://github.com/vllm-project/vllm-ascend/issues/4930.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: drslark <slarksblood@qq.com>
2025-12-15 13:22:30 +08:00
wujinyuan1
545e856971 [Refactor]3/N Refactor mla_v1.py & extract mla_cp (#4933)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
The functions related to Cp differ significantly from those of normal
MLA-Attention, but the coupling is quite severe.

Steps:
Isolate PCP and DCP
(1) create a new python file: mla_cp.py
(2) add classes AscendMlaCPImpl and
AscendMlaCPMetadataBuilder,Inheritance AscendMLAImpl and
AscendMLAMetadataBuilder
(3) Remove PCP and DCP-related methods from mla_v1.py to mla_cp.py

vLLM version: v0.12.0

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-15 12:59:18 +08:00
ming1212
98b9e2e18e Add Qwen3-Next tutorials (#4607)
### What this PR does / why we need it?

This PR provides an introduction to the Qwen3-Next model, details on the
features supported by the model in the current version, the model
deployment process, as well as methods for performance testing and
accuracy testing.

With this document, the deployment and testing of the Qwen3-Next model
can be implemented more easily.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-15 11:48:22 +08:00
Mengqing Cao
6beb4434e1 [CI][Bugfix] Fix scheduleroutput has no attr get error in prompt logprobs (#4998)
### What this PR does / why we need it?
Fix scheduleroutput has no attr get error in prompt logprobs

Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/20194753373/job/57977131870

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-15 11:10:39 +08:00
Li Wang
2497bbbaf6 [Misc] Update pooling example (#5002)
### What this PR does / why we need it?
Since the param `task` has been depprecated, we should use the latest
unified standard parameters for pooling models, this should be more
clear


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-15 08:36:19 +08:00
LookAround0301
bb7b74c14f add ut for model runner (#4991)
### What this PR does / why we need it?
add ut for model runner

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-12-14 11:16:20 +08:00
wangxiyuan
8090914d69 [CI] CI refactor (#4928)
1. rename workflow to better name
2. fix lint error
3. remove accuracy report doc and test

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-14 11:09:56 +08:00
AlvisGong
ba28d54f35 [Perf]enable prefill flashcommon3 (#4065)
### What this PR does / why we need it?
moe multistream overlap to improve the performance.

### How was this patch tested?
--additional-config '{"multistream_overlap_gate": true}'

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2025-12-14 09:34:13 +08:00
Yizhou
0686b32d82 [Fix] Fixes issues in MTP with async scheduling and ACL graph (#4963)
### What this PR does / why we need it?
Corrects attention metadata size for MTP when both asynchronous
scheduling and full ACL graph mode are enabled. This prevents potential
size mismatches during execution.

Additionally, improves the robustness of calculating token sample
indices by explicitly aligning tensor shapes.

Finally, prevents padding when the number of input tokens exceeds the
maximum ACL graph batch size to avoid out-of-bounds errors.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Need to add corresponding test case ASAP.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-14 00:10:11 +08:00
wangxiyuan
42ceaf08a1 add release note for 0.12.0 (#4995)
Add release note for v0.12.0rc1
Update deepseek3.2 tutorial doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-13 22:09:59 +08:00
Li Wang
0f92d34a70 [CI] Pull latest vllm-ascend src before tests (#4988)
### What this PR does / why we need it?
Currently, our image build suffers from errors during cross-compilation,
which causing the image to fail to build sometimes(see
https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186).
This results in the nightly test code not being the latest version.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-13 19:04:14 +08:00
wangxiyuan
fd7c929145 [perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks (#4983)
pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix
the merge conflict

### What this PR does / why we need it?
Currently, the all_reduce operation in _sync_metadata_across_dp is
performed with gloo backend which is extremely time-consuming when
DPEngineCores are in different nodes. This operation cannot be ignored
by async scheduling in multi-node-scenarios with speculative decoding
(e.g., EAGLE, mtp).

This pr eliminates the all_reduce operation for D Nodes and change the
input parameter of MoEDispatch & MoeCombine operators to make MC2EP
support different num_tokens across all ranks.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios
while enabling async scheduling. This pr can remove cross-node
all_reduce with gloo backend and further reduce latency with correct
accuracy.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
2025-12-13 18:59:54 +08:00
wangxiyuan
5211e991ad Revert "[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892)" (#4981)
This reverts commit 332b547728.

This break deepseek3.2 in PD case.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
2025-12-13 18:58:55 +08:00
lilinsiman
31c94b7e7b [doc][main] Correct more doc mistakes (#4958)
### What this PR does / why we need it?
Correct more doc mistakes

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-13 18:36:58 +08:00
zhenwenqi2024
4721e4f53f [bugfix] asyncscheduler bug fix (#4968)
### What this PR does / why we need it?
now vllm-ascend uses AsyncGPUModelRunnerOutput
,AsyncNPUModelRunnerOutput before is outdated, so we should fix it

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2025-12-13 17:04:54 +08:00
realliujiaxu
3581946256 [Bugfix] fix eagle proposer (#4971)
### What this PR does / why we need it?
After https://github.com/vllm-project/vllm-ascend/pull/4764, a lot of
tensor created by `make_buffer` should be renamed, like `input_ids` ->
`input_ids.gpu`.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-12-12 22:39:49 +08:00
Jade Zheng
45889a6185 [Bugfix] Pass vllm_config to kv_connector_no_forward in NPUModelRunner (#4970)
### What this PR does / why we need it?

The newest version crashes in PD separation scenarios because the
function is missing the `vllm_config` parameter.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-12 22:36:23 +08:00
MengLong Chen
fa367e3b1a [CI] Add mtp_proposer ut (#4397)
### What this PR does / why we need it?
Add mtp_proposer ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-12-12 20:41:31 +08:00
lilinsiman
fc818f1509 [doc][main] Correct mistakes in doc (#4945)
### What this PR does / why we need it?
Correct mistakes in doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-12 19:17:10 +08:00
zhenwenqi2024
f708d919f8 [Feature] model_runner refactor (#4764)
### What this PR does / why we need it?
refactor npu_modelrunner, we should be close to gpu_modelrunner 

### Does this PR introduce _any_ user-facing change?
NO

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
2025-12-12 17:27:09 +08:00
Li Wang
5b12c068f9 [Nightly] Remove gen_ranktable logic (#4941)
### What this PR does / why we need it?
Since the `llmdatadist` has sunset, the logic gen_ranktable should also
be removed

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-12 17:20:18 +08:00
lty
0cdf98ac48 [usability]Modify the default value of the protocol to ascend (#4959)
### What this PR does / why we need it?
The recommended configuration in the document kv_pool.md is ascend.
Modify the default value of the protocol to ascend,Improve usability

#### 1.Configure mooncake.json

The environment variable **MOONCAKE_CONFIG_PATH** is configured to the
full path where mooncake.json is located.

```
{
    "local_hostname": "xx.xx.xx.xx",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "alloc_in_same_node": true,
    "master_server_address": "xx.xx.xx.xx:50088",
    "global_segment_size": "1GB" (1024MB/1048576KB/1073741824B/1073741824)
}
```

**local_hostname**: Configured as the IP address of the current master
node.
**metadata_server**: Configured as **P2PHANDSHAKE**.  
**protocol:** Configured for Ascend to use Mooncake's HCCL
communication.
**device_name**: ""  
**alloc_in_same_node**: Indicator for preferring local buffer allocation
strategy.
**master_server_address**: Configured with the IP and port of the master
service.
**global_segment_size**: Expands the kvcache size registered by the PD
node to the master.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Mooncake does not set up a protocol to launch the pooled VLLM service;
test whether the pooling function is working.

Signed-off-by: lty <linhebiwen@gmail.com>
2025-12-12 16:56:18 +08:00
wangyao-i
0983c5510a vllm-ascend support Ascend950 with Qwen dense model. (#4228)
### What this PR does / why we need it?
vllm-ascend support Ascend950 with Qwen dense model
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangyao <iwangyao@outlook.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-12 15:50:57 +08:00
liziyu
716c4dacfe update qwen2.5vl readme (#4938)
### What this PR does / why we need it?
fix qwen2.5vl readme, del gen ranktable and add install mooncake


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-12-12 15:40:07 +08:00
Li Wang
4ae7588c52 [Doc] Upgrade outdated doc (#4957)
### What this PR does / why we need it?
Updated some issues that caused sleep mode document content to be
unavailable due to changes/outdated environment variables.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-12 15:38:29 +08:00
1092626063
62a9fea7af 【doc】Add model feature matrix (#4950)
### What this PR does / why we need it?

doc tutorials add  model feature matrix:
DeepSeekR1
DeepSeekV3.1
Qwen3-Dense
Qwen3-Moe
Qwen3-Next
Qwen2.5
Qwen2.5-VL
Qwen3-VL

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-12 15:37:39 +08:00
zhangxinyuehfad
cf801fdbbb [CI] fix light test (#4954)
### What this PR does / why we need it?
fix light test

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-12 15:24:04 +08:00
Mercykid-bash
84b9d38e28 BugFix: Resolve PolicyFlashlb warm up function attribute error (#4741)
## Description
Fix the AttributeError caused by incorrect invocation of the warm-up
function in the FlashLB algorithm:
1. **Root Cause**: The warm-up function for FlashLB is defined outside
the `PolicyFlashlb` class (not a class method), but the code incorrectly
attempted to call it via the `PolicyFlashlb` class instance.
2. **Key Fix**: Clarify the invocation rule for FlashLB: when selecting
the FlashLB algorithm, the warm-up function must be called in advance to
precompile and warm up the algorithm (invoked as a standalone function),
instead of calling it through the `PolicyFlashlb` class.
3. **Impact**: Resolve the runtime error when using FlashLB, ensure the
algorithm pre-compilation/warm-up process works as expected, and avoid
attribute missing exceptions.

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
2025-12-12 14:55:26 +08:00
Clorist33
4984e8a284 [Bugfix] bugfix for moe_mlp (#4822)
### What this PR does / why we need it?
This PR fixes a bug in the moe_mlp module by correcting the arguments
passed to the torch_npu.npu_dequant_swiglu_quant function.It properly
converts group_list from a cumulative sum to counts for the group_index
parameter.

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: tanqingshan (A)  <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
2025-12-12 14:51:20 +08:00
lidenghui1110
d65fb194d9 [Feat] Add custom Embedding tensor model parallel (#2616)
Similar to #2309 , this PR introduces Embedding tensor model parallel to
achieve decreasing of memory consumption. It support both eager mode and
graph mode.

And this PR refactor module tensor parallel configurations supported in
#2309, #2167, #2120, merge all config into `finegrained_tp_config` in
`additional_config`, including:
`lmhead_tensor_parallel_size`
`oproj_tensor_parallel_size`
`embedding_tensor_parallel_size`
`mlp_tensor_parallel_size`

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-12 14:41:20 +08:00
Slightwind
b8a317caac [main][Bugfix] Remove the ZMQ communication setup on the D node (#4926)
In the PD separation scenario, the D node does not need to perform get
operations, and therefore does not need to create ZeroMQ (ZMQ)
communication.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-12-12 14:37:26 +08:00
weichen
d54db76dd2 [MoE][TorchAir] Remove FusedMoEState (#4927)
### What this PR does / why we need it?
Remove FusedMoEState which is used by torchair.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-12 09:12:24 +08:00
zhangxinyuehfad
bfafe30953 [CI] refect e2e test (#4799)
### What this PR does / why we need it?
This PR updates the CI configuration and adjusts a set of end-to-end
(e2e) tests under tests/e2e/multicard, in order to refactor the test
suite and ensure compatibility with current codebase and CI workflows.

1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B
and rename the test case
2. tests/e2e/multicard/test_quantization.py: rename the test case
3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and
rename test cases
4. tests/e2e/multicard/test_qwen3_next.py: rename test cases and change
the W8A8 pruning model to the W8A8 model and remove the eager parameter
5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and
remove the eager parameter
6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case
and change Qwen3-30B to Qwen3-0.6B
7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases
about torchair

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-12 08:42:08 +08:00
weijinqian0
a6ef3ac4e4 [Performance] Pre-issued exponential distribution operator. (#4908)
Pre-issued exponential distribution operator.

Result:
Single inference saves 200-300 microseconds.
before:

<img width="2257" height="1058" alt="2"
src="https://github.com/user-attachments/assets/c1da19e2-a439-42cb-9d7c-c0218e61fd4c"
/>

After:

<img width="2211" height="342" alt="image"
src="https://github.com/user-attachments/assets/03c84292-c802-4755-949c-4266a9a72fc0"
/>


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-11 23:02:51 +08:00
linfeng-yuan
0fbe0831ec [bugfix][refactor] fix recompute_scheduler break with vllm 0.12.0 & support async scheduling & refactor recompute_scheduler.py (#4895)
### What this PR does / why we need it?
Currently, the initialization and fundamental functions of
RecomputeScheduler are broken with `vLLM v0.12.0`. This PR fixes the
conflicts of `RecomputeScheduler` and refactor its implementations by
inheriting original `Scheduler` of vLLM. Meanwhile, this PR also
supports async cheduling with recompute scheduler by implementing
`AsyncRecomputeScheduler` which is simply inherited `AsncyScheduler` of
vLLM and `RecomputeScheduler` of vLLM-Ascend with python MRO.
### Does this PR introduce _any_ user-facing change?
No. The switch naming is the same as v0.11.0 :
`recompute_scheduler_enable`
### How was this patch tested?
E2E serving with 2P1D dsv3.1 passed. The performance was the same as
original vllm scheduler with `async_scheduling` and preempted requests
in D Nodes are successfully transfered to Proxy and further to P Node.
This significantly improves the performance and robustness of PD
disaggregation deployments.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-12-11 22:24:49 +08:00
wangxiyuan
e538fa6f9c [Doc] Update tutorial index (#4920)
Update tutorial index and remove useless doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 20:53:13 +08:00
SILONG ZENG
e56dba9b0d [CI]cleanup e2e test (#4800)
### What this PR does / why we need it?
This PR refactors the E2E multicard test suite to improve test case
identification and maintainability. Specifically, it renames various
test functions to be more descriptive (explicitly indicating model
families like Qwen/DeepSeek and parallelism strategies like DP/TP/PP/EP)
and cleans up outdated or redundant test configurations in the offline
distributed inference tests.

**Key Changes:**
1. Test Function Renaming (Standardization): Renamed multiple test
functions across **`tests/e2e/multicard/`** to include clear
suffixes/prefixes regarding the model and parallel strategy. This helps
differentiate test cases in CI logs and prevents naming collisions.

**`test_aclgraph_capture_replay.py`:** 
- `test_aclgraph_capture_replay_dp2` ->
`test_aclgraph_capture_replay_metrics_dp2`

**`test_data_parallel.py`:**
- `test_data_parallel_inference` -> `test_qwen_inference_dp2`

**`test_data_parallel_tp2.py`:**
- `test_data_parallel_inference` -> `test_qwen_inference_dp2_tp2`

**`test_expert_parallel.py`:**
- `test_e2e_ep_correctness` -> `test_deepseek_correctness_ep`

**`test_external_launcher.py`:**
- `test_external_launcher` -> `test_qwen_external_launcher`
- `test_moe_external_launcher` -> `test_qwen_moe_external_launcher_ep`
- `test_external_launcher_and_sleepmode` ->
`test_qwen_external_launcher_with_sleepmode`
- `test_external_launcher_and_sleepmode_level2` ->
`test_qwen_external_launcher_with_sleepmode_level2`
- `test_mm_allreduce` ->
`test_qwen_external_launcher_with_matmul_allreduce`

**`test_full_graph_mode.py`:** 
- `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_DECODE_ONLY` ->
`test_qwen_moe_with_full_decode_only`
- `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL` ->
`test_qwen_moe_with_full`

**`test_fused_moe_allgather_ep.py`:** 
- `test_generate_with_allgather `->
`test_deepseek_moe_fused_allgather_ep`
- `test_generate_with_alltoall` -> `test_deepseek_moe_fused_alltoall_ep`

**`test_offline_weight_load.py`:**
- `test_offline_weight_load_and_sleepmode` ->
`test_qwen_offline_weight_load_and_sleepmode`

**`test_pipeline_parallel.py`:**
- `test_models` -> `test_models_pp2`

2. Distributed Inference Cleanup
(**`test_offline_inference_distributed.py`**):

**model list changes:**
```
QWEN_DENSE_MODELS = [
-     "vllm-ascend/Qwen3-8B-W8A8", "vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8"
+     "vllm-ascend/Qwen3-8B-W8A8",
]
```

```
- QWEN_W4A8_OLD_VERSION_MODELS = [
-    "vllm-ascend/Qwen3-8B-W4A8",
- ]

- QWEN_W4A8_NEW_VERSION_MODELS = [
-     "vllm-ascend/DeepSeek-V3-W4A8-Pruing",
-     "vllm-ascend/DeepSeek-V3.1-W4A8-puring",
- ]

+ DEEPSEEK_W4A8_MODELS = [
+      "vllm-ascend/DeepSeek-V3.1-W4A8-puring",
+ ]
```

**Test Function Changes:**
- removed `test_models_distributed_QwQ`
- removed `test_models_distributed_Qwen3_W8A8`
- removed `test_models_distributed_Qwen3_W4A8DYNAMIC_old_version`
- `test_models_distributed_Qwen3_W4A8DYNAMIC_new_version` ->
`test_models_distributed_Qwen3_W4A8DYNAMIC`

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-12-11 20:35:32 +08:00
Li Wang
3349f61769 [CI] Cancel whl build when submitting a new commit (#4925)
### What this PR does / why we need it?
From a resource-saving perspective, canceling old jobs when submitting
new commits can reduce github_hosted in queue

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-11 19:54:52 +08:00
wangxiyuan
c30b51e764 Refactor CI workflow (#4912)
- merge image build workflow into one
- merge package build workflow into one
- merge community related workflow into one

This change makes the workflow more clear

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 19:34:43 +08:00
Shanshan Shen
551069e53a [Doc] Update structured output doc with upstream link (#4015)
### What this PR does / why we need it?
Currently, the usage of structured output feature in vllm-ascend is
totally the same as that in vllm.

Thus, IMO, it's better to remove this doc directly to avoid some case
that there are some changes in the upstream doc and we don't update our
doc in time, which can be misleading to users.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-11 19:14:29 +08:00
wangxiyuan
06a66939cd Remove mindie_turbo (#4896)
mindie_turbo is out of data for long time. This PR remove the related register method.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 18:46:12 +08:00
wangxiyuan
b89763f1ed [CI] speed up ut (#4901)
avoid model download to speed up ut test. 

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 18:45:43 +08:00
Jade Zheng
3fade30275 [Bugfix] Prevent engine hang during KVCacheSendingThread startup (#4754)
Previously, if the KVCacheSendingThread couldn't create a socket because
of port conflicts or other problems, the main thread would wait
endlessly for the ready_event signal, causing the entire engine
initialization to freeze. This update fixes the issue by adding timeouts
for thread startup and handling unexpected thread exits, so the
initialization process no longer gets stuck indefinitely.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-11 18:39:25 +08:00