Commit Graph

1697 Commits

Author SHA1 Message Date
lilinsiman
3f7a2fba70 [main][doc] Instructions for using permissions added to docker (#5092)
### What this PR does / why we need it?
Instructions for using permissions added to docker

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-17 15:26:09 +08:00
zzzzwwjj
06b82e7503 [main] rename device type (#5099)
### What this PR does / why we need it?
Rename `_910B` to `A2`;
Rename `_910_93` to `A3`;
Rename `_910_95` to `A5`;

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-17 14:08:19 +08:00
wangxiyuan
4144376e88 [CI] Fix UT (#5106)
Fix the broken UT introduced by #5053.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-17 09:52:20 +08:00
weiguihua2
bf97048bce [feat]pd disaggregated support cross-machine (#5008)
### What this PR does / why we need it?
Add cross-machine support for PD disaggregation.
The primary and secondary node information of the P node is sent to the D node.
When the D node pulls the KV data, it looks up the corresponding primary or
secondary node information in the mapping.
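
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: `NodeInfo`, `build_p_node_mapping`, `resolve_kv_source`, and the `"primary"`/`"secondary"` keys are hypothetical names, and the PR description does not say how the D node decides which of the two entries to use, so the role is passed in explicitly here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical address record for one endpoint of a prefill (P) node."""
    host: str
    port: int


def build_p_node_mapping(primary: NodeInfo, secondary: NodeInfo) -> dict:
    """Mapping that the P side would send to the D side (illustrative layout)."""
    return {"primary": primary, "secondary": secondary}


def resolve_kv_source(mapping: dict, role: str) -> NodeInfo:
    """Look up which P endpoint the D node should pull KV data from."""
    if role not in mapping:
        raise KeyError(f"unknown role: {role!r}")
    return mapping[role]


# Example values are made up for the sketch.
mapping = build_p_node_mapping(NodeInfo("10.0.0.1", 5000), NodeInfo("10.0.0.2", 5000))
print(resolve_kv_source(mapping, "secondary"))
```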

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-17 09:28:03 +08:00
Wang Yixuan
153eeaa621 [Bugfix] Fix DeepSeek FIA error in async_scheduling with mtp (#5046)
### What this PR does / why we need it?
When async_scheduling is enabled in large-scale EP scenarios, the MTP module
falls back to eager mode, which causes a mismatch between seq_lens_list and
block_table. This PR adapts the check performed before the draft model's
forward pass.

fix #4986 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-17 09:20:44 +08:00
pichangping
06f33540c4 [UT]add the UT of pcp and dcp in the attention_cp file (#5054)
### What this PR does / why we need it?
Add UTs for PCP and DCP in the attention_cp file.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: pichangping <1337510399@qq.com>
2025-12-17 09:11:33 +08:00
Icey
cadfa5ddc1 [Fusion] [Graph] Add qknorm rope fusion operator (#4711)
### What this PR does / why we need it?
This PR adds the `qkv_rmsnorm_rope` operator and introduces a graph fusion
pass for `qknorm_rope` operations. The implementation includes a new
configuration flag, a pattern matching pass using
`torch._inductor.pattern_matcher`, and a custom Triton kernel for the
fused operation.

Co-authored-by: Angazenn <supperccell@163.com>

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new `additional_config` option.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-17 08:53:44 +08:00
ZixuanWang
b1a853b0f6 Upgrade vllm commit hash to 1216 (#5053)
### What this PR does / why we need it?
Upstream vLLM PR #30212 (https://github.com/vllm-project/vllm/pull/30212)
refactored the attention backend selection interface. This PR adapts
vllm-ascend's get_attn_backend_cls to align with the new upstream
standard, ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-author: leo-pony <nengjunma@outlook.com>
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-12-17 08:48:36 +08:00
zhenwenqi2024
eb4c08f05d [bugfix] fix mtp accept rate (#5093)
### What this PR does / why we need it?
1. npu_model_runner now reuses gpu_model_runner, so this PR deletes some
attributes already defined in gpu_model_runner
2. Fix the MTP accept rate by disabling in_profile_run
3. Remove redundant MoE method selection logic
4. Reverts vllm-project/vllm-ascend#5082, which broke CI in
https://github.com/vllm-project/vllm-ascend/actions/runs/20266314048/job/58190426832?pr=5088

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: v0.12.0
vLLM main:
ad32e3e19c

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-17 01:35:26 +08:00
anon189Ty
5b1da4e914 [Feat] Support async_scheduler and disable_padded_drafter_batch in eagle (#4893)
### What this PR does / why we need it?
We refactored eagle_proposer.py to follow the structure of eagle.py
in vLLM v0.12.0, in order to support the padded drafter batch logic and
the async scheduler.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com>
2025-12-16 22:06:40 +08:00
whx
cee521bad5 [Nightly][BugFix] Install triton for nightly e2e op test. (#5096)
### What this PR does / why we need it?
This PR adds triton-ascend installation to nightly e2e single card
environment.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-12-16 21:31:53 +08:00
Li Wang
c6f60e8dd8 [Nightly] Upgrade single node test to latest main (#5101)
### What this PR does / why we need it?
Sync source code from vllm-ascend on nightly tests

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 21:28:45 +08:00
LI SHENGYONG
8d099a5cd7 [Bugfix] EPLB nightly deepseek (#5095)
### What this PR does / why we need it?
The name of the smoke test file for DeepSeek EPLB has been changed, but
the name in the script hasn't been updated. Fix this bug.

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-16 20:02:54 +08:00
liziyu
190ae55e9f Add a Mooncake installation tutorial for kv pool and update Mooncake installation tutorial (#5069)
### What this PR does / why we need it?
Add a Mooncake installation tutorial for kv pool and update Mooncake
installation tutorial

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-16 19:53:23 +08:00
zhenwenqi2024
4ed2951400 【Feature】refactor npu_modelrunner for profile_run (#4993)
### What this PR does / why we need it?
(1) Refactor npu_model_runner for profile_run
(2) Move _select_moe_comm_method to ascend_forward_context
(3) Delete _init_model_kwargs in npu_model_runner

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
2025-12-16 17:44:04 +08:00
Trunrain
af64087732 [bugfix] matmul_allreduce_add_rmsnorm aclnn interface (#5082)
What this PR does / why we need it?
Fix the extern "C" declaration of the aclnn interface for the A2 kernel.

Does this PR introduce any user-facing change?
No

How was this patch tested?
vLLM version: v0.12.0

Signed-off-by: tongrunze <t00574058@china.huawei.com>
Co-authored-by: tongrunze <t00574058@china.huawei.com>
2025-12-16 17:36:40 +08:00
wangxiyuan
d11b74a571 Add release note for v0.11.0 (#4918)
Add release note for v0.11.0. We'll release soon.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-16 17:31:45 +08:00
zhaomingyu13
039cc65e58 [Doc] Add user guide of speculative decoding (#5074)
### What this PR does / why we need it?
Add a user guide for speculative decoding that covers n-gram, EAGLE,
MTP, and suffix decoding.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2025-12-16 17:01:44 +08:00
Wang Yixuan
ff0a1e012a [BugFix]Fix FIA input err in DSv3.1 (#5059)
### What this PR does / why we need it?
When MTP, full-decode-only mode, and async_scheduling are used together,
the FIA op reports an input error because the 'actual_seq_lengths' input
is non-increasing. The bug is caused by how the variable
'query_start_loc' is filled: the end of 'query_start_loc' must be filled
with 'cu_num_tokens' instead of '-1'.
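
To see why the fill value matters, here is a minimal pure-Python sketch. It is not the vllm-ascend code: `build_query_start_loc`, `is_non_decreasing`, and the padded size are hypothetical, and the sketch assumes that "filling with cu_num_tokens" means repeating the last cumulative token count in the padded tail.

```python
def build_query_start_loc(cu_num_tokens, padded_size, fill_value=None):
    """Build a padded start-offset list from cumulative token counts.

    fill_value=None repeats the last cumulative count in the padded tail
    (the assumed fixed behaviour); passing -1 mimics the old behaviour.
    """
    loc = [0] + list(cu_num_tokens)
    if fill_value is None:
        fill_value = loc[-1]
    loc += [fill_value] * (padded_size - len(loc))
    return loc


def is_non_decreasing(values):
    """True when no element is smaller than its predecessor."""
    return all(a <= b for a, b in zip(values, values[1:]))


cu_num_tokens = [2, 4, 6]  # three requests with two tokens each
old = build_query_start_loc(cu_num_tokens, 6, fill_value=-1)
new = build_query_start_loc(cu_num_tokens, 6)
print(old, is_non_decreasing(old))  # [0, 2, 4, 6, -1, -1] False
print(new, is_non_decreasing(new))  # [0, 2, 4, 6, 6, 6] True
```

With the `-1` tail, any sequence lengths derived from the offsets stop being monotonic, which matches the kind of input the description says the FIA op rejects.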

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-16 16:40:35 +08:00
zhangxinyuehfad
18d2395f5e [Bugfix] fix fastapi version (#5047)
### What this PR does / why we need it?

Pin the fastapi version to 0.123.10 (<0.124.0).

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-16 15:58:27 +08:00
zhenwenqi2024
ddd475d5be [ModelRunner] apply_grammer uses vllm function (#4974)
### What this PR does / why we need it?
This PR removes apply_gramme in npu_model_runner. We move the logits to
the CPU and then do the same thing as gpu_model_runner.
This may affect performance; we will revisit it once torch.compile is
supported with the NPU inductor.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2025-12-16 15:26:01 +08:00
Li Wang
a63ef031af [Doc] Upgrade some outdated doc (#5062)
### What this PR does / why we need it?
Update some outdated docs so that the documented steps run correctly.

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 11:48:19 +08:00
Canlin Guo
bb3a826e08 [Refactor] Remove the process patches of Qwen2.5-VL and Qwen2.5-Omni (#5035)
### What this PR does / why we need it?

Related to #4084. Previously, we temporarily added patches so that
`set_forward_context` was replaced by `set_ascend_forward_context` in the
functions `_process_image_input` and `_process_video_input` of the
`Qwen2.5-VL` and `Qwen2.5-Omni` models. After removing these patches, I
hit an `AttributeError` because `ForwardContext` was missing
`prefetch_mlp_enabled`, so we need to add a defensive check for
`prefetch_mlp_enabled`.
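
A defensive check of this kind can be sketched with `getattr` and a default. The helper name and the choice to treat a missing attribute as "disabled" are assumptions for illustration; `SimpleNamespace` stands in for the real `ForwardContext`.

```python
from types import SimpleNamespace


def is_prefetch_mlp_enabled(forward_context) -> bool:
    """Return the flag if present; treat a missing attribute as disabled."""
    return bool(getattr(forward_context, "prefetch_mlp_enabled", False))


# Stand-ins for a context created by the upstream and the Ascend code paths.
upstream_ctx = SimpleNamespace()
ascend_ctx = SimpleNamespace(prefetch_mlp_enabled=True)

print(is_prefetch_mlp_enabled(upstream_ctx))  # False, no AttributeError
print(is_prefetch_mlp_enabled(ascend_ctx))    # True
```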

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
    --max-model-len 30000 \
    --max-num-batched-tokens 50000 \
    --max-num-seqs 30 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --dtype bfloat16
```

```
{"id":"chatcmpl-b66d8acb76905c49","object":"chat.completion","created":1765796863,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration reads \"TONGYI Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":88,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-16 11:43:52 +08:00
Chao Lei
9c02fa9867 [bugfix] Fix mooncake kvpool accuracy issue (#4976)
### What this PR does / why we need it?

The current KVPool has an accuracy issue
(https://github.com/vllm-project/vllm-ascend/issues/4412). This PR aims to
fix the precision problem without impacting prefill performance.

Note: Due to a bug in ADXL, calling `current_event.synchronize()` may
occasionally hang. This issue will be fixed in CANN version 8.5.RC1. Until
the 8.5.RC1 release, you can manually build the master branch of the
project at https://gitcode.com/cann/hixl to resolve this issue.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-16 11:33:16 +08:00
realliujiaxu
9e24bdd44c [Feat] Refactor rejection sampler (#4975)
### What this PR does / why we need it?

Currently, we use `AscendRejectionSampler`, which extends
`RejectionSampler`, in spec decoding. `AscendRejectionSampler` overrides
`forward` of `RejectionSampler` only to replace the `rejection_sample`
function. As a result, much of the `RejectionSampler` code cannot be
reused, for example:
- https://github.com/vllm-project/vllm/pull/19482
- https://github.com/vllm-project/vllm/pull/26060
- https://github.com/vllm-project/vllm/pull/29223

#### Proposed Change:
- Delete `AscendRejectionSampler` and use `RejectionSampler` directly in
the model runner.
- Patch `RejectionSampler.expand_batch_to_tokens` and
`RejectionSampler.rejection_sample`; a better approach may be to make them
custom ops.
- Modify `NPUModelRunner` following
https://github.com/vllm-project/vllm/pull/26060
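
The difference between overriding `forward` in a subclass and patching only the inner function can be sketched with stand-in classes. Nothing below is the vLLM or vllm-ascend implementation: the `RejectionSampler` here is a toy class with the same method names, and `npu_rejection_sample` is a hypothetical replacement.

```python
class RejectionSampler:
    """Toy stand-in for the upstream sampler (not the real vLLM class)."""

    def forward(self, draft_tokens, target_tokens):
        # Upstream pre/post-processing would live here and stays reusable.
        accepted = self.rejection_sample(draft_tokens, target_tokens)
        return {"accepted": accepted, "source": "upstream forward"}

    def rejection_sample(self, draft_tokens, target_tokens):
        return [d for d, t in zip(draft_tokens, target_tokens) if d == t]


def npu_rejection_sample(self, draft_tokens, target_tokens):
    """Hypothetical device-specific replacement: accept the matching prefix."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted


# Patch only the inner function; forward() is inherited unchanged.
RejectionSampler.rejection_sample = npu_rejection_sample

result = RejectionSampler().forward([1, 2, 3, 4], [1, 2, 9, 4])
print(result)  # {'accepted': [1, 2], 'source': 'upstream forward'}
```

Because `forward` is left untouched, later upstream changes to it are picked up without having to copy them into a subclass.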

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (test with
https://github.com/vllm-project/vllm-ascend/pull/4893/)


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-12-16 11:32:26 +08:00
dependabot[bot]
5f840696c1 Bump actions/checkout from 4 to 6 (#5015)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-16 11:30:41 +08:00
LI SHENGYONG
0918de58d5 [Bugfix] dynamic eplb does't use fused_alltoall (#4919)
### What this PR does / why we need it?
The fused alltoall operator itself was not designed or implemented to
handle the scenario where tensors are lists, but the weights for dynamic
load balancing are in list form.
Therefore, we have disabled this operator when using dynamic load
balancing.
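
The resulting selection rule can be written as a small guard. This is a sketch only: the function name, its parameters, and the returned strings are hypothetical and are not taken from the vllm-ascend code.

```python
def select_alltoall_impl(fused_requested: bool, dynamic_eplb: bool) -> str:
    """Pick the all-to-all path; the fused operator is skipped with dynamic EPLB.

    Dynamic load balancing keeps expert weights as lists, which the fused
    operator is described as not supporting.
    """
    if fused_requested and not dynamic_eplb:
        return "fused_alltoall"
    return "alltoall"


print(select_alltoall_impl(fused_requested=True, dynamic_eplb=True))   # alltoall
print(select_alltoall_impl(fused_requested=True, dynamic_eplb=False))  # fused_alltoall
```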

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-16 10:59:30 +08:00
UnifiedCacheManager
195eac665b [Core][Worker] Add UCMConnector for KV Cache Offloading (#4411)
### What this PR does / why we need it?

This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.

Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends (DRAM
/ NFS / Localdisk, depending on UCM configuration).
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.

**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.

---

### Does this PR introduce _any_ user-facing change?

Yes, but limited:

- A new `kv_connector=UCMConnector` option becomes available through the
configuration interface.
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.
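
As a rough illustration of opting in, the snippet below builds a KV-transfer configuration as JSON. Only `kv_connector=UCMConnector` comes from this description; the `kv_role` value is an assumption, and a real deployment very likely needs additional UCM-specific settings that are not shown here.

```python
import json

# Minimal, assumed configuration; consult the UCM documentation for the
# backend-specific options that a working setup requires.
kv_transfer_config = {
    "kv_connector": "UCMConnector",
    "kv_role": "kv_both",  # assumption: act as both producer and consumer
}

cli_value = json.dumps(kv_transfer_config)
print(f"--kv-transfer-config '{cli_value}'")
```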

This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.

---

### How was this patch tested?

---

### Prefix Caching Benchmark

We provide preliminary TTFT (ms) measurements under the vLLM benchmark.
Tests were run on 2 * Ascend 910B3, vllm-ascend 0.11.0, Tensor Parallel size
2, with UCM (Localdisk) enabled.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2025-12-16 10:53:30 +08:00
SILONG ZENG
237fad635c [Fix]Revert temporary skip on mtp1/mtp2 correctness tests (aclgraph fix) (#5039)
### What this PR does / why we need it?
This Pull Request removes the @pytest.mark.skip decorators from
test_mtp1_correctness_piecewise_graph and
test_mtp2_correctness_piecewise_graph.

These tests were temporarily skipped because of an issue with the MTP
ACL Graph (as per the original TODO comment). Since the relevant
bug/issue has been resolved, these tests are now re-enabled to ensure
full correctness coverage for MTP functionality.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-16 10:40:00 +08:00
Li Wang
6063853ead [Misc] Upgrade vllm commit hash to 1215 (#5029)
### What this PR does / why we need it?
Upgrade vllm commit hash to `4429d934de3c5cc327b0d7aec8e473aeba38db90`

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 09:23:02 +08:00
MengLong Chen
5e0ada5395 [Bugfix] Fix the attn_metadata is None (#5038)
### What this PR does / why we need it?
Fix the bug "TypeError: 'NoneType' object is not iterable" in
vllm_ascend/compilation/acl_graph.py.
The cause is that attn_metadata is None in the dummy_run of MTP.
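
A guard of this kind can be sketched as below. The helper name and the dictionary layout are hypothetical; the actual fix in acl_graph.py may handle the dummy-run case differently.

```python
def layers_with_metadata(attn_metadata):
    """Return the keys to iterate over, tolerating a missing metadata object.

    Iterating directly over None raises
    "TypeError: 'NoneType' object is not iterable".
    """
    if attn_metadata is None:
        return []
    return list(attn_metadata)


print(layers_with_metadata(None))                            # []
print(layers_with_metadata({"layer.0": {}, "layer.1": {}}))  # ['layer.0', 'layer.1']
```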

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-12-16 09:14:05 +08:00
Clorist33
d43cabc2b1 [Bugfix] Fix precision issues in moe_mlp (vllm-ascend main) (#5025)
### What this PR does / why we need it?
Replace group_diff[0] with group_list[0] in the function
"cumsum_group_list" (moe_mlp.py), so that the conversion from cumsum to
count uses the correct logic.
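
The cumsum-to-count conversion works as follows: the first count equals the first cumulative value, and every later count is the difference between neighbouring cumulative values. The sketch below uses plain Python lists and a hypothetical function name rather than the tensors used in moe_mlp.py.

```python
def cumsum_to_count(group_list):
    """Convert cumulative group sizes into per-group counts."""
    if not group_list:
        return []
    # Differences between neighbours give the counts of groups 1..n-1.
    group_diff = [b - a for a, b in zip(group_list, group_list[1:])]
    # The first count must come from group_list[0], not from group_diff[0].
    return [group_list[0]] + group_diff


print(cumsum_to_count([3, 5, 9]))  # [3, 2, 4]
```

Using `group_diff[0]` for the first entry would give `[2, 2, 4]` in this example, which no longer sums to the total of 9 tokens.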

### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: tanqingshan (A)  <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
2025-12-16 08:39:54 +08:00
fems14
b662d914a4 [bugfix] [main] Fix KV cache query inconsistency across different TP ranks in the KV Pool (#5030)
### What this PR does / why we need it?
In the current KV Pool scenario for models like MLA and GQA, where
different TP ranks generate identical KV caches, the system is designed
to store only a single copy. The previous approach allowed each card to
query storage requirements dynamically, but inconsistent query results
across cards led to incorrect storage. To fix this, the new solution
pre-allocates storage responsibilities; each card now simply stores its
pre-assigned blocks, bypassing the inconsistent query step and ensuring
data correctness.
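
One way to pre-assign storage responsibilities is a fixed round-robin split, sketched below. The round-robin rule and the function name are assumptions made for illustration; the description does not state which assignment scheme the PR uses.

```python
def assigned_blocks(block_ids, tp_rank, tp_size):
    """Blocks that one TP rank stores, decided without querying the pool."""
    return [b for i, b in enumerate(block_ids) if i % tp_size == tp_rank]


block_ids = ["blk0", "blk1", "blk2", "blk3", "blk4"]
tp_size = 2
per_rank = [assigned_blocks(block_ids, rank, tp_size) for rank in range(tp_size)]
print(per_rank)  # [['blk0', 'blk2', 'blk4'], ['blk1', 'blk3']]
```

Because every rank derives its share from the same deterministic rule, each block is stored exactly once regardless of what any individual rank would have observed when querying the pool.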

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-12-15 21:56:05 +08:00
Jade Zheng
c064d11fd7 [Cleanup] Remove unused attn_metadata parameter from Proposer classes (#4862)
The `attn_metadata` is not used by any draft proposer, so we can remove
it.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 21:21:38 +08:00
whx
a9625851ef [Attention] Temporarily add back pa for small batch sizes. (#4765)
### What this PR does / why we need it?
This PR adds PA back for small batch sizes for performance reasons.
PA will be removed once FIA performs better than PA in all scenarios.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-15 20:35:50 +08:00
baxingpiaochong
95e6400128 [KVPool]Fix PP get bug (#5007)
### What this PR does / why we need it?

When kv caches are evicted from the key-value pool, it's possible that
the kv cache for pp0 is still active, but the kv cache for pp1 has
already been evicted. Therefore, a unified check is needed during the
get operation.
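
A unified check can be sketched as "a block counts as a hit only if every PP stage still has it". The function name and the per-stage boolean layout below are hypothetical, and stopping at the first miss (prefix semantics) is an assumption.

```python
def unified_hit_count(exists_per_stage):
    """Number of leading blocks that are present in every PP stage.

    exists_per_stage[s][i] says whether block i is still in the pool
    for pipeline stage s.
    """
    count = 0
    for per_block in zip(*exists_per_stage):
        if not all(per_block):
            break
        count += 1
    return count


pp0 = [True, True, True]   # pp0 still holds all three blocks
pp1 = [True, False, True]  # pp1 has lost the second block
print(unified_hit_count([pp0, pp1]))  # 1
```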


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 20:27:57 +08:00
InSec
a5cb8e40f5 [doc]Modify quantization tutorials (#5026)
### What this PR does / why we need it?
Correct a few mistakes in the quantization tutorials
Qwen3-32B-W4A4.md and Qwen3-8B-W4A8.md:
- Qwen3-8B-W4A8: one idle NPU card needs to be set.
- Qwen3-32B-W4A4: two idle NPU cards need to be set for the flatquant
training, and the calib_file path, which did not match the ModelSlim
version, is corrected.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: IncSec <1790766300@qq.com>
2025-12-15 20:12:06 +08:00
zhangyiming
e90e8afc94 [E2E] Collect test run time. (#5018)
### What this PR does / why we need it?
[E2E] Collect test run time.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-15 20:06:48 +08:00
zhangxinyuehfad
019c8e03c2 [CI] Delete deepseek3.2-exp nightly test (#5028)
### What this PR does / why we need it?

Delete the deepseek3.2-exp nightly test first; deepseek3.2-exp will be
replaced with deepseek3.2 after the nightly tests pass.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-15 20:01:53 +08:00
Li Wang
8d2998d0e4 [Misc] Upgrade vllm hash to 12_14 (#5000)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
1. fix https://github.com/vllm-project/vllm/pull/27938
2. fix https://github.com/vllm-project/vllm/pull/27145
pooling models now support chunked prefill and prefix caching
3. fix https://github.com/vllm-project/vllm/pull/30181
define the CPU fields in the field config where they really belong.
4. fix https://github.com/vllm-project/vllm/pull/28168
define the CPU fields in the field config where they really belong.
5. fix https://github.com/vllm-project/vllm/pull/30201
some module renames
6. fix https://github.com/vllm-project/vllm/pull/29067
fusedmoe module refactor
7. fix https://github.com/vllm-project/vllm/pull/29066
fusedmoe module refactor
8. fix https://github.com/vllm-project/vllm/pull/29624
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-15 19:54:23 +08:00
wangx700
3b7eb5179f [Bugfix] fix the incorrect use of python's sum on tensors. (#4655)
### What this PR does / why we need it?
Fix the incorrect use of python's sum function on PyTorch tensors.
1. Using Python's sum() function on a tensor self.num_pcp_pads resulted
in 6ms execution time
Optimization: replacing with PyTorch's torch.sum() reduced execution
time to 474µs
2. scheduler_output.scheduled_spec_decode_tokens undergoes repeated loop
processing even when speculative decoding is not used

Optimization: added conditional logic to skip processing loops when
speculative decoding is disabled, eliminating unnecessary computational
overhead.


- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

Signed-off-by: wangx700 <wangxin700@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-15 19:22:40 +08:00
zengzengran
6029bea480 [UT]add pcp dcp ut (#4949)
### What this PR does / why we need it?
Adding UT for DCP/PCP

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zengran <zengran2@huawei.com>
2025-12-15 18:41:38 +08:00
Icey
5fae65f3a8 [Graph][Fusion] Add AddRMSNorm(with bias) and Quant Fusion Pattern (#5011)
### What this PR does / why we need it?
AddRMSNorm(with bias) and Quant Fusion Pattern

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with newly added and existing tests.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-15 18:37:56 +08:00
fluctlux
6de4bedd04 update release note for suffix decoding (#5009)
update release note for suffix decoding

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
2025-12-15 17:22:19 +08:00
Levi
df7e0fe916 [Bugfix] qwen3-vl-235b-w8a8 load weight ERROR when start service (#4292)
### What this PR does / why we need it?
Fix the weight-loading error that occurs when starting the service with
qwen3-vl-w8a8. With this PR, 0.12.0rc1 can start qwen3-vl-235b-w8a8.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-15 16:39:58 +08:00
knight0528
e25c57b346 [Bugfix] Add support for PP intermediate value types in graph mode (#4902)
This PR adds support for handling intermediate value types in pipeline
parallelism when running in graph mode.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhangshushun <3265779424@qq.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 16:27:17 +08:00
zzhxxx
e16444f21f [Bugfix] Fix the bug in initializing the shared_weight communication domain in sfa-cp, and fix the mtp weight load in pp>1 situation (#4913)
### What this PR does / why we need it?
PR #4188 introduced a small bug that left sfa-cp unable to find the
global_pp_size parameter during initialization. This PR fixes the issue.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 16:21:49 +08:00
SILONG ZENG
70606e0bb9 [Test]update accuracy test of models (#4911)
### What this PR does / why we need it?
Delete accuracy tests for models that are no longer retained:
- Meta-Llama-3.1-8B-Instruct
- llava-1.5-7b-hf
- InternVL2-8B.yaml
- InternVL2_5-8B.yaml
- InternVL3-8B.yaml

Add accuracy tests for the new models:
- Llama-3.2-3B-Instruct
- llava-onevision-qwen2-0.5b-ov-hf
- Qwen3-VL-30B-A3B-Instruct

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-12-15 15:04:20 +08:00
Chao Lei
b75bfc58f6 [Doc ] Supplement kvpool user guide (#5013)
### What this PR does / why we need it?
Supplement detailed descriptions for `ASCEND_CONNECT_TIMEOUT` and
`ASCEND_TRANSFER_TIMEOUT` in kvpool.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-15 14:24:39 +08:00
Chen Chen
aa02a85e4d [bugfix] Fix dummy-run and multi-node issues in MoE routing and MTP (#4947)
### What this PR does / why we need it?

- Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the
routing kernel completes correctly instead of exiting early in certain
paths.
- Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher
and prepare/finalize routines, aligning MoE communication with the MC2
algorithm optimized for Ascend devices.
- Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back
to `ALLTOALL` until the MoE communication type selection logic is fully
finalized, avoiding incorrect behavior in dummy-run flows.
- Simplify the MoE communication selection for Ascend 910-93 in
`NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`,
which fixes failures in multi-node / larger-EP configurations while
keeping MC2 routing under the configured token capacity.
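
The temporary override in the third bullet amounts to a small mapping step, sketched below. The enum and function are stand-ins written for this illustration and are not the definitions used in vllm-ascend.

```python
from enum import Enum


class MoECommType(Enum):
    """Stand-in enum for the MoE communication types named above."""
    ALLTOALL = "alltoall"
    FUSED_ALLTOALL = "fused_alltoall"
    MC2 = "mc2"


def comm_type_for_mtp(selected: MoECommType) -> MoECommType:
    """Map FUSED_ALLTOALL back to ALLTOALL; leave other choices unchanged."""
    if selected is MoECommType.FUSED_ALLTOALL:
        return MoECommType.ALLTOALL
    return selected


print(comm_type_for_mtp(MoECommType.FUSED_ALLTOALL))  # MoECommType.ALLTOALL
print(comm_type_for_mtp(MoECommType.MC2))             # MoECommType.MC2
```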

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: mojave2 <chenchen145@huawei.com>
2025-12-15 14:18:23 +08:00