Commit Graph

312 Commits

Author SHA1 Message Date
Mengqing Cao
e54630e01c Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) (#5317)
### What this PR does / why we need it?
Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as
it causes deepseek v3.2 hang error


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-24 22:24:17 +08:00
Slightwind
22138e2727 [main][Refactor] Remove with_prefill parameter from set_ascend_forward_context (#5094)
Removes the redundant `with_prefill` parameter from
`set_ascend_forward_context` to align the interface with vLLM's
`set_forward_context` for future refactoring.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Slightwind <slightwindsec@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
2025-12-23 14:30:50 +08:00
Mengqing Cao
449f8f65a7 [KV-Sharing] Support KV-Sharing feature in CLA models (#4138)
### What this PR does / why we need it?
Support KV-Sharing feature in CLA (cross layer attention) models, which
sharing kv cache in some layers.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-23 10:48:31 +08:00
Li Wang
9a79cbaecb [ModelRunner] Add hunyuan-vl basic support (#5151)
### What this PR does / why we need it?
This patch add handling of `XDRotaryEmbedding` in modelrunner to support
for `hunyuan-vl`
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with added/exist tests

Closes: https://github.com/vllm-project/vllm-ascend/issues/4992

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-23 10:46:54 +08:00
Wang Kunpeng
c3a8d13ca7 [refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204)
### What this PR does / why we need it?
Remove unnecessary attributes from set_ascend_forward_context
1.prefetch_stream
2.weight_prefetch_method
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-23 08:49:52 +08:00
weijinqian0
95e8a52156 [Refactor] move the metadata from attention_v1 to util(ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629

1. Remove the pcp-related code from attention_v1.
2. Establish the inheritance relationship of CommonAttentionMetadata.

TODO
1. extract common_cp
2. move cp metadata to common_cp.
3. remove commonAttentionMetadata for aclgraph.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-23 00:10:52 +08:00
zhangxinyuehfad
61efaffcaf [Bugfix] Implement multimodal_cpu_fields in model runner (#5196)
### What this PR does / why we need it?
Related to https://github.com/vllm-project/vllm-ascend/issues/4084
Implement multimodal_cpu_fields in model runner

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-22 18:39:45 +08:00
zhangsicheng5
78aa7f2693 [feature] support pcp + mtp in full graph (#4572)
1. support pcp + mtp in full graph
2. pcp/dcp related mtp bugfix
3. support pcp + mtpx

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-12-22 16:13:39 +08:00
Yizhou
60d9398f6d [1/N][Eagle3] Aligns auxiliary hidden state usage for eagle3 models (#5162)
### What this PR does / why we need it?
This is to prepare for the migration to vLLM's `EagleProposer`, it does
not have `name` attribution. Also it's a breakdown of #5100 .

Introduces logic to determine whether eagle3 heads require auxiliary
hidden states based on configuration, ensuring consistent handling
across related components. Prevents incorrect assumptions for eagle3
variants that do not use auxiliary outputs, improving compatibility and
correctness.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-22 15:24:54 +08:00
YuhanBai
5d02eed16f [Performance] Add async exponential while model executing (#4501)
### What this PR does / why we need it?
Add a control to enable the exponential distribution operator
overlapping with model executing (default is OFF due to this feature
might not perform well on MOE models, i.e. For Qwen3-30B).
Enable async exponential overlapping will provides performance
improvement.
Also, overlapping the exponential operator with module execution can
cover the performance drop introduced by AICPU-version's exponential
operator.

**UPDATE**: (12/12)
Now our overlap will use the same stream that introduced in this pr:
#4908 .
We move the `do_async_exponential` from `model_runner_v1.py` to
`sampler.py`.
Now we are using `additional_config` to enable async exponential:
Add `"enable_async_exponential": 1` in `addition_config`.
Now we **ONLY** support default exponential/AI-CPU exponential, the old
`"enable_async_exponential": 2` option has been aborted to keep
consistency.

### Does this PR introduce _any_ user-facing change?
**YES**, added a new `additional_config` : `"enable_async_exponential":
1`.
When `enable_async_exponential` is set to 1, we enable the async
exponential and overlap with model runner.
When `enable_async_exponential` is set to 0 (default is 0), we disable
the async exponential, but exponential will still running on a different
stream using stream introduced in #4908.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>
Signed-off-by: YuhanBai yuhan.bai0830@gmail.com
2025-12-20 21:23:21 +08:00
lianyibo
58773af708 [Fix] Delete pooling redundant code (#4940)
### What this PR does / why we need it?
Remove redundant code in #3122.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
2025-12-20 20:47:30 +08:00
wangxiyuan
758d81dcb1 Drop 0.12.0 support (#5146)
We decided to release v0.13.0 soon. So no need to support 0.12.0 now.
Let's drop it.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-20 09:38:53 +08:00
weijinqian0
35ad11b637 [Refactor] remove some metadata variables in attention_v1. (#5160)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629

Reason:

The metadata data class contains an excessive number of variables. We
will inherit the metadata of the community and simultaneously remove
some variables that are no longer needed at present.

Todo:
1. remove attn_state partly.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-19 14:57:09 +08:00
zzzzwwjj
cc23067f1e [refactor] refactor weight trans nz and transpose (#4878)
### What this PR does / why we need it?

Now `VLLM_ASCEND_ENABLE_NZ` will have three options:
0: disable nz;
1: only quant case enable nz;
2: enable nz as long as possible;

And `VLLM_ASCEND_ENABLE_NZ`=1 by default.

All cases are shown in the table below:

|  | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | can't support nz | trans nz by default | trans nz by
default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz |
| transpose | only support not transpose case | only support transpose
case | only support transpose case | linear: only support not transpose
case<br>gmm: only support transpose case | same to fp16/bf16 |

Some exceptional cases:
1. MLAPO op need to do some additional processing on the weights,
including trans nz. If use MLAPO op, some weight will be transformed to
nz forcely;
2. MLA/SFA's weight `W_UV` will be used by op
`torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support
nz currently;

### Does this PR introduce _any_ user-facing change?
Now fp16/bf16 weight will not trans nz by default.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-19 14:27:24 +08:00
weichen
ca6f631cba [2/N][Pangu][MoE] Remove Pangu Related Code (#5130)
### What this PR does / why we need it?
Remove Pangu Related Code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-19 09:00:07 +08:00
Chen Chen
1b47fca0e8 [bugfix] Use FUSED_MC2 MoE comm path for the op dispatch_ffn_combine (#5156)
### What this PR does / why we need it?

- Renames the MoE comm enum value `MoECommType.FUSED_ALLTOALL` to
`MoECommType.FUSED_MC2` and updates all call sites.
- Updates `select_moe_comm_method` to optionally select `FUSED_MC2` on
Ascend A3 when:
  - `enable_expert_parallel=True`
  - quantization is `w8a8_dynamic`
  - `EP <= 16`
  - `dynamic_eplb` is disabled
  - `is_mtp_model = False`
- Replaces the old “fused all-to-all” comm implementation with
`FusedMC2CommImpl`, using `TokenDispatcherWithMC2` /
`PrepareAndFinalizeWithMC2` and `dispatch_ffn_combine`.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
2025-12-18 23:34:31 +08:00
Zetong Li
2304218f90 [Bugfix] Fix in_profile_run in mtp_proposer dummy_run (#5165)
### What this PR does / why we need it?
This PR aims to fix failure of `enable_force_load_balance` caused by
missing `in_profile_run` in `dummy_run` of mtp_proposer.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-12-18 22:27:47 +08:00
Angazenn
632eab28b7 [BugFix]Fix incorrect get_current_vllm_config (#5121)
### What this PR does / why we need it?
This PR fixes some incorrect `get_current_vllm_config` calling, which
creates empty vllm_config instead.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-12-18 22:21:36 +08:00
Yizhou
ff3914e31a [Fix] Refines decode mode padding condition for uniform queries (#5164)
### What this PR does / why we need it?
The reason why we cannot use `self.cudagraph_batch_sizes[-1]` is that
it's actually not the max number of tokens to be padded in
`FULL_DECODE_ONLY` mode, much larger instead. And it's trimmed only
before capturing to `compilation_cases`, this really caused us lots of
trouble.

Updates the logic to ensure padding occurs only when the number of input
tokens falls within a valid uniform decode query range, improving
consistency and avoiding unnecessary padding in specific decode modes.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-18 21:09:23 +08:00
lidenghui1110
1c8c23de58 [Bugfix] fix pipeline parallelism bug introduced by async-scheduling refactor work (#4973)
### What this PR does / why we need it?
Currently, when using pipeline parallel and pd disaggregate,
model_runner will return None on non-last-pp-rank stages in
`sample_tokens`, which will cause assert error in vllm
KVOutputAggregator on [this
line](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/utils.py#L84).

In fact, all pp workers should return a model_runner_output which
contains kv_connector_output to do aggregate in Enginecore scheduler
process to ensure all kv transfer is finished for kv cache releasing
later.

To fix this issue, this PR follows gpu_model_runner in vllm, passing
kv_connector_output in `sample_tokens` to make sure all ranks will
return a ModelRunnerOutput, in non-last-pp-rank workers, it will return
EMPTY_MODEL_RUNNER_OUTPUT with kv_connector_output.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2025-12-18 15:27:55 +08:00
Yizhou
43d974c6f7 [Fix] Synchronize the host query_start_loc with device values to prevent shape mismatches (#5134)
### What this PR does / why we need it?
Synchronize the host query_start_loc with device values to prevent shape
mismatches when not enable async scheduling.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-17 23:50:12 +08:00
zhenwenqi2024
950570f8d1 [Bugfix]delele profile_run in model_runner (#5122)
### What this PR does / why we need it?
delete sekf.in_profile_run in model_runner to make EPLB works as expect
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-17 23:48:34 +08:00
Yuzhou Tong
7671ce1bf1 Fix a data conversion bug introduced by commit 3b7eb51 in main#4655 (#5115)
### What this PR does / why we need it?

[Fix a data conversion bug introduced by
[main#4655](3b7eb5179f)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-17 20:19:02 +08:00
Icey
cadfa5ddc1 [Fusion] [Graph] Add qknorm rope fusion operator (#4711)
### What this PR does / why we need it?
This PR add `qkv_rmsnorm_rope` operator and introduces a graph fusion
pass for `qknorm_rope` operations. The implementation includes a new
configuration flag, a pattern matching pass using
`torch._inductor.pattern_matcher`, and a custom Triton kernel for the
fused operation.

Co-authored-by: Angazenn
[supperccell@163.com](mailto:supperccell@163.com)

### Does this PR introduce _any_ user-facing change?
Yes, add new additional_config

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-17 08:53:44 +08:00
zhenwenqi2024
eb4c08f05d [bugfix] fix mtp accept rate (#5093)
### What this PR does / why we need it?
1. now, npu_model_runner reuses gpu_model_runner, this pr deletes some
attrs already defined in gpu_model_runner
2. fix mtp accept rate by disabling in_profile_run
3. remove redundant moe method selection logic
4. Reverts vllm-project/vllm-ascend#5082, which broke CI in
https://github.com/vllm-project/vllm-ascend/actions/runs/20266314048/job/58190426832?pr=5088

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: v0.12.0
vLLM main:
ad32e3e19c

vLLM version: v0.12.0
vLLM main:
ad32e3e19c

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-17 01:35:26 +08:00
anon189Ty
5b1da4e914 [Feat] Support async_scheduler and disable_padded_drafter_batch in eagle (#4893)
### What this PR does / why we need it?
We refactored the eagle_proposer.py to adapt the framework of eagle.py
in vllm-v0.12.0, to support the logit of padded drafter batch and
async-scheduler.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com>
2025-12-16 22:06:40 +08:00
zhenwenqi2024
4ed2951400 【Feature】refactor npu_modelrunner for profile_run (#4993)
### What this PR does / why we need it?
(1)refactor npu_model_runner for profile_run
(2) move _select_moe_comm_method to ascend_forward_context
(3) delete _init_model_kwargs in npu_model_runner

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Na
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
2025-12-16 17:44:04 +08:00
Wang Yixuan
ff0a1e012a [BugFix]Fix FIA input err in DSv3.1 (#5059)
### What this PR does / why we need it?
When use mtp, full decdoe only and async_scheduling together, finding a
input err for FIA ops due to the non-increasing input
of the 'actual_seq_lengths'. This bug is caused by the filling the
variable ‘query_start_loc’. We need to fill the query_start_loc' s end
by the 'cu_num_tokens' instead of '-1'

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-16 16:40:35 +08:00
zhenwenqi2024
ddd475d5be [ModelRunner] apply_grammer uses vllm function (#4974)
### What this PR does / why we need it?
this pr removes apply_gramme in npu_model_runner. we change logits to
cpu, and do the same thing with gpu_model_runner.
it may change the performance, we will change it after torch.compile is
supported with npu inductor

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2025-12-16 15:26:01 +08:00
realliujiaxu
9e24bdd44c [Feat] Refactor rejection sampler (#4975)
### What this PR does / why we need it?

Currently, we are using `AscendRejctionSampler` that extends from
`RejctionSampler` in spec decoding. `AscendRejctionSampler` override
`forward` of `RejctionSampler`, only aming to replace `rejection_sample`
func. This
causes a lot of code of `RejctionSampler` cannot be reused, for example:
- https://github.com/vllm-project/vllm/pull/19482
- https://github.com/vllm-project/vllm/pull/26060
- https://github.com/vllm-project/vllm/pull/29223

#### Proposed Change:
- Delete `AscendRejctionSampler` and use `RejctionSampler` directly in
model runner.
- Patch `RejctionSampler.expand_batch_to_tokens` and
`RejctionSampler.rejection_sample`, maybe a better way is to make them
as custom ops.
- Modify `NPUModelRunner` following
https://github.com/vllm-project/vllm/pull/26060

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async shcheduling (test with
https://github.com/vllm-project/vllm-ascend/pull/4893/)


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-12-16 11:32:26 +08:00
LI SHENGYONG
0918de58d5 [Bugfix] dynamic eplb does't use fused_alltoall (#4919)
### What this PR does / why we need it?
The fused alltoall operator itself was not designed or implemented to
handle the scenario where tensors are lists, but the weights for dynamic
load balancing are in list form.
Therefore, we have disabled this operator when using dynamic load
balancing.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-16 10:59:30 +08:00
MengLong Chen
5e0ada5395 [Bugfix] Fix the attn_metadata is None (#5038)
### What this PR does / why we need it?
Fix the bug " TypeError: 'NoneType' object is not iterable' " in
vllm_ascend/compilation/acl_graph.py
The reason of that is the attn_metadata is none in the dummy_run of MTP.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-12-16 09:14:05 +08:00
Jade Zheng
c064d11fd7 [Cleanup] Remove unused attn_metadata parameter from Proposer classes (#4862)
The `attn_metadata` is not used by any draft proposer, so we can remove
it.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 21:21:38 +08:00
Li Wang
8d2998d0e4 [Misc] Upgrade vllm hash to 12_14 (#5000)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
1. fix https://github.com/vllm-project/vllm/pull/27938
2. fix https://github.com/vllm-project/vllm/pull/27145
pooling models now supports chunked prefill and prefix caching,
3. fix https://github.com/vllm-project/vllm/pull/30181
define the CPU fields in the field config where they really belong.
4. fix https://github.com/vllm-project/vllm/pull/28168
define the CPU fields in the field config where they really belong.
5. fix https://github.com/vllm-project/vllm/pull/30201
some moudle rename
6. fix https://github.com/vllm-project/vllm/pull/29067
fusedmoe moudle refactor
7. fix https://github.com/vllm-project/vllm/pull/29066
fusedmoe moudle refactor
8. fix https://github.com/vllm-project/vllm/pull/29624
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-15 19:54:23 +08:00
wangx700
3b7eb5179f [Bugfix] fix the incorrect use of python's sum on tensors. (#4655)
### What this PR does / why we need it?
Fix the incorrect use of python's sum function on PyTorch tensors.
1. Using Python's sum() function on a tensor self.num_pcp_pads resulted
in 6ms execution time
Optimization: replacing with PyTorch's torch.sum() reduced execution
time to 474µs
2. scheduler_output.scheduled_spec_decode_tokens undergoes repeated loop
processing even when speculative decoding is not used

Optimization: added conditional logic to skip processing loops when
speculative decoding is disabled, eliminating unnecessary computational
overhead.


- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

Signed-off-by: wangx700 <wangxin700@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-15 19:22:40 +08:00
Mengqing Cao
6beb4434e1 [CI][Bugfix] Fix scheduleroutput has no attr get error in prompt logprobs (#4998)
### What this PR does / why we need it?
Fix scheduleroutput has no attr get error in prompt logprobs

Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/20194753373/job/57977131870

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-15 11:10:39 +08:00
Yizhou
0686b32d82 [Fix] Fixes issues in MTP with async scheduling and ACL graph (#4963)
### What this PR does / why we need it?
Corrects attention metadata size for MTP when both asynchronous
scheduling and full ACL graph mode are enabled. This prevents potential
size mismatches during execution.

Additionally, improves the robustness of calculating token sample
indices by explicitly aligning tensor shapes.

Finally, prevents padding when the number of input tokens exceeds the
maximum ACL graph batch size to avoid out-of-bounds errors.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Need to add corresponding test case ASAP.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-14 00:10:11 +08:00
wangxiyuan
fd7c929145 [perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks (#4983)
pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix
the merge conflict

### What this PR does / why we need it?
Currently, the all_reduce operation in _sync_metadata_across_dp is
performed with gloo backend which is extremely time-consuming when
DPEngineCores are in different nodes. This operation cannot be ignored
by async scheduling in multi-node-scenarios with speculative decoding
(e.g., EAGLE, mtp).

This pr eliminates the all_reduce operation for D Nodes and change the
input parameter of MoEDispatch & MoeCombine operators to make MC2EP
support different num_tokens across all ranks.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios
while enabling async scheduling. This pr can remove cross-node
all_reduce with gloo backend and further reduce latency with correct
accuracy.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
2025-12-13 18:59:54 +08:00
zhenwenqi2024
4721e4f53f [bugfix] asyncscheduler bug fix (#4968)
### What this PR does / why we need it?
now vllm-ascend uses AsyncGPUModelRunnerOutput
,AsyncNPUModelRunnerOutput before is outdated, so we should fix it

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2025-12-13 17:04:54 +08:00
Jade Zheng
45889a6185 [Bugfix] Pass vllm_config to kv_connector_no_forward in NPUModelRunner (#4970)
### What this PR does / why we need it?

The newest version crashes in PD separation scenarios because the
function is missing the `vllm_config` parameter.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-12 22:36:23 +08:00
zhenwenqi2024
f708d919f8 [Feature] model_runner refactor (#4764)
### What this PR does / why we need it?
refactor npu_modelrunner, we should be close to gpu_modelrunner 

### Does this PR introduce _any_ user-facing change?
NO

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
2025-12-12 17:27:09 +08:00
wangxiyuan
bb76f7962c cleanup useless torchair logic (#4856)
This PR clean up useless torchair logic in model runner. The moge doc is
only for torchair, it can be removed as well.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-11 11:21:13 +08:00
wangxiyuan
08441baedd Remove VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION (#4860)
VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION is enabled by default for long
time. Let's remove it now.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-10 23:50:18 +08:00
drslark
0fb1dc43a1 [BugFix][main] Adapted Qwen3-Next-MTP to chunked prefill (#4770)
### What this PR does / why we need it?
The pad `-1` modification is from
https://github.com/vllm-project/vllm/pull/25743.

It still has bugs for batched chunked prefill.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: drslark <slarksblood@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-10 22:54:24 +08:00
linfeng-yuan
490ddf536f [perf][dsv3.2][async_scheduling] improve dsv3.2 performance by eliminating HD synchronization (#4805)
### What this PR does / why we need it?
This PR eliminates the simplicit HD synchronization in sfa backend, and
_build_dummy_attn_metadata and dummy_run in mtp_proposer, significantly
improving dsv3.2 performance in low-latency scenarios.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Performance improvements are observed with E2E performance serving (P:
DP4TP8EP32 D: DP8TP4EP32) with `num_speculative_tokens=3`.

DSV3.2-W8A8-EXP:
TPOT: 41.67ms -> 23.36ms
ITL: 85.93ms -> 55.96ms

DSV3.2-W8A8 (relaesed in December):
TPOT: 18.11ms
ITL: 56.13ms
 

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-12-10 22:31:47 +08:00
ChenCangtao
dd622aa6a6 [Feature] Support npuhraph_ex backend (#4700)
### What this PR does / why we need it?
We introduced the npugraph_ex backend through the vllm's adaptor
dispatch mechanism to accelerate aclgraph. This solution is based on
torch.compile and uses torchair to optimize the fx.graph. The
performance gains are mainly obtained from the static kernel. We
conducted tests on Qwen3-30B and achieved over 5% performance
optimization.

### Does this PR introduce _any_ user-facing change?
Yes, we add a new switch named"enable_npugraph_ex" in additional_config,
default is False.
We also add an example to show how to register custom replacement pass

### More information about this PR
This feature depends on the release of CANN and torch_npu in Q4. 
We tested it on a package that has not been publicly released yet and
verified that the functionality works.
This feature is still experimental at the moment; setting the config
true will directly raise error.
Merging into the main branch initially involves some preliminary commits
to facilitate subsequent development and testing of the feature, as well
as to avoid submitting an excessively large PR at once.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: panchao-hub <315134829@qq.com>
Co-authored-by: wbigat <wbigat@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-10 20:48:05 +08:00
Yizhou
5b179c53f1 [FEAT] Support DeepSeek-V3.2 with FULL_DECODE_ONLY mode (#4706)
### What this PR does / why we need it?
The first commit support `FULL_DECODE_ONLY`:
- Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for
slicing slots and positions, ensuring fixed tensor shapes.
- Implement padding logic for `query_start_loc` in `NPUModelRunner` to
support uniform decode in full graph mode, aligning with GPU runner
behavior.
- Adjust MLA cosine cache allocation to occur independently of graph
mode and switch to using device-resident sequence lengths for attention
metadata.
- Remove redundant slicing of hidden states and outputs in
`AscendSFAImpl` and optimize `sin`/`cos` cache updates.

The second commit take MTP into account:
- Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for
slicing slots and positions, ensuring fixed tensor shapes.
- Implement padding logic for `query_start_loc` in `NPUModelRunner` to
support uniform decode in full graph mode, aligning with GPU runner
behavior.
- Adjust MLA cosine cache allocation to occur independently of graph
mode and switch to using device-resident sequence lengths for attention
metadata.
- Remove redundant slicing of hidden states and outputs in
`AscendSFAImpl` and optimize `sin`/`cos` cache updates.

And the rest of them are just bugfix.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases needed.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-10 20:11:09 +08:00
JiangWeixiang
0d8c0f1a24 [Bugfix] Fix out-of-bounds access to token_id due to uninitialized logprobs (#4248)
### What this PR does / why we need it?
The logprobs_tensor was not initialized before accessing its token_id
member, leading to a crash when tokenizer.decode() is called by passing
a negative token_id

### How was this patch tested?
Constructed an inference request with two prompts and set
SamplingParams(prompt_logprobs=<non-None value>) (e.g.,
prompt_logprobs=1).
After applying the fix (proper initialization of logprobs_tensor), the
same request completed successfully without errors, and the returned
logprobs matched expected values.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: jiangweixiang <jwx02384838@antgroup.com>
Co-authored-by: jiangweixiang <jwx02384838@antgroup.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-10 17:45:58 +08:00
lidenghui1110
a82b0fa70e mooncake connector support pipeline parallel & fix pp with flashcomm1 (#4054)
### What this PR does / why we need it?
To support pipeline parallel with PD disaggregation, this PR support PP
in mooncake connector and fix other bugs when enable pp with other
optimization params, including following changes:
- mooncake connector support pp in prefill, we do not support decode pp
currently
- fix bugs when enable both pp and flashcomm1
- optimize ascend-scheduler to support full batch in multiple pipeline
stages, original implementation would cause all pipeline stages
batch_size total summed to max_num_seq, which makes pipeline is not
full, this optimization can make all stages running with full batch_size
= max_num_seq, the same changes will contribute to vllm scheduler too.

### Does this PR introduce _any_ user-facing change?
add `pp_size` in mooncake connector kv_connector_extra_config
```
"kv_connector_extra_config": {
            "use_ascend_direct": true,
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 4,
                    "pp_size": 4
             },
             "decode": {
                    "dp_size": 16,
                    "tp_size": 1
             }
        }
```

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zss <zss@qq.com>
Co-authored-by: zss <3265779424@qq.com>
2025-12-10 16:01:43 +08:00
Ruri
ce5872705e [Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516)
### What this PR does / why we need it?

Adds W4A16 quantization method for the Kimi-K2-Thinking model and
updates relevant modules to support the new quantization method.

- Implements complete W4A16 quantization method including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic and MoE method application.
- Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts
`with_quant` conditional logic to support W4A16 matrix multiplication.
- Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and
processing logic for `weight_packed` field.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
2025-12-10 15:58:52 +08:00