Commit Graph

1003 Commits

Author SHA1 Message Date
DreamerLeader
4dbe4fd123 [feature]Pooling Features and PCP Adaptation (#4143)
This PR let pooling kv connector support pcp feature

- vLLM version: v0.11.2

---------

Signed-off-by: fjw <2270923832@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2025-11-29 22:07:45 +08:00
wangxiyuan
1eb5295a1b remove qwen3-next model file (#4573)
Let's remove qwen3-next model filecurrently. We'll support it later by
using vLLM origin model file

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 18:37:26 +08:00
Nengjun Ma
a3041cd78c [Bugfix] fix dp parallel + tp > 1 offline inference port conflict (#4539)
### What this PR does / why we need it?
fix dp parallel + tp > 1 offline inference port conflict

issue import PR:https://github.com/vllm-project/vllm-ascend/pull/429


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-29 18:37:11 +08:00
wangxiyuan
1874265074 Move mla to ops module (#4575)
Move mla custom op to correct module
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 18:36:55 +08:00
Shanshan Shen
2a19215e5f [MM][Model] Remove Qwen2-VL modeling files (#4534)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-29 18:07:01 +08:00
wangxiyuan
f10acddb78 drop ascend scheduler (#4498)
Ascend scheduler was added for non chunk prefill case before, since that
the npu ops didn't work well with chunked prefill.

Now the ops with chunked prefill work better, it's time to remove the
ascend scheduler to use vLLM default scheduler.

- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 16:18:34 +08:00
liziyu
53a52d6614 [P/D] [bugfix] add get_kv_connector_handshake_metadata func for 0.11.2 (#4567)
### What this PR does / why we need it?
add get_kv_connector_handshake_metadata func for 0.11.2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-29 16:09:45 +08:00
LI SHENGYONG
0151022ab8 [bugfix] dep ineffective (#4417)
### What this PR does / why we need it?
The expert mapping table and weights of the dynamic EPLB were not
updated, causing the accuracy to be correct but not effective. This bug
has now been fixed.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-29 15:18:29 +08:00
Ting FU
9af34755ff [Bugfix] Fix model run _npu_flash_attention hang issue (#4410)
Fix model run _npu_flash_attention in _forward_prefill_no_cache hang
issue, it was caused by wrong attention mask dtype.
### How was this patch tested?
Yes, tesed on Qwen2.5-VL and Qwen2.5-Omni

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-29 09:20:22 +08:00
shiyuan680
1c4a0468ee 【OPS】qwen3-next support triton chunk_gated_delta_rule ops (#4070)
### What this PR does / why we need it?
qwen3-next suppot  triton chunk_gated_delta_rule ops

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-28 20:55:43 +08:00
fems14
5447a039b9 [Feature][main]reconstruction kvpool connector to ascend connector (#4438)
### What this PR does / why we need it?
1.In short, we renamed the existing MooncakeStoreConnector to
AscendStoreConnector and extracted the storage engine interaction logic
into a new Backend class.
Associated RFC:https://github.com/vllm-project/vllm-ascend/issues/4329
2.Fixed the issue where the number of input parameters for the connector
was incorrect, introduced in vllm 0.11.2
### Does this PR introduce _any_ user-facing change?
change MooncakeStoreConnector to AscendStoreConnector
### How was this patch tested?

- vLLM version: v0.11.2

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-11-28 18:08:37 +08:00
Chenxi Qian
554f16ae1f [Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804)
### What this PR does / why we need it?

This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.

Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`

This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).

### Does this PR introduce _any_ user-facing change?

No.


- vLLM version: v0.11.2

---------

Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
2025-11-28 18:06:39 +08:00
Shanshan Shen
e52ebf8674 [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention (#4349)
### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at https://github.com/vllm-project/vllm/pull/29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for https://github.com/vllm-project/vllm/pull/28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark


- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-28 14:23:00 +08:00
LHXuuu
bdc66972db [Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vllm.
2. Support CompressedTensorsW8A8 static weight.
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
4. Support CompressedTensorsW8A8Dynamic weight.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic.
5. Modify the override_quantization_method in AscendQuantConfig.

Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
2025-11-28 14:09:39 +08:00
Nengjun Ma
89a1a65300 [bugfix] fix ray start failed: local_world_size cannot little than visible device count error (#4457)
### What this PR does / why we need it?
Fix the ray start failed bug: local_world_size cannot little than
visible device count error
detail see issue #4456.

This fix code is copied from vllm fixing modify, PR:
[#28873](https://github.com/vllm-project/vllm/pull/28873)


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-27 21:18:32 +08:00
drslark
1cae3e4a49 [BugFix] Adapted Qwen3-Next eager mode to v0.11.2 (#4477)
### What this PR does / why we need it?

Adapted Qwen3-Next eager mode to `v0.11.2`.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-27 17:44:59 +08:00
zzzzwwjj
136ea9ff56 [refact] unified soc_version code (#4359)
### What this PR does / why we need it?

Currently, there are two paths to judge the chip type in code,
`get_ascend_soc_version` use `get_soc_version` api in torch_npu, and
`is_310p` `use _build_info.__soc_version__`, which generate when
install. We need to unify the two paths.

We need to unify these codes based on the following points:

1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.

Based on the above points, we have made the following changes:

1. When user set env `SOC_VERSION`, use it; when not set, query
soc_version by `npu-smi`;
2. generate device_type based on soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` in `_build_info.py`;
3. In running state, use `__device_type__` to judge code branch.

### Does this PR introduce _any_ user-facing change?

When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default,
we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in
the list `soc_to_device` in `setup.py`.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-26 14:28:55 +08:00
wangxiyuan
bc69d7cfe1 upgrade to vllm 0.11.2 (#4400)
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
https://github.com/vllm-project/vllm/pull/26866
2. get_mrope_input_positions is broken by
https://github.com/vllm-project/vllm/pull/28399
3. graph mode is broken by
https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
https://github.com/vllm-project/vllm/pull/27583
5. `get_attn_backend_cls` and attention backend is broken are broken by
https://github.com/vllm-project/vllm/pull/28534
6. spec decode is broken by
https://github.com/vllm-project/vllm/pull/28771
7. sp feature is broken by
https://github.com/vllm-project/vllm/pull/27126
8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922
9. lora is broken by https://github.com/vllm-project/vllm/pull/21068
10. execute_model is broken by
https://github.com/vllm-project/vllm/pull/26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
https://github.com/vllm-project/vllm/pull/28159
12. kv cahe is broken by https://github.com/vllm-project/vllm/pull/27753
13. dp is broken by https://github.com/vllm-project/vllm/pull/25110

 
What's broken and changed by ourself:
1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend
scheudler later.
4. qwen3-next is broken by
https://github.com/vllm-project/vllm/pull/28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache+ ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: shen-shanshan <467638484@qq.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-11-26 11:48:58 +08:00
shiyuan680
d5f77f14d0 mkdir triton package and move triton files (#4420)
### What this PR does / why we need it?
mkdir triton package and move triton files

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-26 11:06:12 +08:00
wangxiyuan
98031653df [misc] Remove useless patch_logits (#4252)
Torch-npu 2.7.1 has fixed the device check bug. This patch can be
removed now.

- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-25 21:25:54 +08:00
Shanshan Shen
4864909648 [MM][Bugfix] Minor fix for VL model verification (#4384)
### What this PR does / why we need it?

To fix ops test, where `model_config` has been set to `None` and doesn't
has `hf_config` attribute, we have added a check for `model_config` to
guarantee it is not `None_Type`.

- vLLM main:
2918c1b49c

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-25 20:36:16 +08:00
Zhijun Chen
463910e686 [Bugfix] use module-level import for patched function in Qwen3Next (#4354)
### What this PR does / why we need it?

**Problem**: The Qwen3Next model implementation currently imports
chunk_gated_delta_rule directly using `from ... import ...`

In frameworks like `verl`, the model file is often imported before
`vllm-ascend` initializes and applies its patches. This causes the model
to permanently hold a reference to the original (unpatched) vLLM kernel,
resulting in execution errors on Ascend devices even if the patch runs
later.

**Solution**: Changed the import style to `from vllm...ops import chunk`
and call `chunk.chunk_gated_delta_rule().`

This ensures that the function lookup happens at runtime (dynamic
dispatch), allowing the model to correctly pick up the patched function
regardless of import order.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zjchenn <zjchenn@gmail.com>
2025-11-25 20:15:43 +08:00
欧派果奶我还要
31a2c09e79 [Bugfix] fix patch typo (#4351)
### What this PR does / why we need it?
Fix a bug caused by this pr:
https://github.com/vllm-project/vllm-ascend/pull/4223
The bug makes
vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py patch
in a wrong way

### How was this patch tested?
Tested in a single node. When the environment DYNAMIC_EPLB is set to
true, the patch works correctly. When it's set to false, the patch do
not patch
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
2025-11-25 15:13:06 +08:00
wujinyuan1
06f6cc1c81 [Bugfix]Fix the hang issue of multimodal model when running with DP>1 (#4392)
### What this PR does / why we need it?
When cudagraph_mode is set to FULL_DECODE_ONLY, if dp > 1, the dummy-run
process will be triggered. When calling the update_attn_params function,
the num_tokens parameter needs to be passed, and this value is obtained
through positions.shape[0]. However, the multimodal model uses mRope
(multi-dimensional rotary positional embeddings), which causes the shape
of positions to be 2. As a result, the value obtained from
positions.shape[0] is incorrect. We solve this problem by replacing
positions.shape[0] with num_tokens.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
2025-11-25 09:33:49 +08:00
Tjh-UKN
00ea61ec88 [feature] vllm-ascend support msprobe (eager mode dump) (#4241)
### What this PR does / why we need it?
vllm-ascend need to dump data during model execution to debug some
precision problems, here msprobe provide the corresponding abilities, so
msprobe will join vllm-ascend to make debug easier

### Does this PR introduce _any_ user-facing change?
```
'dump_config': '/path/to/config.json'
```



- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Tjh-UKN <2559659915@qq.com>
2025-11-24 21:58:31 +08:00
weichen
5b1a7514eb [Bugfix][MoE] enable force_load_balance in aclgraph (#4366)
### What this PR does / why we need it?
Temporarily fix the oom issue, will update to vllm's plan later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e&ut

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-11-24 20:33:56 +08:00
weijinqian0
ae068a3342 [Refactor] remove moe type of multicast. (#4224)
The main purposes of this PR are as follows: 
1. Remove the multicast-related code; 

Reason:
1. In the scenario like a2 Dual-System Back-to-Back Networking,the
performance is worse than all_gather. Before the modification, in e2e
test, it was 3 tps; after the modification, it is 10 tps.
2. At the same time, we usually enable the SP feature,it is consistent
with the current logic.
3. The advantage of broadcast communication lies in the fact that it
does not suffer from uneven DP load and does not require the prefill ACL
graph to be enabled. But we support prefill Acl graph recently.

So we think there is no need to maintain the multicast as one choice in
moe communication.

Performance benefits are as follows:
When not enable_flashcomm1, TTFT remains relatively stable at around
43000ms, which is approximately 15000ms faster than before the
modification.

When enable_flashcomm1, there is no diffenence, TTFT remains relatively
stable at around 29000ms.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian0 <1184188277@qq.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-11-24 17:32:37 +08:00
wangxiyuan
a1f142b7ad Drop 0.11.0 support (#4377)
There is a lot hack code for v0.11.0, which makes the code hard to
upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's
drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-24 17:08:20 +08:00
Yizhou
97999347c8 [Fix] Remove unnecessary NPU synchronization in MTP proposer (#4325)
### What this PR does / why we need it?
Remove unnecessary NPU synchronization in MTP proposer to improve
performances.

Removing this synchronization point improves pipeline efficiency by
allowing for better overlap between CPU and NPU operations. A more
proper one is already implemented in #4233

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-24 14:07:10 +08:00
pz1116
ea3372fb0c [Bugfix][KV Pool]fix get_ip import in mooncake_store (#4355)
### What this PR does / why we need it?
fix import error for get_ip() in vllm main branch

### Does this PR introduce _any_ user-facing change?
N

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: pz1116 <zpbzpb123123@gmail.com>
2025-11-22 18:52:48 +08:00
Angazenn
9b3a484b46 [BugFix] Fix some issues caused by the ascending order of cudagraph_capture_sizes (#4338)
### What this PR does / why we need it?
In [#26016](https://github.com/vllm-project/vllm/pull/26016), vllm
change the `cudagraph_capture_sizes` to be in ascending order. This PR
fixes related issues caused by this.
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-22 17:33:12 +08:00
LI SHENGYONG
3955bf2908 [EPLB] Eplb Verify Fix (#4333)
### What this PR does / why we need it?
Eplb Verify Fix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-21 18:17:46 +08:00
wangxiaochao
3deeea14a0 [bugfix] bugfix for PD disaggregate (#4319)
This PR is used to fix mooncake_connector in pcp/dcp case. When
executing function update_done_task_count, it is necessary to ensure
that both pcp/dcp and TP ranks have finished transferring KV cache.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
2025-11-21 18:08:56 +08:00
Shanshan Shen
8e3b834bf7 [MM][Bugfix] Add error log for VL models when enabling FLASHCOMM (#4272)
### What this PR does / why we need it?
Add error log for VL models when enabling
`VLLM_ASCEND_ENABLE_FLASHCOMM1=1` or `VLLM_ASCEND_ENABLE_FLASHCOMM=1`
(for backward compatibility).

This is a temporary fix for
https://github.com/vllm-project/vllm-ascend/issues/4132.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-21 15:04:18 +08:00
LI SHENGYONG
4573c855b7 [Readme] EPLB Support Scenarios (#4314)
### What this PR does / why we need it?
Add information on the scope of EPLB support.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:24:54 +08:00
LI SHENGYONG
019c7ded91 eplb redundant expert bugfix (#4291)
### What this PR does / why we need it?
Redundant experts bugfix
### Does this PR introduce _any_ user-facing change?
After configuring the path for experts_map, users do not need to
configure iinit_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:24:35 +08:00
InSec
5a4e8cdeba [Feat][BugFix]Support the Qwen3-Next-80B-A3B-Instruct quantization model&Fix the NZ issue (#4245)
### What this PR does / why we need it?
Support the Qwen3-Next-80B-A3B-Instruct quantization model and Fix the
NZ issue. Triton kernel doesn't support data format nz, thus we skip
converting weight to nz on layer `conv1d`

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: IncSec <1790766300@qq.com>
2025-11-21 10:42:56 +08:00
anon189Ty
5c9f4a40c6 [Feat] Support MTP to running in full graph mode (#3892)
### What this PR does / why we need it?
Currently, the MTP model still runs in eager in full graph mode. This PR
adapts the MTP with the full graph capture and execution. When the graph
mode is set to "FULL_DECODE_ONLY", the MTP will run in full-graph to
improve the performance.

The change in both disable_padded_drafter_batch is True and False case
include:

1. Add _mtp_graph_params in acl_graph.py to isolate the data of main
model and the data of MTP.
2. Padding some metadata in mla_v1.py when in fullgraph mode.
3. Fixed the essential data address that will be used in model.forward.
4. Adapted according to the aclgraph capture framwork:
    1). Rebuild MTP model with ACLGraphWrapper.
    2). Add common attn metadata when start capture in MTP dummy_run.
    3). Add common attn metadata update in MTP.
    4). Addapted data update when num_speculative_tokens > 1.
5. Add a patch of MTP to adapt vllm v0.11.0.

Existing Issues:
1. When disable_padded_drafter_batch=True and running in FullGraph mode,
the data of the first-round requests in MTP is abnormal. We need to
identify the cause subsequently.
2. When disable_padded_drafter_batch=False and running in FullGraph
mode, the acceptance rate of the second and third tokens will decrease
(For example, if we set the num_speculative_tokens=3, the acceptance
rate of first token is 90%, the second is only 50% lower than 60%, the
third is only 20% lower than 30%). The reason is that the data processed
after the model runs does not match. This is a problem from another PR.
It works fine in eager and PIECEWISE mode, but has problem in FullGraph
mode. Once we have a solution, we will submit a bugfix.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-11-20 20:34:54 +08:00
shaopeng-666
3653f33878 avoid mrope fusion op when running qwen2.5-vl on a+x machine (#4270)
### What this PR does / why we need it?
avoid mrope fusion op when running qwen2.5-vl on a+x machine
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Test text VQA accuracy on G8600 with aisbench

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-11-19 22:31:14 +08:00
欧派果奶我还要
c848da0687 [Bugfix] fix nightly multi-node EPLB tests' "DYNAMIC_EPLB=true" environment not working (#4223)
### What this PR does / why we need it?
fix nightly multi-node EPLB tests by adjusting
vllm_ascend\eplb\core\eplb_utils.py dynamic_eplb gate checking
### Does this PR introduce _any_ user-facing change?
no

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 21:31:58 +08:00
Delphine-Nic
a3e9673137 [long seq feat]GQA support long-prefill-token-threshold and fixbug (#4209)
### What this PR does / why we need it?
GQA chunk prefill with pcp and dcp support long-prefill-token-threshold

The markdown format results is as below:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.13 |

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: Delphine-Nic <t00608739@china.huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <t00608739@china.huawei.com>
2025-11-19 18:10:27 +08:00
wangxiyuan
2938bd5ad2 remove get_metadata_cls (#4087)
remove get_metadata_cls. It's only used for V0 engine and has been removed from vLLM already.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-19 14:58:17 +08:00
realliujiaxu
1cdf9ffa73 [Bugfix] fix hang in async scheduling (#4233)
### What this PR does / why we need it?

After https://github.com/vllm-project/vllm-ascend/pull/4113, there is no
synchronization between steps. However, in async scheduling with
aclgraph, it is possible that the CPU's record event for the current
iteration completes before the previous iteration's graph execution has
finished.

If cpu is fast enough, device will hang on event_wait in interation i+1
(assume that event_record is executed immediately on update stream of
device):
<img width="1812" height="489" alt="image"
src="https://github.com/user-attachments/assets/373fe655-afe5-4d7d-807e-b0aacf24a543"
/>

after add synchonization, record is launched after graph replay:
<img width="1803" height="466" alt="image"
src="https://github.com/user-attachments/assets/a8a68053-bd7d-49f5-a79c-9a26ef1285cc"
/>

bubble time caused by synchronization is about 85 us on G8600:
<img width="1491" height="804" alt="image"
src="https://github.com/user-attachments/assets/968611ee-f39a-4329-8150-1c4adba25dd1"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
2025-11-19 14:47:19 +08:00
zhangsicheng5
df777e9faa [bugfix] pcp + mtp acl graph bugfix (#4221)
Fix pcp + mtp bug while using acl graph.
While using pcp + mtp, we need to flatten block_table to avoid irregular
attn mask shape, this was done in mla attn_metadata builder, but we
found out that this influences block_table address and leads to
incorrect results while enable acl graph.
To fix this, we enlarge block_table buffer size and flatten block_table
in model_runner prepare_inputs, so this will not influence block_table
address.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-11-19 11:21:46 +08:00
1092626063
9328f377b4 [refactor]support gatingtopk operator generalization (#2958)
### What this PR does / why we need it?

Past:
npu_moe_gating_top_k can only support 'group_count=256' pattern

Now:
1、npu_moe_gating_top_k support all size of group_count
2、the functionality of `torch_npu.npu_moe_gating_top_k_softmax` are
included in `torch_npu.npu_moe_gating_top_k`

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8, TPS improve 6%
2. Qwen3, the same as before

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-11-19 10:38:56 +08:00
Yizhou
63561d6763 [Fix] Sorts aclgraph batch sizes in ascending order (#4230)
### What this PR does / why we need it?
Sorts aclgraph batch sizes in ascending order, corresponding to vLLM
[#26016](https://github.com/vllm-project/vllm/pull/26016)

Ensures batch sizes for aclgraph are sorted ascending when aclgraph mode
is enabled, improving consistency and compatibility with later logic
that may depend on order.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Waiting for #3886 

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-19 09:36:37 +08:00
liziyu
e98543267a [bugfix] fix proxy hen host ip using domain name (#4243)
### What this PR does / why we need it?
fix proxy when host ip using domain name

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-18 16:30:51 +08:00
liziyu
a30261f779 [P/D] pd proxy support ipv6 (#4161)
### What this PR does / why we need it?
pd proxy support ipv6, mooncake connector check whether the IPv6 address
is used and notify the user.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-18 11:01:13 +08:00
wangxiaochao
0d04ad8c8f [feature] Mooncake_connector support pcp/dcp (#4183)
add feature for Mooncake_connector supporting pcp/dcp

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
2025-11-18 10:17:48 +08:00
Angazenn
10a046ddce [main][misc]change default capture size for Qwen3-MoE when using full dp (#4199)
### What this PR does / why we need it?
Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4 ,8
,16 ,24 ,... , max_capture_size]`. However, this is not always the best
choice on different situations. This PR aims to change the default
setting when running Qwen3-MoE on full dp (`dp_size > 1` && `tp_size ==
1`) setting, which is usually applied in Large-Scale EP.
old :
`[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]`
new:
`[1, 2, 5 ,10 ,15, 16 ,24 ,... , max_capture_size]`
This is mainly because the performance of `_npu_paged_attention` op
degrades dramatically on old settings. We hope to provide better
performance if users do not set specific `cudagraph_capture_size`.
### Does this PR introduce _any_ user-facing change?
The default `cudagraph_capture_size` is modified in above cases.
However, if `cudagraph_capture_size` has already set by users, this PR
won't have any influence on this.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-18 08:41:45 +08:00