1665 Commits

Author SHA1 Message Date
shaopeng-666
3653f33878 avoid mrope fusion op when running qwen2.5-vl on a+x machine (#4270)
### What this PR does / why we need it?
avoid mrope fusion op when running qwen2.5-vl on a+x machine
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Tested text VQA accuracy on G8600 with aisbench

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-11-19 22:31:14 +08:00
欧派果奶我还要
c848da0687 [Bugfix] fix nightly multi-node EPLB tests' "DYNAMIC_EPLB=true" environment not working (#4223)
### What this PR does / why we need it?
Fix the nightly multi-node EPLB tests by adjusting the dynamic_eplb gate
check in vllm_ascend/eplb/core/eplb_utils.py
### Does this PR introduce _any_ user-facing change?
no

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 21:31:58 +08:00
Delphine-Nic
a3e9673137 [long seq feat]GQA support long-prefill-token-threshold and fixbug (#4209)
### What this PR does / why we need it?
GQA chunked prefill with PCP and DCP now supports long-prefill-token-threshold

The accuracy results are as follows:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.13 |

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: Delphine-Nic <t00608739@china.huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <t00608739@china.huawei.com>
2025-11-19 18:10:27 +08:00
wangxiyuan
2938bd5ad2 remove get_metadata_cls (#4087)
Remove get_metadata_cls. It was only used by the V0 engine and has already been removed from vLLM.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-19 14:58:17 +08:00
realliujiaxu
1cdf9ffa73 [Bugfix] fix hang in async scheduling (#4233)
### What this PR does / why we need it?

After https://github.com/vllm-project/vllm-ascend/pull/4113, there is no
synchronization between steps. However, in async scheduling with
aclgraph, it is possible that the CPU's record event for the current
iteration completes before the previous iteration's graph execution has
finished.

If the CPU is fast enough, the device will hang on event_wait in iteration i+1
(assuming event_record is executed immediately on the device's update stream):
<img width="1812" height="489" alt="image" src="https://github.com/user-attachments/assets/373fe655-afe5-4d7d-807e-b0aacf24a543" />

After adding the synchronization, the record is launched after graph replay:
<img width="1803" height="466" alt="image" src="https://github.com/user-attachments/assets/a8a68053-bd7d-49f5-a79c-9a26ef1285cc" />

The bubble time caused by the synchronization is about 85 µs on G8600:
<img width="1491" height="804" alt="image" src="https://github.com/user-attachments/assets/968611ee-f39a-4329-8150-1c4adba25dd1" />
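
For reference, a minimal sketch of the record/wait pattern with the added synchronization (simplified names; it assumes the torch.npu stream/event API that mirrors torch.cuda, and is not the PR's actual code):

```python
# Minimal sketch, not the PR's actual code; assumes torch_npu is installed so
# the torch.npu stream/event API (mirroring torch.cuda) is available.
import torch
import torch_npu  # noqa: F401

stream = torch.npu.current_stream()
step_done = torch.npu.Event()

def run_step(graph):
    graph.replay()            # replay iteration i's captured ACL graph
    stream.synchronize()      # added sync: make sure the replay has finished (~85 us bubble)
    step_done.record(stream)  # the record is now launched only after graph replay

# Iteration i+1 then waits on the event before consuming iteration i's outputs:
# step_done.wait(stream)
```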

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
2025-11-19 14:47:19 +08:00
zhangsicheng5
df777e9faa [bugfix] pcp + mtp acl graph bugfix (#4221)
Fix a pcp + mtp bug when using ACL graph.
When using pcp + mtp, we need to flatten block_table to avoid an irregular
attn mask shape. This was done in the MLA attn_metadata builder, but we
found that it changes the block_table address and leads to incorrect
results when ACL graph is enabled.
To fix this, we enlarge the block_table buffer size and flatten block_table
in model_runner prepare_inputs, so the block_table address is not affected.
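
A minimal sketch of the idea (illustrative shapes and names only, not the PR's actual code):

```python
# Minimal sketch with illustrative shapes/names, not the PR's actual code;
# assumes torch_npu so that device="npu" tensors are available.
import torch
import torch_npu  # noqa: F401

MAX_FLAT_BLOCKS = 64  # hypothetical enlarged buffer capacity
block_table_buffer = torch.zeros(1, MAX_FLAT_BLOCKS, dtype=torch.int32, device="npu")

def prepare_block_table(block_table: torch.Tensor) -> torch.Tensor:
    # Flatten the per-request block table and copy it into the persistent
    # buffer, so the tensor address captured by the ACL graph never changes.
    flat = block_table.reshape(1, -1).to(torch.int32)
    block_table_buffer[:, : flat.shape[1]].copy_(flat)
    return block_table_buffer
```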

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-11-19 11:21:46 +08:00
1092626063
9328f377b4 [refactor]support gatingtopk operator generalization (#2958)
### What this PR does / why we need it?

Past:
npu_moe_gating_top_k only supported the `group_count=256` pattern.

Now:
1. npu_moe_gating_top_k supports all sizes of group_count.
2. The functionality of `torch_npu.npu_moe_gating_top_k_softmax` is
included in `torch_npu.npu_moe_gating_top_k`.

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8: TPS improves by 6%.
2. Qwen3: the same as before.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-11-19 10:38:56 +08:00
Yizhou
63561d6763 [Fix] Sorts aclgraph batch sizes in ascending order (#4230)
### What this PR does / why we need it?
Sorts aclgraph batch sizes in ascending order, corresponding to vLLM
[#26016](https://github.com/vllm-project/vllm/pull/26016)

Ensures batch sizes for aclgraph are sorted ascending when aclgraph mode
is enabled, improving consistency and compatibility with later logic
that may depend on order.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Waiting for #3886 

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-19 09:36:37 +08:00
liziyu
e98543267a [bugfix] fix proxy when host ip using domain name (#4243)
### What this PR does / why we need it?
Fix the proxy when the host IP uses a domain name.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-18 16:30:51 +08:00
liziyu
a30261f779 [P/D] pd proxy support ipv6 (#4161)
### What this PR does / why we need it?
The PD proxy now supports IPv6; the Mooncake connector checks whether an
IPv6 address is used and notifies the user.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-18 11:01:13 +08:00
wangxiaochao
0d04ad8c8f [feature] Mooncake_connector support pcp/dcp (#4183)
Add support for pcp/dcp to Mooncake_connector.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
2025-11-18 10:17:48 +08:00
Angazenn
10a046ddce [main][misc]change default capture size for Qwen3-MoE when using full dp (#4199)
### What this PR does / why we need it?
Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4, 8,
16, 24, ..., max_capture_size]`. However, this is not always the best
choice in different situations. This PR changes the default setting when
running Qwen3-MoE in a full-DP (`dp_size > 1` and `tp_size == 1`)
configuration, which is typically used in large-scale EP.
Old:
`[1, 2, 4, 8, 16, 24, ..., max_capture_size]`
New:
`[1, 2, 5, 10, 15, 16, 24, ..., max_capture_size]`
This is mainly because the performance of the `_npu_paged_attention` op
degrades dramatically with the old settings. We hope to provide better
performance when users do not set a specific `cudagraph_capture_size`.
### Does this PR introduce _any_ user-facing change?
The default `cudagraph_capture_size` is modified in the above case.
However, if `cudagraph_capture_size` has already been set by the user,
this PR has no effect.
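
If users prefer a specific list, it can still be set explicitly. A minimal sketch (assuming vLLM's generic `compilation_config` knob with `cudagraph_capture_sizes` and a hypothetical model path; not a vllm-ascend-specific API):

```python
# Minimal sketch: explicitly overriding the capture sizes via vLLM's generic
# compilation_config (assumed knob, hypothetical model; not vllm-ascend code).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",  # hypothetical Qwen3-MoE checkpoint
    compilation_config={"cudagraph_capture_sizes": [1, 2, 5, 10, 15, 16, 24]},
)
```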

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-18 08:41:45 +08:00
weiguihua2
da1cd9c7ca [Bugfix]Fix moe error when sp chunked the hidden_states (#4212)
### What this PR does / why we need it?
Fix a MoE error when SP chunks the hidden_states, by disabling SP in a hacky way.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-17 22:55:17 +08:00
XiaoxinWang
e38ef2c434 support FULL graph mode for GQA (#3970)
### What this PR does / why we need it?
The current library only supports the FullDecodeOnly graph mode, which
enables full graph execution only during decode. This PR extends support
to allow full graph execution in both the prefill and decode phases,
referred to as FULL graph mode.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-17 10:50:35 +08:00
Canlin Guo
f10251ede0 [Platform] Add import_kernels interface (#3694)
### What this PR does / why we need it?
Add the import_kernels interface to avoid importing the unused vLLM C library.

Closes #3488. Reopen #3498 for CI.

### How was this patch tested?

CI tested.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-11-14 11:32:51 +08:00
Yizhou
094f32c8c9 [Feat] Adds a utility for printing from within ACL graphs (#4162)
### What this PR does / why we need it?
Introduces the `acl_graph_print` function to enable printing debug
information from code running inside an ACL graph, such as custom
operators.

This works by launching a host function on a dedicated stream, bypassing
the limitations of standard `print` within compiled graph execution. The
implementation handles the necessary stream subscriptions and ensures
they are properly unregistered upon exit.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-14 09:41:14 +08:00
weiguihua2
01195e860c [Bugfix] fix cannot import name get_mp_context (#4174)
### What this PR does / why we need it?
Fix a bug where the vllm package could not be imported.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-14 09:09:14 +08:00
LookAround0301
5ec96fd46c [long_seq_Feat] support chunk prefill (#4158)
### What this PR does / why we need it?
1. Qwen GQA attention_v1 optimization.
2. DeepSeek MLA refactor: all-gather q -> all-gather kv.
3. Model runner refactor for chunked prefill; remove some unused code.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
2025-11-14 08:43:37 +08:00
zhaozx-cn
fdd2db097a [BugFix] Fix kv_no_split not contiguous (#3594)
all_gather needs contiguous data, but the split operation returns
non-contiguous data.
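
A minimal illustration of the issue in plain PyTorch (not the PR's actual code):

```python
# Minimal sketch (generic PyTorch, not the PR's actual code): split() returns
# strided views that may be non-contiguous, while collectives such as
# all_gather expect contiguous inputs.
import torch

x = torch.randn(4, 8)
_, kv_no_split = torch.split(x, [3, 5], dim=-1)
print(kv_no_split.is_contiguous())      # False: a strided view into x

kv_no_split = kv_no_split.contiguous()  # densify before handing it to all_gather
print(kv_no_split.is_contiguous())      # True
```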

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zhaozx-cn <zhaozx2116@163.com>
2025-11-13 11:28:09 +08:00
realliujiaxu
5093192769 [Bugfix] fix mtp profile run error where main model and mtp model use different quantization (#4102)
### What this PR does / why we need it?
In PR https://github.com/vllm-project/vllm-ascend/pull/3420, we
initially placed the quantization type (quant_type) in the MoECommMethod
class. However, since MoECommMethod follows a singleton pattern, it
couldn't accommodate scenarios where different layers in the model might
use different quantization approaches (e.g., MTP modules using
floating-point computation while the main model employs quantized
computation).
In this PR, we've moved the quantization type to the AscendFusedMoe
class and pass it as a parameter to MoECommMethod.
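A minimal sketch of the before/after pattern (hypothetical class skeletons, not the PR's actual code):

```python
# Minimal sketch with hypothetical class skeletons, not the PR's actual code:
# a singleton cannot hold per-layer state, so the quant type is passed per call.
class MoECommMethod:
    _instance = None

    def __new__(cls):
        if cls._instance is None:          # singleton: one shared instance
            cls._instance = super().__new__(cls)
        return cls._instance

    def dispatch(self, hidden_states, quant_type):
        # quant_type arrives as an argument, so an MTP layer can pass "float"
        # while the main model passes "w8a8" through the same instance.
        return hidden_states


class AscendFusedMoe:
    def __init__(self, quant_type):
        self.quant_type = quant_type       # each layer owns its own quant type
        self.comm = MoECommMethod()

    def forward(self, hidden_states):
        return self.comm.dispatch(hidden_states, self.quant_type)
```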
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
export HCCL_BUFFSIZE=1024
export VLLM_VERSION=0.11.0

vllm serve /home/data/DeepSeek-R1_w8a8/ \
 --data-parallel-size 2 \
 --tensor-parallel-size 8 \
 --enable-expert-parallel \
 --served-model-name dsv3 \
 --max-model-len 32768 \
 --max-num-batched-tokens 4096 \
 --max-num-seqs 16 \
 --quantization ascend \
 --trust-remote-code \
 --gpu-memory-utilization 0.9 \
 --speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}'
```


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-11-13 11:02:31 +08:00
weichen
17259cb265 [Perf] [MoE] optimize all2allv (#3738)
### What this PR does / why we need it?
1. Replace init_routing_v2 with token_permute to optimize performance.

Note: This PR will be merged after switching CI to CANN 8.3.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vllm bench serve bs = 48 / rr = 10000 / 2k input -> 20k output:
before:
<img width="489" height="488" alt="image" src="https://github.com/user-attachments/assets/268a19e6-9ab2-47f0-84a1-4f6d3bc342e2" />
after:
<img width="480" height="500" alt="image" src="https://github.com/user-attachments/assets/d9b1e628-0520-42d5-8a21-b42f7cd7abc7" />
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-11-13 09:38:11 +08:00
realliujiaxu
6bc770cd78 [Perf] fix async copy for async scheduling (#4113)
### What this PR does / why we need it?
Only CPU tensors with `pin_memory=True` can be asynchronously copied to
the device. Currently, there are two instances where non-pinned CPU
tensors are being copied to the device, which will trigger synchronous
operations, reducing the expected benefits of asynchronous scheduling.
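
A minimal sketch of the underlying rule (generic PyTorch on an Ascend device, not the PR's actual code):

```python
# Minimal sketch (generic PyTorch, torch_npu assumed; not the PR's actual code):
# only pinned host tensors can be copied to the device truly asynchronously.
import torch
import torch_npu  # noqa: F401

pinned = torch.ones(1024, pin_memory=True)
dev1 = pinned.to("npu", non_blocking=True)    # genuinely asynchronous H2D copy

pageable = torch.ones(1024)
dev2 = pageable.to("npu", non_blocking=True)  # silently falls back to a blocking copy
```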

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-11-13 09:11:26 +08:00
22dimensions
c272747d13 Upgrade to 0.11.1 newest vllm commit (#3982)
### What this PR does / why we need it?
Adapt the vllm-ascend main branch to vLLM releases/v0.11.1.

Fix `forward context not set` in test_vlm.py, caused by:
https://github.com/vllm-project/vllm/pull/23207

Fix the failed import of `cdiv round`, caused by:
https://github.com/vllm-project/vllm/pull/27188

Fix the failed import of `init_cached_hf_modules`, caused by:
https://github.com/vllm-project/vllm/pull/27567

Adapt the triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel`, caused
by: https://github.com/vllm-project/vllm/pull/27654
- remove unused code in sigmoid_gating.py:
- `class FusedRecurrentFunction`, `fused_recurrent_gated_delta_rule`,
`fused_recurrent_gated_delta_rule_fwd`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI 


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-11-12 23:01:19 +08:00
Angazenn
fc7e5cd9dc [main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4097)
### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len`
to get the maximum workspace for attention during capturing. However,
setting it consistently to `max_model_len` causes dummy_run to execute a
long attention when running actual inference. For example, if there is a
single request with `seq_lens` of [8] but `max_model_len` is 131072, the
whole process is slowed down by dummy_run as it executes a fake long-seq
attention. Therefore, we instead set it to max_query_len, which is also
consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-12 17:31:39 +08:00
zhangsicheng5
a123f355e9 [feature] support pcp + mtp (in pd co-locate scenario) (#4098)
1. Support pcp + mtp in the PD co-locate scenario.
2. llmdatadist connector pcp-related bugfix and code cleanup.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-11-12 17:22:21 +08:00
XiaoxinWang
1b4ce63ec9 fix fullgraph in ds. (#4016)
### What this PR does / why we need it?
DS doesn't have an 'AscendAttentionMetadataBuilder' class, so it fails in
full graph mode.
We resolved the issue by modifying the code to only check for
'GDNAttentionMetadataBuilder', while all other attention cases follow
the default branch.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-12 10:11:43 +08:00
Yizhou
638dbcdb32 [Perf] Remove D2H operations to improve performance (#4063)
### What this PR does / why we need it?
Replace a masked in-place assignment with a device-side torch.where so the
selection stays on-device, allowing subsequent device ops to be enqueued
earlier and removing an implicit D2H sync, reducing latency by several
hundred μs on Ascend.
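
A minimal sketch of the pattern (illustrative only, not the PR's actual code):

```python
# Minimal sketch (illustrative, torch_npu assumed; not the PR's actual code).
import torch
import torch_npu  # noqa: F401

x = torch.randn(8, device="npu")
mask = x < 0
fill = torch.zeros_like(x)

# Before: boolean-mask in-place assignment can force an implicit
# device-to-host sync while the mask indices are resolved.
# x[mask] = 0.0

# After: torch.where keeps the selection entirely on the device.
x = torch.where(mask, fill, x)
```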

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-12 09:08:55 +08:00
thonean
e38fe92f40 [Misc][Doc] Add service profiling feature with user guide (#3756)
### What this PR does / why we need it?
To support the data collection capabilities of the msServiceProfiler on the
vLLM-Ascend framework and to enable customization of data collection points
via a configuration file, a default profiling configuration has been added
to vllm-ascend, facilitating debugging and optimization for developers
and users.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: minghangc <29514143@qq.com>
2025-11-12 09:07:14 +08:00
zzhxxx
46a41b26d3 oproj TP support acl graph (#4073)
### What this PR does / why we need it?
Reference #2167; oproj TP now supports ACL graph.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2025-11-11 19:39:06 +08:00
wangxiyuan
f811a24bf0 Remove VLLM_USE_V1 (#4086)
Drop VLLM_USE_V1 usage. This env var has already been removed from vLLM.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 15:43:39 +08:00
Apocalypse
71866d5311 [feature] chunkprefill support pcp&dcp (#3801)
### What this PR does / why we need it?
Chunked prefill now supports the long-sequence features PCP & DCP.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI tests passed with self-test


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <3834144971@qq.com>
2025-11-11 09:18:02 +08:00
zhaomingyu13
7ffbe73d54 [main][Bugfix] Fix ngram precision issue and open e2e ngram test (#4090)
### What this PR does / why we need it?
Fix ngram precision issue and open e2e ngram test

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-11-11 09:06:24 +08:00
Icey
e04a87f4be [BugFix] Fixes Qwen3-Next enable nz accuracy problem (#4058)
### What this PR does / why we need it?
- Fixes the Qwen3-Next accuracy problem when NZ is enabled.

### Does this PR introduce _any_ user-facing change?
N/A


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-11-10 20:54:57 +08:00
rjg-lyh
a1558b99c2 [Core] Restore scheduling logic under default configuration (#3967)
### What this PR does / why we need it?
This PR reverts the changes introduced in PR #2894. Initially, due to
performance issues with the older version of the chunked prefill ops,
the default behavior was to use the Ascend scheduler to disable the
chunked prefill feature. However, with the improvements in the
performance of the new chunked prefill ops, this interception strategy
has been removed. This change also aligns with the community's default
configuration behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-11-10 17:48:56 +08:00
Levi
0a62e671fb [Feat] flashcomm_v2 optim solution (#3232)
### What this PR does / why we need it?
Supports generalized FlashComm2 optimization, which reduces
communication overhead, decreases RmsNorm computation, and saves one
AllGather step by replacing Allreduce operations in the Attention module
with pre-AlltoAll and post-AllGather operations (used in combination
with FlashComm1). This feature is enabled during the Prefill phase and
is recommended to be used together with FlashComm1, delivering broad
performance improvements, especially in long sequence scenarios with
large tensor parallelism (TP) configurations. Benchmark tests show that
under TP16DP1 configuration, it can improve the prefill performance of
the DeepSeek model by 8% on top of FlashComm1.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: zzhxx <2783294813@qq.com>
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: zzhxx <2783294813@qq.com>
2025-11-10 11:01:45 +08:00
lilinsiman
a3ff765c65 [Info][main] Corrected the errors in the information (#4055)
### What this PR does / why we need it?
Corrected the errors in the information

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-11-08 18:48:59 +08:00
weiguihua2
1d7cb5880a [Bugfix]fix pcp dcp attn aclgraph (#4066)
### What this PR does / why we need it?
In the DCP-PCP graph mode scenario, there is a shape issue with multiple
batches. This PR fixes this problem.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-08 18:47:12 +08:00
hucong
48094148f8 [BugFix] Improve the performance of prefixcache features (#4022)
### What this PR does / why we need it?
A code bug caused an empty bubble: when the npu_paged_cache_load
operator was called, it forcibly transferred seq_len2 to the device,
which triggered a synchronization and interrupted the CPU's operator
launch stream.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: underfituu <hzhucong@163.com>
2025-11-08 18:45:31 +08:00
zxr2333
1d81a289d0 [P/D][BugFix]Fix proxy format processing errors & Layerwise connector performance optimization (#4043)
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-11-08 18:44:06 +08:00
wangx700
24d6314718 [Bugfix] fix sleepmode level2 e2e test (#4019)
### What this PR does / why we need it?

enable sleepmode level2 e2e test and add the check logic to ensure the
nz is not enabled.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

use e2e tests


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangx700 <wangxin700@huawei.com>
2025-11-08 14:11:55 +08:00
offline893
e687d6af85 [BugFix]Fix group list type of mc2. (#4047)
### What this PR does / why we need it?
Fix an EPLB accuracy problem caused by the PTA upgrade.

### How was this patch tested?
Main:
    baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |

   EPLB:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-11-07 17:41:56 +08:00
drslark
23b785fdfb [Feat] Adapted mtp function to Qwen3-next (#3918)
### What this PR does / why we need it?

Adapts mtp function to Qwen3-next.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-07 16:39:03 +08:00
LookAround0301
79e536d939 [Feat] update op for mla (#4000)
### What this PR does / why we need it?
1. In the mla_v1 module, add the torch_npu.npu_attention_update op when PCP and DCP are enabled.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-11-07 09:48:39 +08:00
LookAround0301
f8610b7d67 [long_seq] fix A2 accuracy problem (#4030)
### What this PR does / why we need it?
1. Update prepare_finalize.py: fix an A2 accuracy problem when PCP and DCP are enabled.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: LookAround <lixushi@huawei.com>
2025-11-07 09:29:33 +08:00
Angazenn
e0d58d543b [main][bugfix] Fix a rare bug triggered by _npu_paged_attention in FULL_DECODE_ONLY mode (#3986)
### What this PR does / why we need it?
This PR fixes a bug where the workspace of `_npu_paged_attention` in
setup is smaller than in execution. For the current implementation of
FULL_DECODE_ONLY with `_npu_paged_attention`, we use
`_npu_paged_attention_get_workspace` when capturing, with `max_model_len`
as `seq_lens`. This assumes that PA with larger `seq_lens` inputs needs a
larger workspace than with smaller `seq_lens`. However, there are rare
cases where PA with smaller `seq_lens` needs a larger workspace. So I add
`get_workspace` directly into `update_attn_params`.
This change might introduce a small (≈1%) performance degradation for low
num_tokens (such as 1) in the decode phase, and there are no other known
memory issues. So I think this change is acceptable. We can remove this
if a new attention op (such as `npu_fused_infer_attention_score`) does not
have such problems.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-06 23:08:07 +08:00
drslark
1804b60ec8 [BugFix][main] Adapted to torch_npu.npu_fused_infer_attention_score (#4025)
### What this PR does / why we need it?

Fixes a compatibility bug with `torch_npu.npu_fused_infer_attention_score`,
which is described in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 told us about this solution.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

The environment is the same as in this issue:
https://github.com/vllm-project/vllm-ascend/issues/4020.

We modify the code according to
https://github.com/vllm-project/vllm-ascend/pull/3918.

And run below codes:

```python
# run with Qwen3-next-mtp

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 1,
          },
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Outputs:

```text
Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
```

Now, `torch_npu.npu_fused_infer_attention_score` is compatible with
Qwen3-Next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-06 22:00:24 +08:00
realliujiaxu
22005c64c1 [Bugfix] Add constraints for sequence parallelism (#4014)
### What this PR does / why we need it?
Add constraints for sequence parallelism to guard unsupported scenarios
(a minimal check sketch follows below):
1. tp_size must be > 1
2. enable_expert_parallel must be True for MoE models
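
```python
# Minimal sketch (hypothetical helper, not the PR's actual code): reject
# sequence parallelism in the unsupported configurations listed above.
def validate_sequence_parallel(tp_size: int, is_moe: bool, enable_expert_parallel: bool) -> None:
    if tp_size <= 1:
        raise ValueError("Sequence parallelism requires tp_size > 1.")
    if is_moe and not enable_expert_parallel:
        raise ValueError("Sequence parallelism with a MoE model requires enable_expert_parallel=True.")
```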

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-11-06 20:02:03 +08:00
weiguihua2
2eebe1dc0a [feat]decode convert bsnd to tnd and fix bug when pcp and dcp (#3980)
### What this PR does / why we need it?
1. In the attention_v1 module, convert BSND to TND when PCP and DCP are enabled.
2. Fix a torchair bug: service startup problem.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-06 14:58:24 +08:00
Liziqi-77
25b24c02ea [Feat](Mooncake) Supports multiple input suffixes for global_segment_size (#3690)
### What this PR does / why we need it?
- global_segment_size and local_buffer_size use constants for unified
management.
- Newly added support for input formats ending with GB, MB, KB, and B,
while staying compatible with the existing input methods (a minimal
parsing sketch follows below).
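
```python
# Minimal sketch (hypothetical helper, not the PR's actual parser): accept
# values such as "4GB", "512MB", "1024KB", "2048B", or a plain byte count.
def parse_segment_size(value) -> int:
    if isinstance(value, int):
        return value                      # existing numeric input still works
    units = {"GB": 1024 ** 3, "MB": 1024 ** 2, "KB": 1024, "B": 1}
    text = str(value).strip().upper()
    for suffix, factor in units.items():  # check GB/MB/KB before the bare "B"
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)                      # plain byte count with no suffix

assert parse_segment_size("2GB") == 2 * 1024 ** 3
```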

### Does this PR introduce _any_ user-facing change?
- Users can use new input methods
- The documentation has also been modified

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: 李子琦 <liziqi_ing@163.com>
2025-11-06 14:48:15 +08:00
zxr2333
b206e831e9 [P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3981)
### What this PR does / why we need it?
Make kv-transfer env variable take effect and Fix load-balance proxy.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-11-06 12:02:47 +08:00