1191 Commits

Author SHA1 Message Date
Zetong Li
66b67f9cf2 [Bugfix][SHM] Fix weak memory ordering problem in share memory (#3988)
### What this PR does / why we need it?
This PR fixes a weak memory ordering problem in shared memory by
patching the message queue with an additional lock. The issue is
described in detail at https://github.com/vllm-project/vllm/issues/27858.
The key point is to use the writer lock to enforce a memory fence before
the ready flag is set (`metadata_buffer[0] = 1`).

This is a temporary solution. Enable it by setting the environment
variable `SHM_BARRIER=true`; it is disabled by default.

### Does this PR introduce _any_ user-facing change?
`SHM_BARRIER=true` enables this change while `SHM_BARRIER=false`
disables this change. The latter is the default choice.

### How was this patch tested?
by ci

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-11-04 23:07:23 +08:00
zxr2333
954dab64fb [v0.11.0][P/D]Set adxl as default backend and update readme (#3771)
### What this PR does / why we need it?
Set adxl engine as the default Mooncake backend, because Ascend
Transport is no longer maintained.
Update README to include instructions for installing the adxl backend
Mooncake.

### Does this PR introduce _any_ user-facing change?
Users need to compile and install the mooncake backend for adxl
according to the revised README instructions.

### How was this patch tested?
By CI.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-11-04 16:06:58 +08:00
leo-pony
0cead5c1ee Quality enhancement: Immediately interrupt execution when allocate NPU memory OOM (#3944)
### What this PR does / why we need it?
Preserve the site of the first failure. Execution should be interrupted
as soon as an NPU memory allocation fails, rather than continuing until
an illegal address is accessed.


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-04 08:55:22 +08:00
Mengqing Cao
7cc6208029 [0.11.0][MTP][Aclgraph] Fix the support aclgraph with MTP (#3912)
### What this PR does / why we need it?
Fix two breakages of aclgraph with MTP:
1. deepseekmtp in vllm 0.11.0 does not support aclgraph and lacks the
`support_torch_compile` decorator.
2. There is a d2h synchronization in the original forward of the MTP
predictor. The fix PR in vllm is
https://github.com/vllm-project/vllm/pull/27643

As this will be fixed in vllm main, this fix PR is only needed on branch
v0.11.0-dev.

Profiling shows that MTP now replays in aclgraph:
<img width="1612" height="1866" alt="a7d7f04155df4ed454b7eb20a92b2e2a"
src="https://github.com/user-attachments/assets/eaa4b9ff-aeb0-416d-964f-5a06e497f155"
/>

### How was this patch tested?

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-11-03 14:25:37 +08:00
wangxiyuan
8a7154001e [0.11.0]Chery pick pta upgrade change (#3940)
This PR cherry-picks two commits from main to upgrade torch-npu to the
2.7.1 official release.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-31 22:14:26 +08:00
rjg-lyh
3d81ea03ed [v0.11.0-dev][bugfix] fix valueError in static_forward_context when prefix is empty (#3929)
### What this PR does / why we need it?
This PR temporarily bypasses the scenario where some models in vLLM
trigger a `ValueError` during the process of storing values in
`static_forward_context` when no `prefix` is specified for the linear
layers, which is a bug in some models in vLLM. The official fix will be
addressed by submitting a PR to the vLLM community that specifies a
prefix for the linear layers in each model.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-31 15:45:06 +08:00
Nagisa125
9f7de45b75 [Bugfix] fix MTP support for lmhead_tensor_parallel_size (#3921)
### What this PR does / why we need it?
Fix an issue where enabling MTP and setting
`lmhead_tensor_parallel_size=16` causes inference to hang.


Signed-off-by: wyh145 <1987244901@qq.com>
2025-10-31 14:34:28 +08:00
lilinsiman
ee2e55e602 [v0.11.0][Test] Add new test model for aclgraph single_request v0.11.0 (#3889)
### What this PR does / why we need it?
add new test model for aclgraph single_request v0.11.0

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 11:23:55 +08:00
zouyida2052
90aca84e60 fix bug when max_seqs=14 in mtp=2 scenario and raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len (#3909)
### What this PR does / why we need it?
1. Revert [bugfix for mtp in fullgraph](0948483642); support it again
once vllm supports it.
2. Raise an error when cudagraph_capture_sizes is not an integer multiple
of uniform_decode_query_len.
3. Fix a bug when max_num_seqs=14 in the mtp=2 scenario.
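
The check in item 2 can be sketched as follows. This is an illustration
under assumptions, not the code from the PR: the function name
`validate_capture_sizes` and the error message are hypothetical; only the
two parameter names come from the description above.

```python
from typing import Iterable, List


def validate_capture_sizes(cudagraph_capture_sizes: Iterable[int],
                           uniform_decode_query_len: int) -> List[int]:
    """Reject capture sizes that are not multiples of the decode query length."""
    sizes = list(cudagraph_capture_sizes)
    bad = [s for s in sizes if s % uniform_decode_query_len != 0]
    if bad:
        raise ValueError(
            f"cudagraph_capture_sizes {bad} are not integer multiples of "
            f"uniform_decode_query_len={uniform_decode_query_len}")
    return sizes


ok_sizes = validate_capture_sizes([3, 6, 9], uniform_decode_query_len=3)
```

Failing early at configuration time makes the mismatch visible before any
graph is captured.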

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-31 09:25:06 +08:00
lilinsiman
387ce1cc5b add new e2e tests case for aclgraph memory to v0.11.0 (#3880)
### What this PR does / why we need it?
add new e2e tests case for aclgraph memory to v0.11.0

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 09:17:09 +08:00
wangxiaoteng888
38afd2c9cb [bugfix_v0.11.0]cancel tokenize for layerwise_proxy (#3913)
### What this PR does / why we need it?
cancel tokenize for layerwise_proxy
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-10-30 23:55:04 +08:00
wangxiaoteng888
af7a56550b [bugfix_v0.11.0-dev] layerwise D first plan (#3907)
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-10-30 22:21:11 +08:00
offline893
d5a9aba03f [BugFix]Fix group list type of mc2. (#3890)
### What this PR does / why we need it?
Fix the precision issue caused by the inconsistency between the group
list type used by mc2 and that of eplb.

---------

Signed-off-by: offline0806 <3337230449@qq.com>
2025-10-30 21:44:14 +08:00
weichen
c506ba60fb [v0.11.0] [Bugfix] [MoE]fix error in deepseek when using allgather (#3827)
### What this PR does / why we need it?
After refactoring vllm_ascend/models and FusedMoE, we are unable to pass
`gate` from deepseekv2.py to `AscendFusedMoE.forward`, which will result
in error when running deepseek v3/r1 with allgather.
Hence, this pr removes `gate` related computations from FusedMoE module
in eager/aclgraph mode.
### Does this PR introduce _any_ user-facing change?
`rm_router_logits` is deprecated in eager/aclgraph.
### How was this patch tested?
e2e & ut

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-30 14:59:46 +08:00
whx
211d4b9da4 [BugFix] Fix mlapo accuracy problem related with weight processing. (#3857)
This PR fixes a mlapo accuracy problem related with weight processing.
Furthermore, modify mlapo related e2e test with quantized deepseek model
to make it effective.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-30 00:35:50 +08:00
zouyida2052
d9249c968e bugfix for mtp in fullgraph (#3878)
### What this PR does / why we need it?
bugfix for mtp in fullgraph

### Does this PR introduce _any_ user-facing change?
no

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-29 23:52:20 +08:00
fems14
19f49ecb5f [0.11.0][Bugfix]fix_mulit_connector_bug (#3332) (#3882)
### What this PR does / why we need it?
When using the multi connector, the multi connector does not define
get_finished_count, which will cause the kv cache to be released.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19


Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: baxingpiaochong <771405853@qq.com>
2025-10-29 23:44:52 +08:00
liziyu
e5b938c5fe [v0.11.0] [P/D] force with_prefill true after allreduce in kv producer (#3835)
### What this PR does / why we need it?
force with_prefill true after allreduce in kv producer. This is a backport of #3768 and #3849

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-10-29 23:14:00 +08:00
Wang Yixuan
b323be9fe4 deepseek torchair adapt for torch_npu version (#3876)
### What this PR does / why we need it?
Adapt to the torch_npu version to avoid the precision problem of
torchair deepseek. Depending on the torch_npu version, op registration
may take different branches: the rms_norm op has two branches selected
by the version check. This PR unifies rms_norm in torchair via a patch.
See #3862.

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-29 22:44:44 +08:00
realliujiaxu
29bd9235ed [v0.11.0][Perf] Delete redundant operations in model_runner and forward_context (#3775)
cherry pick https://github.com/vllm-project/vllm-ascend/pull/3677

Remove redundant operations from `model_runner` and `forward_context`.
This optimization can significantly reduce the idle time (bubble) before
decoding when running models with small parameter counts (e.g.,
Qwen/Qwen2.5-0.5B).

Testing on 800I A2, the bubble is reduced from 3.8 ms to 2.8 ms:
Before
<img width="1655" height="696" alt="image"
src="https://github.com/user-attachments/assets/d7608e52-2438-46dd-8fc9-391fd6274495"
/>

After
<img width="1607" height="774" alt="image"
src="https://github.com/user-attachments/assets/56daf081-2dba-4d2e-99d4-e055187d9806"
/>

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-10-29 15:58:53 +08:00
zhangxinyuehfad
75de3fa172 [v0.11.0][Doc] Update doc (#3852)
### What this PR does / why we need it?
Update doc


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-29 11:32:12 +08:00
ZYang6263
6188450269 [v0.11.0][Bugfix]Avoid using the fusion operator in the MOE model (#3837)
### What this PR does / why we need it?
The current MatmulReduceScatter operator suffers performance degradation
in small-shape scenarios, so this PR decides whether to use the operator
based on the size of the shape.


---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-28 23:31:19 +08:00
Shirley125
e48ca0b6ec [bugfix][0.11]fix proxy decode bug (#3751)
### What this PR does / why we need it?
fix proxy decode bug while parsing non-UTF-8 characters.

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
2025-10-27 16:56:50 +08:00
Yizhou
43276fd822 [v0.11.0][Fix] Prevent memory leak in MLA decode graph (#3743) (#3774)
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.

This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
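
A minimal sketch of caching through weak references, assuming a simple
dictionary-backed cache. `FakeTensor` and `GraphParamCache` are
hypothetical stand-ins for illustration; the actual cache in the PR is not
shown in this log.

```python
import weakref


class FakeTensor:
    """Stand-in for a tensor; any weak-referenceable object works."""

    def __init__(self, name: str) -> None:
        self.name = name


class GraphParamCache:
    """Cache that does not keep its entries alive."""

    def __init__(self) -> None:
        self._refs = {}

    def put(self, key: str, tensor: FakeTensor) -> None:
        # Store only a weak reference, so the cache alone never
        # prevents the tensor from being deallocated.
        self._refs[key] = weakref.ref(tensor)

    def get(self, key: str):
        ref = self._refs.get(key)
        # A dead reference returns None when called.
        return ref() if ref is not None else None


cache = GraphParamCache()
tensor = FakeTensor("q")
cache.put("q", tensor)
alive_before = cache.get("q") is tensor
del tensor  # drop the last strong reference
```

Callers must handle a `None` result from `get`, since the referent may
already have been collected.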

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-27 16:00:20 +08:00
Ruri
825fdfb197 [v0.11.0][Feat] Prefetching Attention QKV Linear Weight With AddRmsNormQuant Custom Op (#3649)
### What this PR does / why we need it?

- `qkv_proj.weight` prefetching was implemented with the `Quant` op, so
when `AddRmsNormQuant` is enabled (#3465), `qkv_proj.weight` prefetching
does not work.
- This PR implements `qkv_proj.weight` prefetching with `AddRmsNormQuant`,
which has already been merged on the `main` branch (#3517).

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Tested on `Qwen3-235B-A22B-W8A8`
<img width="1868" height="109" alt="image"
src="https://github.com/user-attachments/assets/0bc28082-0287-4d5c-b8f6-f907c3134d36"
/>


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-10-27 09:42:09 +08:00
Mengqing Cao
1b16c01afd [v0.11.0-dev][Installation] limit opencv-python-headless version to resolve numpy version conflict (#3767)
### What this PR does / why we need it?
vllm requires opencv-python-headless >= 4.11.0, which in turn requires
(numpy<2.3.0,>=2), but vllm-ascend requires a numpy version below 2.0.0.
Limiting opencv-python-headless to below 4.11.0.86 resolves this
conflict.

backport of
afc58184ec

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
2025-10-25 18:18:28 +08:00
whx
a58ff9e92f [Cherry-pick] Port MoE multi-stream fix to v0.11.0-dev (#3753)
This PR moves the communication operation of shared experts out of extra
stream because I found that this might cause rtMemcpy related errors
when running shared experts multistream with aclgraph.

Furthermore, I utilize a global variable as extra stream object to avoid
allocating streams for each layer in full-graph mode.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 15:51:43 +08:00
Yizhou
1bc61031e5 [v0.11.0][Fix] Cap max tokens to prevent potential OOM (#3720) (#3744)
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.

This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.
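
The cap amounts to a single `min`. In this sketch the helper name
`capped_max_tokens` and the constant name are hypothetical; only the value
512 comes from the description above.

```python
MAX_TOKENS_CAP = 512  # upper bound stated in the PR description


def capped_max_tokens(calculated: int, cap: int = MAX_TOKENS_CAP) -> int:
    """Bound the calculated token count so the buffer cannot grow past the cap."""
    return min(calculated, cap)


large = capped_max_tokens(4096)
small = capped_max_tokens(128)
```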

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 15:46:56 +08:00
fems14
99e154dc84 [0.11.0] cherry-pick from #3747 (#3746)
Cherry-pick from #3747.

Correct the placement of the `_register` function for mooncake.

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-25 14:21:30 +08:00
shaopeng-666
fed8145aea [cherry-pick][Feat] Add mrope fusion op#3708 (#3735)
### What this PR does / why we need it?
Add an mrope fusion op for qwen2.5-vl. This mrope operator doesn't
support Qwen3-VL currently, so it only takes effect for qwen2.5-vl.
Cherry-picked from 39b994a987888f7ba78df28b1ccb41a5e8d6eaf5

CI passed with existing test

Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
2025-10-25 11:41:23 +08:00
whx
0644113c35 [BugFix] cherry-pick PR 3736 to v0.11.0-dev (#3737)
This PR comments out the newly added VLM e2e test for the Ascend
scheduler scenario because it gets stuck when running with multiple
batches. The test needs to be added back once this issue is resolved.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 10:35:14 +08:00
whx
5a2c5be229 [BugFix][Cherry-pick] Cherry-pick PR 3675 to v0.11.0-dev (#3732)
This PR cherry-picks the bugfix related with running multi-modal models
with AscendScheduler to v0.11.0-dev

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-10-25 09:41:51 +08:00
hucong
12bc78d252 [v0.11.0][BugFix][P/D] Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache (#3686)
### What this PR does / why we need it?
Modify the recalculation logic to prevent waiting requests from filling
up the D node KVCache

Signed-off-by: underfituu <hzhucong@163.com>
2025-10-25 09:15:42 +08:00
ZYang6263
5c0a23f98b [0.11.0][Perf] Add fused matmul/reduce-scatter kernel for performance optimization. (#3725)
### What this PR does / why we need it?
This PR boosts performance by introducing a fused kernel for the matmul
and reduce-scatter operations. It supports both unquantized
(e.g., BFloat16) and W8A8 quantized models.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-25 08:20:43 +08:00
fems14
17dd9ae42c [0.11.0][bugfix]look up multi_tp key (#3699) (#3723)
### What this PR does / why we need it?
In multi-Tensor Parallel (TP) scenarios, the KV pool only queries the
first GPU card. When keys on other cards are released, the query result
still returns as successful, introducing accuracy issues. This PR
modifies the KV pool's query logic to check all cards, resolving this
problem.
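
The changed query logic can be sketched as follows, assuming a per-rank
existence check is available. `exists_on_all_ranks`, `lookup`, and the toy
`stores` mapping are hypothetical and stand in for the real KV pool
interface, which is not shown in this log.

```python
from typing import Callable, Dict, Set


def exists_on_all_ranks(key: str, tp_size: int,
                        exists_on_rank: Callable[[str, int], bool]) -> bool:
    """Return True only if every TP rank still holds the key."""
    return all(exists_on_rank(key, rank) for rank in range(tp_size))


# Toy per-rank stores standing in for the real KV pool.
stores: Dict[int, Set[str]] = {0: {"blk-a", "blk-b"}, 1: {"blk-a"}}


def lookup(key: str, rank: int) -> bool:
    return key in stores[rank]


hit_a = exists_on_all_ranks("blk-a", 2, lookup)  # present on both ranks
hit_b = exists_on_all_ranks("blk-b", 2, lookup)  # already released on rank 1
```

Checking only rank 0 would report `blk-b` as a hit even though rank 1 has
released it, which is the accuracy issue described above.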
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 18:22:45 +08:00
fems14
f0eb3e1d97 [v0.11.0][bugfix]kvpool sync load (#3698) (#3722)
### What this PR does / why we need it?
In certain scenarios, the performance of synchronously loading data from
the pool is better than that of asynchronously loading data. Therefore,
a control logic (or switch) for asynchronous loading from the pool has
been added.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 18:21:46 +08:00
何必问
33514a4cc2 [Bugfix] The server fails to locate the request, leading to the server hanging. (#3721)
### What this PR does / why we need it?
Fix a bug: in the mooncake pooling scenario, when the client closes the
request, the server fails to locate the request, leading to the server
hanging.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Bring up the PD-separated pooling service, send requests using aisbench,
press CTRL+C twice, and check whether the vllm_ascend service exits.

---------

Signed-off-by: linhebiwen <linhebiwen@gmail.com>
2025-10-24 17:41:29 +08:00
offline893
4e21b1537e [BugFix] Check all expert maps when using muilty instance. (#3662)
### What this PR does / why we need it?
Check all expert maps when using multiple instances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Qwen 235B in double A3.
Case 1: master has an expert map, slave has no expert map.
Case 2: master has an expert map, slave has an erroneous expert map.
Case 3: master has an expert map, slave has a correct expert map.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-24 17:10:31 +08:00
wangxiyuan
b321e3846a [cherry-pick]【main】patch sched_yield (#3648) (#3687)
### What this PR does / why we need it?
On Arm systems, os.sched_yield() does not take effect, so the GIL
(Global Interpreter Lock) is not released, resulting in CPU-bound
issues. This PR patches sched_yield in vLLM so that the process executes
time.sleep(0) instead to release the GIL.
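
A sketch of the patching approach, assuming the yield helper is reachable
as a module attribute. The `target` namespace below is a hypothetical
stand-in for the vLLM module being patched; the actual module path is not
given in this log.

```python
import time
import types


def sched_yield_via_sleep() -> None:
    """Yield by sleeping for zero seconds, which releases the GIL."""
    time.sleep(0)


# Hypothetical stand-in for the module whose sched_yield gets replaced.
target = types.SimpleNamespace(sched_yield=lambda: None)

# Monkey-patch: later calls through the namespace use the replacement.
target.sched_yield = sched_yield_via_sleep
target.sched_yield()
```

Code that imported the original function by name before the patch keeps
the old reference, so a patch like this has to run before those imports.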

Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>
2025-10-24 00:24:58 +08:00
Wang Yixuan
d0086d432a fix deepseek torchair recompile (#3679)
### What this PR does / why we need it?
PR #3624 fixed the precision of deepseek torchair but did not take into
account a limitation of torch compile, which results in recompilation.
This PR fixes that problem. The corresponding PR to main is #3678.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-23 22:53:13 +08:00
Slightwind
d2d19a4c3c [v0.11.0][bugfix] Add 'layer_type' param to get_pergroup_param() for compatibility (#3684)
Resolves a `TypeError: got an unexpected keyword argument 'layer_type'`.

A recent change (PR #3311) started passing the `layer_type` argument
when calling `get_pergroup_param()`. This specific implementation does
not use this parameter, causing the error.

This patch adds `layer_type=None` to the method signature to maintain
API compatibility and ignore the unused argument.
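
The signature change can be illustrated as below. The class name, the
other parameters, and the returned dictionary are hypothetical; only
`get_pergroup_param` and `layer_type=None` come from the description
above.

```python
from typing import Any, Dict, Optional


class ExampleQuantMethod:
    """Hypothetical quantization method that ignores layer_type."""

    def get_pergroup_param(self, input_size: int, output_size: int,
                           params_dtype: str,
                           layer_type: Optional[str] = None) -> Dict[str, Any]:
        # layer_type is accepted for API compatibility and otherwise unused.
        return {"shape": (output_size, input_size), "dtype": params_dtype}


method = ExampleQuantMethod()
old_style = method.get_pergroup_param(128, 64, "bfloat16")
new_style = method.get_pergroup_param(128, 64, "bfloat16", layer_type="row")
```

Both call styles now succeed, whereas the old signature raised a
`TypeError` for the second one.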

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-10-23 21:26:50 +08:00
liziyu
f3ea657e93 [0.11.0][Bugfix] fix delay free prefill req & D node support prefix cache (#3609)
### What this PR does / why we need it?
Fix mooncake connector. In scenarios where TP is not equal, when the
prefill TP size is less than the number of key-value heads,
_get_remote_tp_ranks_for_req will return a list of np.arrays. Performing
an operation like int in list of np.arrays will cause an error.
Converting the list of np.arrays into a single np.array resolves this
issue.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
qwen235B
P tp16, D tp1
P tp8, D tp1
P tp4, D tp1
P tp8, D tp2


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-10-23 20:39:35 +08:00
ZYang6263
6975d46627 [v0.11.0][Perf] Eliminating the zerolike operator through patch (#3632)
### What this PR does / why we need it?
There is a zero-like operator before the attention operation in each
decoding stage. After analysis, this operator can be eliminated. The
purpose of this PR is to remove this operator and improve performance.

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-23 14:49:28 +08:00
rjg-lyh
74903af460 [v0.11.0][refactor] refactor SequenceRowParallelOp forward (#3654)
### What this PR does / why we need it?
This PR refactors SequenceRowParallelOp forward. In order to further
expand the operator inclusion scope in dynamic judgment scenarios, this
PR customizes the entire matmul computation and communication as a
custom operator masking. With this refactor, it will support directly
writing code such as common operation fusion into the
SequenceRowParallelOp class's member function matmul_and_reduce, without
the need to register more redundant custom masking operators.

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-23 14:45:49 +08:00
Yizhou
54bd531db8 [v0.11.0][Fix] Fix attention metadata handling for profiling and MLA (#3636) (#3643)
### What this PR does / why we need it?
This is a port PR of #3636 .

Move the creation of dummy attention metadata to occur after the ACL
graph runtime mode is determined. This ensures the metadata is
initialized with the correct configuration during a profile run.

Additionally, remove the `attn_metadata` existence check before updating
MLA attention parameters. This change prevents the update from being
skipped when metadata is not yet available, ensuring parameters are set
correctly.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-23 10:29:30 +08:00
whx
6464c97ff9 [BugFix][v0.11.0] Fix quantization related mtp bug with patch (#3619)
vLLM 0.11.0 does not include PR
(https://github.com/vllm-project/vllm/pull/25805), so the prefix of
mtp's SharedHead is missing. This PR fixes this bug with a patch to
vllm's deepseek_mtp.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-22 23:06:09 +08:00
Zetong Li
6e72bfdc50 [v0.11.0] cherry-pick Fix performance degradation when mtp>1 (#3597) (#3630)
### What this PR does / why we need it?
cherry-pick Fix performance degradation when mtp>1 (#3597)

This PR aims to fix performance degradation when mtp>1. Since mtp>1 may
result in more tokens (i.e. larger batch size) than acl graph maximum
batch size, this will cause draft model to run in eager mode.

### How was this patch tested?
by ci

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-10-22 22:07:39 +08:00
zouyida2052
a989fef5de unify logic between aclgraph and torchair (#3602)
### What this PR does / why we need it?
unify logic between aclgraph and torchair. This is a cherry-pick of https://github.com/vllm-project/vllm-ascend/pull/3560

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-22 21:55:06 +08:00
Wang Yixuan
edccd46d74 fix deepseek torchair precision (#3635)
### What this PR does / why we need it?
The precision of deepseek torchair was broken by #3465, due to the original patch of rmsnorm in torchair. This PR fixes the precision of deepseek torchair.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-22 20:20:32 +08:00
Yizhou
984efdc0d0 [v0.11.0][Fix] Fixes attribute error in MLA implementation (#3617)
### What this PR does / why we need it?
Corrects the attribute access for retrieving the device from `q_a_proj`
to `q_proj`. This prevents an `AttributeError` as `q_a_proj` does not
exist on the class instance.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Need MLAPO tests.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-22 15:49:18 +08:00