### What this PR does / why we need it?
The first letter of the English title should be capitalized
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
### What this PR does / why we need it?
When we are using `Ascend scheduler`, the param `max_num_batched_tokens`
should be larger than `max_model_len`, otherwise, will encountered the
follow error:
```shell
Value error, Ascend scheduler is enabled without chunked prefill feature. Argument max_num_batched_tokens (4096) is smaller than max_model_len (32768). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'model_co...g': {'enabled': True}}}), input_type=ArgsKwargs]
```
### Does this PR introduce _any_ user-facing change?
Users/Developers who running the model according to the
[tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node.html),
the parameters can be specified correctly.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
For multi-node CI system, we need to ensure that cluster resources meet
the expected specifications before conducting multi-node
interoperability tests. Otherwise, unexpected errors may occur (for
example, we might mistakenly assume all nodes are ready and perform a
global cluster IP acquisition, which would cause an exception to be
thrown in Python because some nodes might not actually be ready at that
point). Therefore, we need to wait at the workflow level until all
resources meet the expected specifications.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
add single node PD disaggregation instructions for Qwen 2.5VL model.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: mazhixin <mazhixin7@huawei.com>
Signed-off-by: mazhixin000 <mazhixinkorea@163.com>
Co-authored-by: mazhixin <mazhixin7@huawei.com>
### What this PR does / why we need it?
fix import error for get_ip() in vllm main branch
### Does this PR introduce _any_ user-facing change?
N
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: pz1116 <zpbzpb123123@gmail.com>
### What this PR does / why we need it?
In [#26016](https://github.com/vllm-project/vllm/pull/26016), vllm
change the `cudagraph_capture_sizes` to be in ascending order. This PR
fixes related issues caused by this.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Eplb Verify Fix
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR is used to fix mooncake_connector in pcp/dcp case. When
executing function update_done_task_count, it is necessary to ensure
that both pcp/dcp and TP ranks have finished transferring KV cache.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
### What this PR does / why we need it?
The current community lacks unit tests (UT) for files such as
torchair_worker, mtp_proposer, and model_runner. Therefore, UT coverage
for these files needs to be added.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>
### What this PR does / why we need it?
This PR adds a load-balance dp proxy server which can be used in
external DP scenario without Disaggregated-Prefill enabled. What's more,
add a doc of external dp and load-balance dp proxy server.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
See the new doc.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
For better usability, add multimodal audio to vllm compiling in
dockerfile defaultly.
Image size will increase only 2.xM.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.11.0
vLLM main:
2918c1b49c
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Ting FU <futing10@huawei.com>
### What this PR does / why we need it?
Add error log for VL models when enabling
`VLLM_ASCEND_ENABLE_FLASHCOMM1=1` or `VLLM_ASCEND_ENABLE_FLASHCOMM=1`
(for backward compatibility).
This is a temporary fix for
https://github.com/vllm-project/vllm-ascend/issues/4132.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Add information on the scope of EPLB support.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Redundant experts bugfix
### Does this PR introduce _any_ user-facing change?
After configuring the path for experts_map, users do not need to
configure iinit_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Support the Qwen3-Next-80B-A3B-Instruct quantization model and Fix the
NZ issue. Triton kernel doesn't support data format nz, thus we skip
converting weight to nz on layer `conv1d`
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: IncSec <1790766300@qq.com>
### What this PR does / why we need it?
Add ACL graph capture/replay DP test, this is a imprved version of #3886
Restructures the multi-card ACL graph test for improved clarity,
robustness, and accuracy.
Key improvements include:
- Replaces fragile `sys.settrace` and manual patching with a clean,
reusable spy installer using `unittest.mock.patch`.
- Introduces more precise metrics by tracking
`NPUModelRunner.execute_model` and `_dummy_run` calls directly.
- Rewrites assertions to be more accurate and provides clear
explanations for the expected counts of graph captures, replays, model
executions, and dummy runs.
- Simplifies the overall test structure by separating the worker logic
into a dedicated function.
- Removes a long, unnecessary sleep at the end of the test.
- Expands test coverage by adding a larger `max_tokens` parameter.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Currently, the MTP model still runs in eager in full graph mode. This PR
adapts the MTP with the full graph capture and execution. When the graph
mode is set to "FULL_DECODE_ONLY", the MTP will run in full-graph to
improve the performance.
The change in both disable_padded_drafter_batch is True and False case
include:
1. Add _mtp_graph_params in acl_graph.py to isolate the data of main
model and the data of MTP.
2. Padding some metadata in mla_v1.py when in fullgraph mode.
3. Fixed the essential data address that will be used in model.forward.
4. Adapted according to the aclgraph capture framwork:
1). Rebuild MTP model with ACLGraphWrapper.
2). Add common attn metadata when start capture in MTP dummy_run.
3). Add common attn metadata update in MTP.
4). Addapted data update when num_speculative_tokens > 1.
5. Add a patch of MTP to adapt vllm v0.11.0.
Existing Issues:
1. When disable_padded_drafter_batch=True and running in FullGraph mode,
the data of the first-round requests in MTP is abnormal. We need to
identify the cause subsequently.
2. When disable_padded_drafter_batch=False and running in FullGraph
mode, the acceptance rate of the second and third tokens will decrease
(For example, if we set the num_speculative_tokens=3, the acceptance
rate of first token is 90%, the second is only 50% lower than 60%, the
third is only 20% lower than 30%). The reason is that the data processed
after the model runs does not match. This is a problem from another PR.
It works fine in eager and PIECEWISE mode, but has problem in FullGraph
mode. Once we have a solution, we will submit a bugfix.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
### What this PR does / why we need it?
add mla_v1.py and mla.py ut
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`pytest tests/ut/attention/test_mla_v1.py`
`pytest tests/ut/models/test_mla.py`
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Add tests for the multi-node DeepSeek-V2-Lite network in GE Graph mode,
and supplement the end-to-end (e2e) tests for the MLA and NZ features of
this network.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>
### What this PR does / why we need it?
avoid mrope fusion op when running qwen2.5-vl on a+x machine
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Test text VQA accuracy on G8600 with aisbench
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
### What this PR does / why we need it?
Add some of the pitfalls I ran into when using AISBench to test
multi-modal models.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
Add the parameter "register_buffer" for PD Aggregated Scenario in the
given example.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
remove get_metadata_cls. It's only used for V0 engine and has been removed from vLLM already.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Fix pcp + mtp bug while using acl graph.
While using pcp + mtp, we need to flatten block_table to avoid irregular
attn mask shape, this was done in mla attn_metadata builder, but we
found out that this influences block_table address and leads to
incorrect results while enable acl graph.
To fix this, we enlarge block_table buffer size and flatten block_table
in model_runner prepare_inputs, so this will not influence block_table
address.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
### What this PR does / why we need it?
Past:
npu_moe_gating_top_k can only support 'group_count=256' pattern
Now:
1、npu_moe_gating_top_k support all size of group_count
2、the functionality of `torch_npu.npu_moe_gating_top_k_softmax` are
included in `torch_npu.npu_moe_gating_top_k`
CANN: depends on 8.3.RC1
Performance:
1. GLM4.5-w8a8, TPS improve 6%
2. Qwen3, the same as before
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: 1092626063 <1092626063@qq.com>
### What this PR does / why we need it?
Sorts aclgraph batch sizes in ascending order, corresponding to vLLM
[#26016](https://github.com/vllm-project/vllm/pull/26016)
Ensures batch sizes for aclgraph are sorted ascending when aclgraph mode
is enabled, improving consistency and compatibility with later logic
that may depend on order.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Waiting for #3886
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
fix proxy when host ip using domain name
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
pd proxy support ipv6, mooncake connector check whether the IPv6 address
is used and notify the user.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4 ,8
,16 ,24 ,... , max_capture_size]`. However, this is not always the best
choice on different situations. This PR aims to change the default
setting when running Qwen3-MoE on full dp (`dp_size > 1` && `tp_size ==
1`) setting, which is usually applied in Large-Scale EP.
old :
`[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]`
new:
`[1, 2, 5 ,10 ,15, 16 ,24 ,... , max_capture_size]`
This is mainly because the performance of `_npu_paged_attention` op
degrades dramatically on old settings. We hope to provide better
performance if users do not set specific `cudagraph_capture_size`.
### Does this PR introduce _any_ user-facing change?
The default `cudagraph_capture_size` is modified in above cases.
However, if `cudagraph_capture_size` has already set by users, this PR
won't have any influence on this.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Fix moe error when sp chunked the hidden_states by disabling sp by a hacky way
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
we often install vllm-ascend in developer mode, which has no _build_info
module. it will raise error in `utils.is_310p` and
`utils.sleep_model_enabled`, then we need to modify these two function.
### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
This PR update the prefixcache threshold for qwen3-32b-int from 0.4 to
0.8, as the baseline has been improved.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
The current library only supports the FullDecodeOnly graph mode, which
enables full graph execution during the decode. This PR extends support
to allow full graph execution in both the prefill and decode, referred
to as FULL graph mode.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
[CI] Fix no space left in build wheel CI.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
- skip the nightly image build when the github event is pull_request
- set imagepullpolicy as alway for multi_node test
- move multi_node tests ahead to have some resource clean first
- do not relevant nightly image build with nightly tests for tolerance
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add benchmark results into `.gitignore`
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Add import_kernels interface to avoid import useless vLLM C library
Closes#3488. Reopen#3498 for CI.
### How was this patch tested?
CI tested.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
Introduces the `acl_graph_print` function to enable printing debug
information from code running inside an ACL graph, such as custom
operators.
This works by launching a host function on a dedicated stream, bypassing
the limitations of standard `print` within compiled graph execution. The
implementation handles the necessary stream subscriptions and ensures
they are properly unregistered upon exit.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and EPLB
scenario
### Does this PR introduce _any_ user-facing change?
no
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
### What this PR does / why we need it?
1、qwen GQA attention_v1 optim
2、DeepSeek MLA refactor, all gather q -> all gather kv
3、modelrunner refactor for chunk prefill, we remove some code not use
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
### What this PR does / why we need it?
Given the current excessively long build time of our nightly-ci, I
recommend installing necessary, confirmed versions of packages in the
Docker image to reduce the time required for integration testing.
Including Mooncake vllm with fixed tags, This is expected to reduce
nightly-ci duration by 2 hours.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>