Commit Graph

1442 Commits

Author SHA1 Message Date
wangxiyuan
1eb5295a1b remove qwen3-next model file (#4573)
Let's remove the qwen3-next model file for now. We'll support it later by
using vLLM's original model file.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 18:37:26 +08:00
Nengjun Ma
a3041cd78c [Bugfix] fix dp parallel + tp > 1 offline inference port conflict (#4539)
### What this PR does / why we need it?
Fix the port conflict that occurs in offline inference with DP parallel + TP > 1.

Related PR: https://github.com/vllm-project/vllm-ascend/pull/429
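For context, a minimal offline sketch of the setup this fix targets; the model name is a placeholder and it assumes `tensor_parallel_size` and `data_parallel_size` are accepted as engine arguments by the offline `LLM` entry point:

```python
from vllm import LLM, SamplingParams

# DP=2 together with TP=2 in one offline script; before this fix the DP ranks
# could collide on the same coordination port.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    tensor_parallel_size=2,
    data_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
```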


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-29 18:37:11 +08:00
wangxiyuan
1874265074 Move mla to ops module (#4575)
Move the MLA custom op to the correct module.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 18:36:55 +08:00
Shanshan Shen
2a19215e5f [MM][Model] Remove Qwen2-VL modeling files (#4534)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-29 18:07:01 +08:00
wangxiyuan
6664a4e5ce improve soc version (#4522)
Make SOC_VERSION readable for users. Users can now simply set
"910b", "910c", or "310p".


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 17:42:16 +08:00
wangxiyuan
f10acddb78 drop ascend scheduler (#4498)
The Ascend scheduler was added for the non-chunked-prefill case before, since
the NPU ops didn't work well with chunked prefill.

Now that the ops work better with chunked prefill, it's time to remove the
Ascend scheduler and use vLLM's default scheduler.

- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 16:18:34 +08:00
liziyu
53a52d6614 [P/D] [bugfix] add get_kv_connector_handshake_metadata func for 0.11.2 (#4567)
### What this PR does / why we need it?
add get_kv_connector_handshake_metadata func for 0.11.2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-29 16:09:45 +08:00
LI SHENGYONG
0151022ab8 [bugfix] dep ineffective (#4417)
### What this PR does / why we need it?
The expert mapping table and weights of the dynamic EPLB were not
updated, so accuracy stayed correct but the EPLB never actually took
effect. This bug has now been fixed.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-29 15:18:29 +08:00
wangxiyuan
8ebbf13c1a Update triton package name (#4563)
Add the `aarch64` suffix to make sure the package name is correct.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 15:00:40 +08:00
Ting FU
b747c95cfa [Doc] Add single NPU tutorial for Qwen2.5-Omni-7B (#4446)
### What this PR does / why we need it?
Add single NPU tutorial for Qwen2.5-Omni-7B

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-29 11:57:29 +08:00
Ting FU
9af34755ff [Bugfix] Fix model run _npu_flash_attention hang issue (#4410)
Fix the hang when the model runs `_npu_flash_attention` in
`_forward_prefill_no_cache`; it was caused by a wrong attention mask dtype.
### How was this patch tested?
Tested on Qwen2.5-VL and Qwen2.5-Omni.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-29 09:20:22 +08:00
wangxiyuan
048d350f9e update triton package url (#4552)
The Triton package URL is not correct. This PR fixes it.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-28 21:00:49 +08:00
shiyuan680
1c4a0468ee 【OPS】qwen3-next support triton chunk_gated_delta_rule ops (#4070)
### What this PR does / why we need it?
Add support for the triton chunk_gated_delta_rule ops for qwen3-next.

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-28 20:55:43 +08:00
fems14
5447a039b9 [Feature][main]reconstruction kvpool connector to ascend connector (#4438)
### What this PR does / why we need it?
1. In short, we renamed the existing MooncakeStoreConnector to
AscendStoreConnector and extracted the storage-engine interaction logic
into a new Backend class.
Associated RFC: https://github.com/vllm-project/vllm-ascend/issues/4329
2. Fixed the incorrect number of input parameters for the connector,
introduced in vLLM 0.11.2.
### Does this PR introduce _any_ user-facing change?
MooncakeStoreConnector is renamed to AscendStoreConnector.
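For example, selecting the renamed connector offline might look like the sketch below (assuming vLLM's standard `KVTransferConfig` mechanism; the model name and `kv_role` value are illustrative):

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# The connector formerly selected as "MooncakeStoreConnector" is now
# selected as "AscendStoreConnector".
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="AscendStoreConnector",
        kv_role="kv_both",
    ),
)
```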
### How was this patch tested?

- vLLM version: v0.11.2

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-11-28 18:08:37 +08:00
Chenxi Qian
554f16ae1f [Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804)
### What this PR does / why we need it?

This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.

Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`

This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).
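For illustration only, a hedged sketch of how such an op could be invoked once bound to `torch.ops._C_ascend`; the registered op name, argument order, and shapes below are assumptions, not the actual interface, and the call requires vllm-ascend's compiled custom-op extension.

```python
import torch

def grouped_matmul_swiglu_quant(x, weights, weight_scales, group_list):
    # Hypothetical registered name and argument order -- the real signature is
    # defined by the compiled custom-op library bound to torch.ops._C_ascend.
    # Note that `weights` and `weight_scales` are plain Python lists of tensors
    # (at::TensorList on the C++ side), as described above.
    return torch.ops._C_ascend.grouped_matmul_swiglu_quant_weight_nz_tensor_list(
        x, weights, weight_scales, group_list
    )
```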

### Does this PR introduce _any_ user-facing change?

No.


- vLLM version: v0.11.2

---------

Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
2025-11-28 18:06:39 +08:00
herizhen
3199fe8350 [Doc]Delete equals sign (#4537)
### What this PR does / why we need it?
Delete equals sign in doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-28 17:09:26 +08:00
wangxiaoteng888
366d2d95e8 [P/D] Add readme for PD separation (#4182)
### What this PR does / why we need it?
Add readme for PD separation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-11-28 15:17:59 +08:00
Shanshan Shen
e52ebf8674 [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention (#4349)
### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to CPU (compatible with npu FA); see the
sketch after this list.
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at https://github.com/vllm-project/vllm/pull/29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for https://github.com/vllm-project/vllm/pull/28798.
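As a rough illustration of the `cu_seqlens` checklist item above, converting the cumulative offsets to per-sequence lengths and moving them to CPU might look like this (the tensor values are made up):

```python
import torch

# Cumulative sequence offsets as used by the vision attention, e.g. three
# images with 4, 6 and 2 patches respectively.
cu_seqlens = torch.tensor([0, 4, 10, 12], dtype=torch.int32)

# Interval (per-sequence length) form, moved to CPU for the NPU FA path.
seq_lens = (cu_seqlens[1:] - cu_seqlens[:-1]).cpu()
print(seq_lens)  # tensor([4, 6, 2], dtype=torch.int32)
```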

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark


- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-28 14:23:00 +08:00
LHXuuu
bdc66972db [Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vllm.
2. Support CompressedTensorsW8A8 static weight (see the sketch after this
list).
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
3. Support CompressedTensorsW8A8Dynamic weight.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic.
4. Modify the override_quantization_method in AscendQuantConfig.
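A minimal sketch of the two quantization schemes described in items 2 and 3 (symmetric int8 math only; not the actual kernel code):

```python
import torch

def quant_weight_per_channel(w: torch.Tensor):
    # Weight: per-channel (one scale per output channel), int8, symmetric.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def quant_act_per_tensor(x: torch.Tensor, scale: torch.Tensor):
    # W8A8 static: activation uses a single pre-computed per-tensor scale.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def quant_act_per_token(x: torch.Tensor):
    # W8A8 dynamic: activation scale is computed per token (row) at runtime.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8), scale
```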

Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
2025-11-28 14:09:39 +08:00
SILONG ZENG
ab37a7d5ae [main]Upgrade cann to 8.3rc2 (#4350)
### What this PR does / why we need it?
Upgrade cann to 8.3rc2

### Does this PR introduce _any_ user-facing change?
Yes, the docker image will use CANN 8.3.RC2.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-11-28 14:06:01 +08:00
Zhu Yi Lin
755b635844 [TEST] Add eagle proposer ut (#4447)
### What this PR does / why we need it?
Add eagle proposer ut

- vLLM version: v0.11.2

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-27 21:59:31 +08:00
Slightwind
9fdabb7b60 [feature] Add Custom Op grouped_matmul_swiglu_quant (#4431)
This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer
to simplify the invocation of `aclnn` operators on Ascend NPUs.

**Key Changes:**
* **Adapter Layer:** Added `EXEC_NPU_CMD` macro and related dependencies
to standardize `aclnn` calls.
* **Operator Support:** Integrated `grouped_matmul_swiglu_quant` as a
reference implementation to demonstrate the usage of the new macro.

---


- vLLM version: v0.11.2

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-11-27 21:56:18 +08:00
Nengjun Ma
89a1a65300 [bugfix] fix ray start failed: local_world_size cannot little than visible device count error (#4457)
### What this PR does / why we need it?
Fix the Ray startup failure: "local_world_size cannot be less than
visible device count". See issue #4456 for details.

The fix is copied from the corresponding vLLM fix, PR:
[#28873](https://github.com/vllm-project/vllm/pull/28873)


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-27 21:18:32 +08:00
drslark
1cae3e4a49 [BugFix] Adapted Qwen3-Next eager mode to v0.11.2 (#4477)
### What this PR does / why we need it?

Adapted Qwen3-Next eager mode to `v0.11.2`.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-27 17:44:59 +08:00
Li Wang
b220de33e8 [CI][Nightly] Support local debugging for multi-node CI test cases (#4489)
### What this PR does / why we need it?
This patch mainly does the following:
1. Make k8s/lws optional for multi-node testing, allowing developers to
run multi-node tests locally by explicitly passing in the IP addresses of
all nodes.
2. Allow passing a custom proxy script path in the config file to load
the proxy.

- vLLM version: v0.11.2

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-11-27 17:20:29 +08:00
zzzzwwjj
1fd56b1106 chip type judgement code optimization (#4485)
### What this PR does / why we need it?
| | CPU environment | NPU environment |
|---|---|---|
| `SOC_VERSION` set | Check whether `SOC_VERSION` is in the `soc_to_device` dict; if not, raise an error that the current chip type is not supported. | Print a warning log when `SOC_VERSION` differs from the chip type reported by `npu-smi`; otherwise same as the CPU case. |
| `SOC_VERSION` not set | Raise an error that `SOC_VERSION` is required when compiling in a CPU environment. | Use the chip type from `npu-smi` to compile vllm-ascend. |
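A rough Python sketch of the decision flow in the table above (the `soc_to_device` entries and the `npu-smi` helper here are placeholders, not the real implementation in `setup.py`):

```python
import os

soc_to_device = {"910b": "910b", "910_93": "910_93", "310p": "310p"}  # illustrative entries only

def query_chip_type_from_npu_smi() -> str:
    # Placeholder: the real code shells out to `npu-smi` to read the chip type.
    return "910b"

def resolve_soc_version(on_npu_env: bool) -> str:
    soc = os.environ.get("SOC_VERSION")
    if soc is None:
        if not on_npu_env:
            raise RuntimeError("SOC_VERSION must be set when compiling in a CPU environment")
        return query_chip_type_from_npu_smi()
    if soc not in soc_to_device:
        raise RuntimeError(f"Cannot support current chip type: {soc}")
    if on_npu_env and soc != query_chip_type_from_npu_smi():
        print(f"warning: SOC_VERSION={soc} differs from the chip type reported by npu-smi")
    return soc
```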

### Does this PR introduce _any_ user-facing change?

The env `SOC_VERSION` must now be set when compiling in a CPU environment.

### How was this patch tested?


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-27 17:18:49 +08:00
zhangxinyuehfad
84d7f5a10d [UT] Fix ut test (#4472)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-26 21:37:47 +08:00
herizhen
d252e36ae8 Change comment location (#4432)
### What this PR does / why we need it?
When running 'python example.py', connection issues often occur. The
solution is to comment out the first line of the code.
Complete the specific names of the A2 and A3 machines.
Standardize the document format: a space should be added after the colon.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.2

---------

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-26 16:13:31 +08:00
zzzzwwjj
136ea9ff56 [refact] unified soc_version code (#4359)
### What this PR does / why we need it?

Currently, there are two paths to determine the chip type in code:
`get_ascend_soc_version` uses the `get_soc_version` API in torch_npu, and
`is_310p` uses `_build_info.__soc_version__`, which is generated at
install time. We need to unify the two paths.

We need to unify these codes based on the following points:

1. We need to ensure consistency in chip type judgment between the compiling
and running states;
2. In the compiling state, we need the chip type to complete op compilation,
but in the running state, we only need the device
type (910B/910_93/310P/910_95/etc.) to make code-branch decisions;
3. In the compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's API.

Based on the above points, we have made the following changes:

1. When the user sets the env `SOC_VERSION`, use it; when not set, query
the soc_version via `npu-smi`;
2. Generate the device_type based on the soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` in `_build_info.py`;
3. In the running state, use `__device_type__` to choose the code branch.
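At runtime, the branch check then reduces to reading the generated constant, roughly like this (the module path and the recorded value are assumptions):

```python
# Generated into vllm_ascend/_build_info.py at build time, e.g.:
#   __device_type__ = "310p"
from vllm_ascend import _build_info  # assumed module path

def is_310p() -> bool:
    # Branch on the device type recorded at compile time instead of querying
    # torch_npu's soc_version API at runtime.
    return _build_info.__device_type__ == "310p"
```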

### Does this PR introduce _any_ user-facing change?

When the env `SOC_VERSION` is not set, it will no longer default to `ASCEND910B1`;
we will query the soc_version via `npu-smi`. And the env `SOC_VERSION` must be in
the `soc_to_device` mapping in `setup.py`.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-26 14:28:55 +08:00
wangxiyuan
a91e76cd84 [CI] clean up ci (#4452)
1. Run the 4-card test only when the single-card and 2-card tests pass.
2. Rename files to make them clearer.
3. Remove the useless PD workflow; it is already covered by the nightly
test.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-26 14:07:56 +08:00
wangxiyuan
bc69d7cfe1 upgrade to vllm 0.11.2 (#4400)
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
https://github.com/vllm-project/vllm/pull/26866
2. get_mrope_input_positions is broken by
https://github.com/vllm-project/vllm/pull/28399
3. graph mode is broken by
https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
https://github.com/vllm-project/vllm/pull/27583
5. `get_attn_backend_cls` and the attention backend are broken by
https://github.com/vllm-project/vllm/pull/28534
6. spec decode is broken by
https://github.com/vllm-project/vllm/pull/28771
7. sp feature is broken by
https://github.com/vllm-project/vllm/pull/27126
8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922
9. lora is broken by https://github.com/vllm-project/vllm/pull/21068
10. execute_model is broken by
https://github.com/vllm-project/vllm/pull/26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
https://github.com/vllm-project/vllm/pull/28159
12. kv cache is broken by https://github.com/vllm-project/vllm/pull/27753
13. dp is broken by https://github.com/vllm-project/vllm/pull/25110

 
What's broken and changed on our side:
1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
https://github.com/vllm-project/vllm/pull/28733 We'll remove the Ascend
scheduler later.
4. qwen3-next is broken by
https://github.com/vllm-project/vllm/pull/28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764.
We'll remove model files in the future

Known issues:
1. Ray doesn't work.
2. The accuracy of qwen3-next is not correct.
3. qwen3-vl is broken.
4. prefix cache + Ascend scheduler + DeepSeek-V2-Lite is broken.

Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: shen-shanshan <467638484@qq.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-11-26 11:48:58 +08:00
shiyuan680
d5f77f14d0 mkdir triton package and move triton files (#4420)
### What this PR does / why we need it?
mkdir triton package and move triton files

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-26 11:06:12 +08:00
Zhu Yi Lin
1b137d6b1b [TEST] Delete Comment (#4427)
### What this PR does / why we need it?
Delete useless comments.
### Does this PR introduce _any_ user-facing change?
No

- vLLM main:
2918c1b49c

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-25 21:39:04 +08:00
wangxiyuan
98031653df [misc] Remove useless patch_logits (#4252)
Torch-npu 2.7.1 has fixed the device check bug. This patch can be
removed now.

- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-25 21:25:54 +08:00
Shanshan Shen
4864909648 [MM][Bugfix] Minor fix for VL model verification (#4384)
### What this PR does / why we need it?

To fix the ops test, where `model_config` has been set to `None` and therefore
doesn't have an `hf_config` attribute, we have added a check for `model_config`
to guarantee it is not `NoneType`.

- vLLM main:
2918c1b49c

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-25 20:36:16 +08:00
Zhijun Chen
463910e686 [Bugfix] use module-level import for patched function in Qwen3Next (#4354)
### What this PR does / why we need it?

**Problem**: The Qwen3Next model implementation currently imports
chunk_gated_delta_rule directly using `from ... import ...`

In frameworks like `verl`, the model file is often imported before
`vllm-ascend` initializes and applies its patches. This causes the model
to permanently hold a reference to the original (unpatched) vLLM kernel,
resulting in execution errors on Ascend devices even if the patch runs
later.

**Solution**: Changed the import style to `from vllm...ops import chunk`
and call `chunk.chunk_gated_delta_rule()`.

This ensures that the function lookup happens at runtime (dynamic
dispatch), allowing the model to correctly pick up the patched function
regardless of import order.
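A self-contained toy illustration of the difference (using a stand-in module rather than the real vLLM ops module):

```python
import types

# Stand-in for the real ops module that vllm-ascend patches at init time.
ops = types.ModuleType("ops")
ops.chunk_gated_delta_rule = lambda: "original kernel"

# Early binding ("from ... import ..."): the symbol is captured once,
# before the platform patch is applied, and stays stale afterwards.
early_bound = ops.chunk_gated_delta_rule

# vllm-ascend later replaces the module attribute with its own kernel.
ops.chunk_gated_delta_rule = lambda: "ascend kernel"

print(early_bound())                 # original kernel  -> stale reference
print(ops.chunk_gated_delta_rule())  # ascend kernel    -> resolved at call time
```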

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zjchenn <zjchenn@gmail.com>
2025-11-25 20:15:43 +08:00
SILONG ZENG
941d54a2ce [bugfix]Return the Transformer version from 4.57.2 to 4.57.1 (#4423)
### What this PR does / why we need it?
This PR pins the transformers dependency to 4.57.1.

Reason: CI tests (specifically test_completion_with_prompt_embeds.py)
are failing with an AttributeError: 'dict' object has no attribute
'model_type' when using newer versions of transformers.

The issue stems from a bug in tokenization_utils_base.py where the code
attempts to access the model_type field of a configuration dictionary
(_config) using dot notation (_config.model_type) instead of dictionary
key lookup (_config["model_type"] or _config.get("model_type")). This
occurs in the logic block checking for transformers_version <= 4.57.2.

Pinning the version to 4.57.1 bypasses this buggy code path and restores
CI stability.

Error Traceback:
``` shell
/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2419: 
if _is_local and _config.model_type not in [
E   AttributeError: 'dict' object has no attribute 'model_type'
```

- vLLM main:
2918c1b49c

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-11-25 15:32:24 +08:00
欧派果奶我还要
31a2c09e79 [Bugfix] fix patch typo (#4351)
### What this PR does / why we need it?
Fix a bug caused by this PR:
https://github.com/vllm-project/vllm-ascend/pull/4223
The bug makes
vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py apply
the patch in the wrong way.

### How was this patch tested?
Tested on a single node. When the environment variable DYNAMIC_EPLB is set to
true, the patch works correctly. When it's set to false, the patch is
not applied.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
2025-11-25 15:13:06 +08:00
herizhen
e945e91933 Document error correction (#4422)
### What this PR does / why we need it?
The "g" at the beginning of the current sentence is redundant and needs
to be deleted.
"MindIE Turbo" no longer needs to be displayed.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM main:
2918c1b49c

---------

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-25 14:21:13 +08:00
wujinyuan1
06f6cc1c81 [Bugfix]Fix the hang issue of multimodal model when running with DP>1 (#4392)
### What this PR does / why we need it?
When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run
process is triggered. When calling the update_attn_params function,
the num_tokens parameter needs to be passed, and this value was obtained
from positions.shape[0]. However, multimodal models use mRoPE
(multi-dimensional rotary positional embeddings), which makes `positions`
a 2-D tensor. As a result, the value obtained from
positions.shape[0] is incorrect. We solve this problem by replacing
positions.shape[0] with num_tokens.
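A small sketch of the shape issue (the 3-row mRoPE layout, i.e. temporal/height/width sections, is an assumption here):

```python
import torch

num_tokens = 8

# Text-only models: positions is 1-D, so shape[0] equals the token count.
positions_text = torch.arange(num_tokens)

# Multimodal models with mRoPE: positions is 2-D (rope sections x tokens),
# so shape[0] is the number of rope sections, not the token count.
positions_mrope = torch.zeros(3, num_tokens, dtype=torch.long)

assert positions_text.shape[0] == num_tokens
assert positions_mrope.shape[0] == 3  # hence num_tokens must be passed explicitly
```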

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
2025-11-25 09:33:49 +08:00
dependabot[bot]
84eae97f27 Bump actions/checkout from 4 to 6 (#4380)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.

- vLLM main:
2918c1b49c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-25 09:05:11 +08:00
Tjh-UKN
00ea61ec88 [feature] vllm-ascend support msprobe (eager mode dump) (#4241)
### What this PR does / why we need it?
vllm-ascend needs to dump data during model execution to debug some
precision problems. msprobe provides the corresponding abilities, so
integrating msprobe into vllm-ascend makes debugging easier.

### Does this PR introduce _any_ user-facing change?
```
'dump_config': '/path/to/config.json'
```
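For reference, a hedged sketch of how this option might be supplied offline; the `additional_config` plumbing and the model name are assumptions, only the `dump_config` key comes from the snippet above.

```python
from vllm import LLM

# Assumption: the msprobe dump configuration is passed through the
# platform-specific additional_config dictionary.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    additional_config={"dump_config": "/path/to/config.json"},
)
```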



- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Tjh-UKN <2559659915@qq.com>
2025-11-24 21:58:31 +08:00
weichen
5b1a7514eb [Bugfix][MoE] enable force_load_balance in aclgraph (#4366)
### What this PR does / why we need it?
Temporarily fix the OOM issue; we will align with vLLM's plan later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e&ut

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-11-24 20:33:56 +08:00
weijinqian0
ae068a3342 [Refactor] remove moe type of multicast. (#4224)
The main purposes of this PR are as follows: 
1. Remove the multicast-related code; 

Reason:
1. In scenarios like the A2 dual-system back-to-back networking, the
performance is worse than all_gather. Before the modification, in the e2e
test, it was 3 tps; after the modification, it is 10 tps.
2. At the same time, we usually enable the SP feature, which is consistent
with the current logic.
3. The advantage of broadcast communication is that it
does not suffer from uneven DP load and does not require the prefill ACL
graph to be enabled. But we recently added support for the prefill ACL graph.

So we think there is no need to keep multicast as an option for
MoE communication.

Performance benefits are as follows:
Without enable_flashcomm1, TTFT remains relatively stable at around
43000 ms, which is approximately 15000 ms faster than before the
modification.

With enable_flashcomm1, there is no difference; TTFT remains relatively
stable at around 29000 ms.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian0 <1184188277@qq.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-11-24 17:32:37 +08:00
XiaoxinWang
5508a602ed [Fix] fix aclgraph e2e test. (#4131)
### What this PR does / why we need it?
Due to the inconsistency between the attention operators used in eager
mode and graph mode, the accumulation order of the operators cannot be
guaranteed to be deterministic. Therefore, we modify the test to compare
against fixed expected outputs.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-24 17:22:03 +08:00
wangxiyuan
a1f142b7ad Drop 0.11.0 support (#4377)
There is a lot of hack code for v0.11.0, which makes the code hard to
upgrade to a newer vLLM version. Since v0.11.2 will be released soon, let's
drop v0.11.0 support first. Then we'll upgrade to v0.11.2.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-24 17:08:20 +08:00
jiangyunfan1
41ddb06554 [TEST]Update deepseek mtpx acc cases standard (#4321)
### What this PR does / why we need it?
This PR updates the accuracy standard for the deepseek mtpx cases, according
to the internal standard.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-11-24 16:43:29 +08:00
Yizhou
97999347c8 [Fix] Remove unnecessary NPU synchronization in MTP proposer (#4325)
### What this PR does / why we need it?
Remove unnecessary NPU synchronization in the MTP proposer to improve
performance.

Removing this synchronization point improves pipeline efficiency by
allowing for better overlap between CPU and NPU operations. A more
proper one is already implemented in #4233

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-24 14:07:10 +08:00
herizhen
8c87a3b053 Change the first letter to uppercase (#4375)
### What this PR does / why we need it?
The first letter of the English title should be capitalized.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-24 12:18:24 +08:00
Li Wang
b5f7a83927 [Doc] Upgrade multi-node doc (#4365)
### What this PR does / why we need it?
When we are using the `Ascend scheduler`, the param `max_num_batched_tokens`
should be no smaller than `max_model_len`; otherwise, you will encounter the
following error:
```shell
Value error, Ascend scheduler is enabled without chunked prefill feature. Argument max_num_batched_tokens (4096) is smaller than max_model_len (32768). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'model_co...g': {'enabled': True}}}), input_type=ArgsKwargs]
```
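For a concrete offline example, a sketch that keeps the two parameters consistent (engine-argument names follow standard vLLM; the model name is a placeholder):

```python
from vllm import LLM

# With the Ascend scheduler (no chunked prefill), max_num_batched_tokens must
# be no smaller than max_model_len, or engine initialization rejects the config.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    max_model_len=32768,
    max_num_batched_tokens=32768,
)
```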

### Does this PR introduce _any_ user-facing change?
Users/developers running the model according to the
[tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node.html)
can now specify the parameters correctly.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-11-24 10:57:50 +08:00