xc-llm-ascend

Author	SHA1	Message	Date
Ting FU	e8e20c0bbf	[BugFix] Fix Qwen2.5_Omni vision customized op attr err (#4568 ) The Qwen2.5_Omni vision tower uses AscendRMSNorm, which contains a property function that would be overridden by set_forward_context(). Patch Qwen2_5OmniThinkerForConditionalGeneration with customized _process_image_input() and _process_video_input() to fix it. ### What this PR does / why we need it? Fix the Qwen2.5_Omni image/video inference issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: Ting FU <futing10@huawei.com>	2025-12-01 09:18:55 +08:00
Wang Yixuan	c68ddc11ce	[OPS] add bmm_transpose ops (#3990 ) ### What this PR does / why we need it? Add a new fused op to custom_op that combines torch.bmm() and transpose to achieve better performance. This op is used in mla_v1 to replace the separate bmm and transpose ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.2 --------- Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-01 09:09:51 +08:00
欧派果奶我还要	bc67696a02	[EPLB][Ops] Integrate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB (#4216 ) ### What this PR does / why we need it? Integrate grouped_matmul_swiglu_quant_weight_nz_tensor_list into dynamic EPLB to support list-type parameters. This PR also modifies the model-loading logic in the dynamic-EPLB scenario. The operator is based on this PR: https://github.com/vllm-project/vllm-ascend/pull/3804 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ``` vllm serve /home/weight/DeepSeek-V3.1_w8a8mix_mtp \ --max_num_seqs 8 \ --max-model-len 8192 \ --max-num-batched-tokens 16384 \ --tensor-parallel-size 8 \ --data-parallel-size 2 \ --enable-expert-parallel \ --served-model-name ds_r1 \ --enable-auto-tool-choice \ --tool-call-parser hermes \ --no-enable-prefix-caching \ --port 8999 \ --quantization "ascend" \ --gpu-memory-utilization 0.85 \ --trust-remote-code \ --compilation_config '{"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \ --additional-config='{"dynamic_eplb":true, "num_iterations_eplb_update":100, "num_wait_worker_iterations":100}' ``` input&output: 2k 2k This PR: <img width="1318" height="695" alt="fusion" src="https://github.com/user-attachments/assets/f8657813-0c02-42f4-8396-d99e730f48cd" /> Baseline: <img width="1323" height="690" alt="baseline" src="https://github.com/user-attachments/assets/e1323a78-af26-4523-820c-e20e5642a38e" /> - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: 欧派果奶我还要 <845473182@qq.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>	2025-11-30 22:52:05 +08:00
Slightwind	18eefc23c3	[feature] Support W8A8 PD-Mix Quantization (#4235 ) In PD-separated deployment scenarios: * MoE layers use dynamic quantization exclusively. * For the Attention module, Prefill (P) nodes use dynamic quantization, while Decode (D) nodes use static quantization. In PD-mixed deployment scenarios: * All components fall back to dynamic quantization, as it is difficult to distinguish between Prefill and Decode tokens. ___ - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: Slightwind <slightwindsec@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-30 11:57:26 +08:00
Chao Lei	ff7061317f	[Bugfix] Fix kvpool precision synchronization (#4574 ) ### What this PR does / why we need it? Fix kvpool precision synchronization Issue https://github.com/vllm-project/vllm-ascend/issues/4412 - vLLM version: v0.11.2 --------- Signed-off-by: LCAIZJ <leichao139636@163.com>	2025-11-30 09:39:07 +08:00
weijinqian0	2b3bfe432e	[bugfix] Repair the problem of moe model accuracy caused by version upgrade. (#4562 ) Repair the problem of moe model accuracy caused by version upgrade. Reason: The new version adds the "reduce_output" operation after "forward_impl". Then we have fully taken over the implementation of the FusedMoe module. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-11-30 06:12:39 +08:00
Mengqing Cao	c84efeae25	[CI] Skip test_ngram_correctness as the oom issue block CI (#4578 ) ### What this PR does / why we need it? Skip test_ngram_correctness as the oom issue block CI related CI failure: https://github.com/vllm-project/vllm-ascend/actions/runs/19780591780/job/56680823606 ### Does this PR introduce _any_ user-facing change? N/A - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-11-30 01:34:50 +08:00
Mengqing Cao	517fd9272d	Revert "drop ascend scheduler" (#4580 ) Reverts vllm-project/vllm-ascend#4498 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-11-29 22:20:48 +08:00
DreamerLeader	4dbe4fd123	[feature]Pooling Features and PCP Adaptation (#4143 ) This PR let pooling kv connector support pcp feature - vLLM version: v0.11.2 --------- Signed-off-by: fjw <2270923832@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2025-11-29 22:07:45 +08:00
wangxiyuan	1eb5295a1b	remove qwen3-next model file (#4573 ) Let's remove the qwen3-next model file for now. We'll support it later by using vLLM's original model file - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 18:37:26 +08:00
Nengjun Ma	a3041cd78c	[Bugfix] fix dp parallel + tp > 1 offline inference port conflict (#4539 ) ### What this PR does / why we need it? fix dp parallel + tp > 1 offline inference port conflict issue import PR:https://github.com/vllm-project/vllm-ascend/pull/429 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-11-29 18:37:11 +08:00
wangxiyuan	1874265074	Move mla to ops module (#4575 ) Move mla custom op to correct module - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 18:36:55 +08:00
Shanshan Shen	2a19215e5f	[MM][Model] Remove Qwen2-VL modeling files (#4534 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove Qwen2-VL modeling files. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-29 18:07:01 +08:00
wangxiyuan	6664a4e5ce	improve soc version (#4522 ) Make SOC_VERSION readable for users. Now users can simply set "910b", "910c", or "310p" - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 17:42:16 +08:00
wangxiyuan	f10acddb78	drop ascend scheduler (#4498 ) The Ascend scheduler was originally added for the non-chunked-prefill case, because the npu ops didn't work well with chunked prefill. Now that the ops work better with chunked prefill, it's time to remove the ascend scheduler and use the vLLM default scheduler. - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 16:18:34 +08:00
liziyu	53a52d6614	[P/D] [bugfix] add get_kv_connector_handshake_metadata func for 0.11.2 (#4567 ) ### What this PR does / why we need it? add get_kv_connector_handshake_metadata func for 0.11.2 Signed-off-by: liziyu <liziyu16@huawei.com>	2025-11-29 16:09:45 +08:00
LI SHENGYONG	0151022ab8	[bugfix] dep ineffective (#4417 ) ### What this PR does / why we need it? The expert mapping table and weights of the dynamic EPLB were not updated, causing the accuracy to be correct but not effective. This bug has now been fixed. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-11-29 15:18:29 +08:00
wangxiyuan	8ebbf13c1a	Update triton package name (#4563 ) Add `aarch64` suffix to make sure the package name is OK - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 15:00:40 +08:00
Ting FU	b747c95cfa	[Doc] Add single NPU tutorial for Qwen2.5-Omni-7B (#4446 ) ### What this PR does / why we need it? Add single NPU tutorial for Qwen2.5-Omni-7B - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: Ting FU <futing10@huawei.com>	2025-11-29 11:57:29 +08:00
Ting FU	9af34755ff	[Bugfix] Fix model run _npu_flash_attention hang issue (#4410 ) Fix the hang when a model runs _npu_flash_attention in _forward_prefill_no_cache; it was caused by a wrong attention mask dtype. ### How was this patch tested? Yes, tested on Qwen2.5-VL and Qwen2.5-Omni - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: Ting FU <futing10@huawei.com>	2025-11-29 09:20:22 +08:00
wangxiyuan	048d350f9e	update triton package url (#4552 ) The Triton package url is not correct. This PR fixes it. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-28 21:00:49 +08:00
shiyuan680	1c4a0468ee	[OPS] qwen3-next support triton chunk_gated_delta_rule ops (#4070 ) ### What this PR does / why we need it? qwen3-next support triton chunk_gated_delta_rule ops ### co-owners @OsirisDuan - vLLM version: v0.11.2 Signed-off-by: shiyuan680 <917935075@qq.com>	2025-11-28 20:55:43 +08:00
fems14	5447a039b9	[Feature][main]reconstruction kvpool connector to ascend connector (#4438 ) ### What this PR does / why we need it? 1. In short, we renamed the existing MooncakeStoreConnector to AscendStoreConnector and extracted the storage engine interaction logic into a new Backend class. Associated RFC: https://github.com/vllm-project/vllm-ascend/issues/4329 2. Fixed the issue where the number of input parameters for the connector was incorrect, introduced in vllm 0.11.2 ### Does this PR introduce _any_ user-facing change? change MooncakeStoreConnector to AscendStoreConnector ### How was this patch tested? - vLLM version: v0.11.2 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-11-28 18:08:37 +08:00
Chenxi Qian	554f16ae1f	[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 ) ### What this PR does / why we need it? This PR introduces support for adding custom CANN `aclnn` ops to `vllm-ascend`, allowing users to define and use their own custom operators. Key changes include: - Building and installing custom ops into the `vllm-ascend`-specified directory - Binding the `aclnn` op interface to the `torch.ops._C_ascend` module - Enabling invocation of these ops within `vllm-ascend` This PR includes a sample custom op: `aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from the CANN operator [`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md). Its input parameters `weight` and `weight_scale` now accept `list[torch.Tensor]` (i.e., `at::TensorList`). ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.11.2 --------- Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>	2025-11-28 18:06:39 +08:00
herizhen	3199fe8350	[Doc]Delete equals sign (#4537 ) ### What this PR does / why we need it? Delete equals sign in doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: herizhen <you@example.com> Co-authored-by: herizhen <you@example.com>	2025-11-28 17:09:26 +08:00
wangxiaoteng888	366d2d95e8	[P/D] Add readme for PD separation (#4182 ) ### What this PR does / why we need it? Add readme for PD separation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-11-28 15:17:59 +08:00
Shanshan Shen	e52ebf8674	[MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention (#4349 ) ### What this PR does / why we need it? - [x] Patch `Qwen2_5_VisionAttention` with `AscendQwen2_5_VisionAttention`. - [x] Replace `AscendQwen2_5_VisionTransformer` with `Qwen2_5_VisionTransformer` in vllm. - [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of `Qwen2_5_VisionAttention`. - [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative form to intervals and move it to cpu (compatible with npu FA). - [x] Remove Qwen2.5-VL modeling files. - [x] Remove Qwen2.5-VL (without padding) modeling files. - [x] Remove related UT. - [x] Make `set_forward_context` pluggable when getting MM embedding. Find more details at https://github.com/vllm-project/vllm/pull/29388. - [x] Simplify padding logic for FA. - [x] Add patch for https://github.com/vllm-project/vllm/pull/28798. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - [x] Functional test (eager mode) - [x] Functional test (graph mode) - [x] Benchmark - vLLM version: v0.11.2 --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-28 14:23:00 +08:00
LHXuuu	bdc66972db	[Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vllm. 2. Support CompressedTensorsW8A8 static weight. - weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric. 4. Support CompressedTensorsW8A8Dynamic weight. - weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic. 5. Modify the override_quantization_method in AscendQuantConfig. Co-authored-by: taoqun110 taoqun@huawei.com Co-authored-by: chenxi-hh chen464822955@163.com - vLLM version: v0.11.2 --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: chenxi-hh <chen464822955@163.com> Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com> Co-authored-by: chenxi-hh <chen464822955@163.com> Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>	2025-11-28 14:09:39 +08:00
SILONG ZENG	ab37a7d5ae	[main]Upgrade cann to 8.3rc2 (#4350 ) ### What this PR does / why we need it? Upgrade cann to 8.3rc2 ### Does this PR introduce _any_ user-facing change? Yes, docker image will use 8.3.RC2 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-11-28 14:06:01 +08:00
Zhu Yi Lin	755b635844	[TEST] Add eagle proposer ut (#4447 ) ### What this PR does / why we need it? Add eagle proposer ut - vLLM version: v0.11.2 Signed-off-by: GDzhu01 <809721801@qq.com>	2025-11-27 21:59:31 +08:00
Slightwind	9fdabb7b60	[feature] Add Custom Op grouped_matmul_swiglu_quant (#4431 ) This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer to simplify the invocation of `aclnn` operators on Ascend NPUs. Key Changes: * Adapter Layer: Added `EXEC_NPU_CMD` macro and related dependencies to standardize `aclnn` calls. * Operator Support: Integrated `grouped_matmul_swiglu_quant` as a reference implementation to demonstrate the usage of the new macro. --- - vLLM version: v0.11.2 --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-11-27 21:56:18 +08:00
Nengjun Ma	89a1a65300	[bugfix] fix ray start failed: local_world_size cannot little than visible device count error (#4457 ) ### What this PR does / why we need it? Fix the ray start failed bug: local_world_size cannot little than visible device count error detail see issue #4456. This fix code is copied from vllm fixing modify, PR: [#28873](https://github.com/vllm-project/vllm/pull/28873) - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-11-27 21:18:32 +08:00
drslark	1cae3e4a49	[BugFix] Adapted Qwen3-Next eager mode to v0.11.2 (#4477 ) ### What this PR does / why we need it? Adapted Qwen3-Next eager mode to `v0.11.2`. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: drslark <slarksblood@qq.com>	2025-11-27 17:44:59 +08:00
Li Wang	b220de33e8	[CI][Nightly] Support local debugging for multi-node CI test cases (#4489 ) ### What this PR does / why we need it? This patch mainly doing the following things: 1. Make k8s/lws optional for multi-node testing, allowing developers to run multi-node tests locally by actively passing in the IP addresses of all nodes. 2. Allows passing a custom proxy script path in the config file to load the proxy. - vLLM version: v0.11.2 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-27 17:20:29 +08:00
zzzzwwjj	1fd56b1106	chip type judgement code optimization (#4485 ) ### What this PR does / why we need it? \| \| cpu envir \| npu envir \| \|---\|---\|---\| \| set `SOC_VERSION` \| check if `SOC_VERSION` is in dict `soc_to_device`, if not, raise an error that can not support current chip type. \| print a warning log when `SOC_VERSION` is not equal to chip type from `npu-smi`, same as left for others. \| \| not set `SOC_VERSION` \| raise an error that `SOC_VERSION` is necessary when compiling in a cpu envir. \| use chip type from `npu-smi` to compile vllm-ascend. \| ### Does this PR introduce _any_ user-facing change? Now we must set env `SOC_VERSION` when compiling in cpu envir. ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-11-27 17:18:49 +08:00
zhangxinyuehfad	84d7f5a10d	[UT] Fix ut test (#4472 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-26 21:37:47 +08:00
herizhen	d252e36ae8	Change comment location (#4432 ) ### What this PR does / why we need it? When running 'python example.py',connection issues often occur.The solution is to comment out the first line the code. Complete the specific names of machines A2 and A3. Standardize document format,a space should be added after the colon. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.2 --------- Signed-off-by: herizhen <you@example.com> Co-authored-by: herizhen <you@example.com>	2025-11-26 16:13:31 +08:00
zzzzwwjj	136ea9ff56	[refact] unified soc_version code (#4359 ) ### What this PR does / why we need it? Currently, there are two paths to judge the chip type in code: `get_ascend_soc_version` uses the `get_soc_version` api in torch_npu, and `is_310p` uses `_build_info.__soc_version__`, which is generated at install time. We need to unify the two paths based on the following points: 1. We need to ensure consistency in chip type judgment between compiling and running states; 2. In compiling state, we need the chip type to complete op compilation, but in running state, we only need the device type (910B/910_93/310P/910_95/etc) to make code branch judgements; 3. In compiling state, torch_npu may not have been installed yet, so we can't use torch_npu's api. Based on the above points, we have made the following changes: 1. When the user sets env `SOC_VERSION`, use it; when it is not set, query soc_version by `npu-smi`; 2. Generate device_type based on soc_version when compiling, and write `__device_type__` instead of `__soc_version__` in `_build_info.py`; 3. In running state, use `__device_type__` to judge code branches. ### Does this PR introduce _any_ user-facing change? When env `SOC_VERSION` is not set, it no longer defaults to `ASCEND910B1`; soc_version is queried by `npu-smi` instead. And env `SOC_VERSION` must be in the list `soc_to_device` in `setup.py`. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-11-26 14:28:55 +08:00
wangxiyuan	a91e76cd84	[CI] clean up ci (#4452 ) 1. Run 4-card test only when single and 2-card test passed 2. rename file to make it more clear 3. remove useless pd workflow, it has been managed by nightly test already. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-26 14:07:56 +08:00
wangxiyuan	bc69d7cfe1	upgrade to vllm 0.11.2 (#4400 ) Bump vLLM version to v0.11.2 What's broken and changed by vLLM: 1. structured_output is broken by https://github.com/vllm-project/vllm/pull/26866 2. get_mrope_input_positions is broken by https://github.com/vllm-project/vllm/pull/28399 3. graph mode is broken by https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to 2.8 to fix the problem later 4. embedding is broken by https://github.com/vllm-project/vllm/pull/27583 5. `get_attn_backend_cls` and attention backend are broken by https://github.com/vllm-project/vllm/pull/28534 6. spec decode is broken by https://github.com/vllm-project/vllm/pull/28771 7. sp feature is broken by https://github.com/vllm-project/vllm/pull/27126 8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922 9. lora is broken by https://github.com/vllm-project/vllm/pull/21068 10. execute_model is broken by https://github.com/vllm-project/vllm/pull/26866 11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by https://github.com/vllm-project/vllm/pull/28159 12. kv cache is broken by https://github.com/vllm-project/vllm/pull/27753 13. dp is broken by https://github.com/vllm-project/vllm/pull/25110 What's broken and changed by ourselves: 1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455 We'll remove model files in the future to avoid this kind of error 2. Engine core is broken by https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch file in the future. 3. Ascend scheduler is broken by https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend scheduler later. 4. qwen3-next is broken by https://github.com/vllm-project/vllm/pull/28083 We'll remove model files in the future to avoid this kind of error 5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764. We'll remove model files in the future Known issue: 1. ray doesn't work 2. the accuracy of qwen3-next is not correct 3. qwen3-vl is broken 4. prefix cache + ascend scheduler + deepseek v2 lite is broken. Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: shen-shanshan <467638484@qq.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-11-26 11:48:58 +08:00
shiyuan680	d5f77f14d0	mkdir triton package and move triton files (#4420 ) ### What this PR does / why we need it? mkdir triton package and move triton files - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: shiyuan680 <917935075@qq.com>	2025-11-26 11:06:12 +08:00
Zhu Yi Lin	1b137d6b1b	[TEST] Delete Comment (#4427 ) ### What this PR does / why we need it? Delete useless comments. ### Does this PR introduce _any_ user-facing change? No - vLLM main: `2918c1b49c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-11-25 21:39:04 +08:00
wangxiyuan	98031653df	[misc] Remove useless patch_logits (#4252 ) Torch-npu 2.7.1 has fixed the device check bug. This patch can be removed now. - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-25 21:25:54 +08:00
Shanshan Shen	4864909648	[MM][Bugfix] Minor fix for VL model verification (#4384 ) ### What this PR does / why we need it? To fix the ops test, where `model_config` has been set to `None` and doesn't have the `hf_config` attribute, we have added a check to guarantee that `model_config` is not `None`. - vLLM main: `2918c1b49c` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-25 20:36:16 +08:00
Zhijun Chen	463910e686	[Bugfix] use module-level import for patched function in Qwen3Next (#4354 ) ### What this PR does / why we need it? Problem: The Qwen3Next model implementation currently imports chunk_gated_delta_rule directly using `from ... import ...`. In frameworks like `verl`, the model file is often imported before `vllm-ascend` initializes and applies its patches. This causes the model to permanently hold a reference to the original (unpatched) vLLM kernel, resulting in execution errors on Ascend devices even if the patch runs later. Solution: Changed the import style to `from vllm...ops import chunk` and call `chunk.chunk_gated_delta_rule()`. This ensures that the function lookup happens at call time, allowing the model to correctly pick up the patched function regardless of import order. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: zjchenn <zjchenn@gmail.com>	2025-11-25 20:15:43 +08:00
SILONG ZENG	941d54a2ce	[bugfix]Return the Transformer version from 4.57.2 to 4.57.1 (#4423 ) ### What this PR does / why we need it? This PR pins the transformers dependency to 4.57.1. Reason: CI tests (specifically test_completion_with_prompt_embeds.py) are failing with an AttributeError: 'dict' object has no attribute 'model_type' when using newer versions of transformers. The issue stems from a bug in tokenization_utils_base.py where the code attempts to access the model_type field of a configuration dictionary (_config) using dot notation (_config.model_type) instead of dictionary key lookup (_config["model_type"] or _config.get("model_type")). This occurs in the logic block checking for transformers_version <= 4.57.2. Pinning the version to 4.57.1 bypasses this buggy code path and restores CI stability. Error Traceback: ``` shell /usr/local/python3.11.13/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2419: if _is_local and _config.model_type not in [ E AttributeError: 'dict' object has no attribute 'model_type' ``` - vLLM main: `2918c1b49c` Signed-off-by: MrZ20 <2609716663@qq.com>	2025-11-25 15:32:24 +08:00
欧派果奶我还要	31a2c09e79	[Bugfix] fix patch typo (#4351 ) ### What this PR does / why we need it? Fix a bug caused by this PR: https://github.com/vllm-project/vllm-ascend/pull/4223 The bug makes vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py patch in the wrong way ### How was this patch tested? Tested on a single node. When the environment variable DYNAMIC_EPLB is set to true, the patch works correctly. When it's set to false, the patch is not applied - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>	2025-11-25 15:13:06 +08:00
herizhen	e945e91933	Document error correction (#4422 ) ### What this PR does / why we need it? The "g" at the beginning of the current sentence is redundant and needs to be deleted "MindIE Turbo" is no longer required to be displayed. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM main: `2918c1b49c` --------- Signed-off-by: herizhen <you@example.com> Co-authored-by: herizhen <you@example.com>	2025-11-25 14:21:13 +08:00
wujinyuan1	06f6cc1c81	[Bugfix]Fix the hang issue of multimodal model when running with DP>1 (#4392 ) ### What this PR does / why we need it? When cudagraph_mode is set to FULL_DECODE_ONLY, if dp > 1, the dummy-run process will be triggered. When calling the update_attn_params function, the num_tokens parameter needs to be passed, and this value is obtained through positions.shape[0]. However, the multimodal model uses mRope (multi-dimensional rotary positional embeddings), which makes positions a 2-dimensional tensor. As a result, the value obtained from positions.shape[0] is incorrect. We solve this problem by replacing positions.shape[0] with num_tokens. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: v0.11.0rc3 vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2025-11-25 09:33:49 +08:00
dependabot[bot]	84eae97f27	Bump actions/checkout from 4 to 6 (#4380 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM main: `2918c1b49c` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-11-25 09:05:11 +08:00
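
Note on commit 463910e686 (module-level import for a patched function): the import-order problem described there can be reproduced with a small, self-contained Python sketch. Everything named below (`kernels`, `kernel`, `original_kernel`, `patched_kernel`) is a hypothetical stand-in and not the vLLM or vllm-ascend API; the sketch only illustrates why a name bound by `from module import name` keeps pointing at the unpatched function, while an attribute lookup on the module sees the patch.

```python
import types

# Hypothetical stand-in for a library module that exposes a kernel function.
kernels = types.ModuleType("kernels")


def original_kernel():
    return "original"


def patched_kernel():
    return "patched"


kernels.kernel = original_kernel

# Equivalent of `from kernels import kernel`: the function object is bound
# to a local name once, at import time.
early_bound = kernels.kernel


# Equivalent of `import kernels` followed by `kernels.kernel()`: the
# attribute is looked up on the module every time the call is made.
def call_via_module():
    return kernels.kernel()


# A plugin applies its patch after the model file was already imported.
kernels.kernel = patched_kernel

print(early_bound())      # still the original function
print(call_via_module())  # picks up the patched function
```

Under these assumptions the first call still returns the original function's result and the second returns the patched one, which matches the behaviour the commit message describes for model files imported before the patch is applied.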


1501 Commits