xc-llm-ascend

Author	SHA1	Message	Date
wangxiyuan	7f2673ea2d	upgrade vLLM to main (#4608 ) 1. fix https://github.com/vllm-project/vllm/pull/28542 The model structure modifications we involved in are: - Qwen2.5-VL(still exist some patch) - Qwen2-VL - Qwen2 - DeepSeek series - Qwen-moe series 2. fix https://github.com/vllm-project/vllm/pull/29121 the output token now type changed from np to `list[list[int]]` 3. fix https://github.com/vllm-project/vllm/pull/29262 `xformers` backend for multimodal now has been deprecated 4. fix https://github.com/vllm-project/vllm/pull/29342 5. fix https://github.com/vllm-project/vllm/pull/28579 6. fix https://github.com/vllm-project/vllm/pull/28718 7. fix https://github.com/vllm-project/vllm/issues/28665 8. fix https://github.com/vllm-project/vllm/pull/26847 vllm introduced the `optimization-level`, some default config has been changed, and the param `--enforce-eager` has been deprecated 9. fix http://github.com/vllm-project/vllm/pull/29223 it retuns tuple for sampler. 10. fix https://github.com/vllm-project/vllm/pull/29471 we'll remove the related patch to avoid this kind of error. Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-12-02 22:10:52 +08:00
MengLong Chen	143e1f46d0	[Feat] shared expert dp for deepseek_mtp (#3811 ) ### What this PR does / why we need it? Support shared expert DP for deepseek_mtp feature. `shared_expert_dp` requires `SP==True`, with corresponding parameter restrictions. Previously, due to the coupling between `shared_expert_dp` and torchair, and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of deepseek_mtp was temporarily removed. Currently, by performing the `reduce_scatter` on the input of deepssek_mtp in `mtp_proposer.py`, we ensure that it matches the dimensions of `input_embedding`, and then perform the `all_gather` on the output of mtp. ### How was this patch tested? baseline: <img width="1184" height="692" alt="image" src="https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3" /> enable shared_expert_dp and multistream_overlap_shared_expert: <img width="1167" height="687" alt="image" src="https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc" /> TPOT: 48ms -> 45.4ms Average TPS per rank: 117.6 -> 126.1 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: chenmenglong <chenmenglong1@huawei.com> Signed-off-by: zengran <zengran2@huawei.com> Co-authored-by: zengran <zengran2@huawei.com>	2025-12-01 20:44:11 +08:00
wangxiyuan	0d14f635b4	upgrade torch npu version (#4433 ) vLLM graph feature now rely on torch >=2.8. To make graph mode work, we need upgrade torch version as well. For long term support, upgrade torch to a newer one is good to go as well. Related vLLM change: https://github.com/vllm-project/vllm/pull/25110 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-12-01 19:01:55 +08:00
Wang Yixuan	c68ddc11ce	[OPS] add bmm_transpose ops (#3990 ) ### What this PR does / why we need it? Add a new fusion ops to custom_op, which can cobime the torch.bmm() and transpsose to achieve better peformance. This ops is used in mla_v1 to replace the bmm and transpose ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.2 --------- Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-01 09:09:51 +08:00
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377 ) There is a lot hack code for v0.11.0, which makes the code hard to upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
anon189Ty	5c9f4a40c6	[Feat] Support MTP to running in full graph mode (#3892 ) ### What this PR does / why we need it? Currently, the MTP model still runs in eager in full graph mode. This PR adapts the MTP with the full graph capture and execution. When the graph mode is set to "FULL_DECODE_ONLY", the MTP will run in full-graph to improve the performance. The change in both disable_padded_drafter_batch is True and False case include: 1. Add _mtp_graph_params in acl_graph.py to isolate the data of main model and the data of MTP. 2. Padding some metadata in mla_v1.py when in fullgraph mode. 3. Fixed the essential data address that will be used in model.forward. 4. Adapted according to the aclgraph capture framwork: 1). Rebuild MTP model with ACLGraphWrapper. 2). Add common attn metadata when start capture in MTP dummy_run. 3). Add common attn metadata update in MTP. 4). Addapted data update when num_speculative_tokens > 1. 5. Add a patch of MTP to adapt vllm v0.11.0. Existing Issues: 1. When disable_padded_drafter_batch=True and running in FullGraph mode, the data of the first-round requests in MTP is abnormal. We need to identify the cause subsequently. 2. When disable_padded_drafter_batch=False and running in FullGraph mode, the acceptance rate of the second and third tokens will decrease (For example, if we set the num_speculative_tokens=3, the acceptance rate of first token is 90%, the second is only 50% lower than 60%, the third is only 20% lower than 30%). The reason is that the data processed after the model runs does not match. This is a problem from another PR. It works fine in eager and PIECEWISE mode, but has problem in FullGraph mode. Once we have a solution, we will submit a bugfix. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-11-20 20:34:54 +08:00
wangxiyuan	2938bd5ad2	remove get_metadata_cls (#4087 ) remove get_metadata_cls. It's only used for V0 engine and has been removed from vLLM already. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-19 14:58:17 +08:00
zhangsicheng5	df777e9faa	[bugfix] pcp + mtp acl graph bugfix (#4221 ) Fix pcp + mtp bug while using acl graph. While using pcp + mtp, we need to flatten block_table to avoid irregular attn mask shape, this was done in mla attn_metadata builder, but we found out that this influences block_table address and leads to incorrect results while enable acl graph. To fix this, we enlarge block_table buffer size and flatten block_table in model_runner prepare_inputs, so this will not influence block_table address. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-11-19 11:21:46 +08:00
XiaoxinWang	e38ef2c434	support FULL graph mode for GQA (#3970 ) ### What this PR does / why we need it? The current library only supports the FullDecodeOnly graph mode, which enables full graph execution during the decode. This PR extends support to allow full graph execution in both the prefill and decode, referred to as FULL graph mode. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-17 10:50:35 +08:00
LookAround0301	5ec96fd46c	[long_seq_Feat] support chunk prefill (#4158 ) ### What this PR does / why we need it? 1、qwen GQA attention_v1 optim 2、DeepSeek MLA refactor, all gather q -> all gather kv 3、modelrunner refactor for chunk prefill, we remove some code not use - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>	2025-11-14 08:43:37 +08:00
zhaozx-cn	fdd2db097a	[BugFix] Fix kv_no_split not contiguous (#3594 ) allgather need contiguous data, split operation return uncontiguous data. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zhaozx-cn <zhaozx2116@163.com>	2025-11-13 11:28:09 +08:00
22dimensions	c272747d13	Upgrade to 0.11.1 newest vllm commit (#3982 ) ### What this PR does / why we need it? adapt vllm-ascend main branch with vllm releases/v0.11.1 fix `forward context not set` in test_vlm.py caused by: https://github.com/vllm-project/vllm/pull/23207 fix import `cdiv round` failed caused by: https://github.com/vllm-project/vllm/pull/27188 fix import `init_cached_hf_modules` failed caused by: https://github.com/vllm-project/vllm/pull/27567 adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel` caused by: https://github.com/vllm-project/vllm/pull/27654 - remove unused code in sigmoid_gating.py - `class FusedRecurrentFunction` , `fused_recurrent_gated_delta_rule`, `fused_recurrent_gated_delta_rule_fwd` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-11-12 23:01:19 +08:00
zhangsicheng5	a123f355e9	[feature] support pcp + mtp (in pd co-locate scenario) (#4098 ) 1. support pcp + mtp in pd co-locate scenario 2. llmdatadist connector pcp related bugfix and cleancode - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-11-12 17:22:21 +08:00
Apocalypse	71866d5311	[feature] chunkprefill support pcp&dcp (#3801 ) ### What this PR does / why we need it? ChunkPrefill now can support Long Sequence Feature Pcp&Dcp ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI tests passed with self-test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <3834144971@qq.com>	2025-11-11 09:18:02 +08:00
hucong	48094148f8	[BugFix] Improve the performance of prefixcache features (#4022 ) ### What this PR does / why we need it? The code bug caused an empty bubble. When the npu_paged_cache_load operator was called, it forcibly transferred seq_len2 to the device, which triggered synchronization and interrupted the CPU operator's launch stream. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: underfituu <hzhucong@163.com>	2025-11-08 18:45:31 +08:00
LookAround0301	79e536d939	[Feat] update op for mla (#4000 ) ### What this PR does / why we need it? 1、in mla_v1 module, add torch_npu.npu_attention_update op when pcp and dcp ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2025-11-07 09:48:39 +08:00
weiguihua2	2eebe1dc0a	[feat]decode convert bsnd to tnd and fix bug when pcp and dcp (#3980 ) ### What this PR does / why we need it? 1、in attention_v1 module, convert bsnd t0 tnd when pcp and dcp 2、fix tochair bug: service startup problem ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-11-06 14:58:24 +08:00
whx	f6149f3894	[Model][3/N] Refactor sfa into mla and remove deepseek_v3_2.py (#3769 ) This is the follow-up PR to PR #3189, which continues to refactor sfa into mla and finally remove deepseek_v3_2.py. This is the last PR of deepseek modeling refactoring. After this, all deepseek-related model codes are removed from vllm_ascend. FurtherMore, after this PR deepseek v3.2 can run chunk-prefill with correct accuracy. - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-30 17:06:38 +08:00
whx	dc960e798e	[BugFix] Fix mlapo accuracy problem related with weight processing. (#3850 ) This PR fixes a mlapo accuracy problem related with weight processing. Furthermore, add back mlapo related e2e test with quantized deepseek model. - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-30 00:34:55 +08:00
weiguihua2	4312a92a4f	[feat]dcp pcp support aclgraph (#3731 ) ### What this PR does / why we need it? dcp pcp support full aclgraph, including mla attention_v1 - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-10-27 09:58:23 +08:00
Yizhou	8ab8111fde	[Fix] Prevent memory leak in MLA decode graph (#3743 ) ### What this PR does / why we need it? The cache for MLA decode graph parameters was holding strong references to tensors, preventing them from being garbage collected and leading to increased memory usage. This change wraps the cached tensors in weak references, allowing them to be deallocated when no longer in use and reducing overall memory pressure. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 20:37:33 +08:00
zzzzwwjj	e5676fc36e	[main] remove dbo code (#3712 ) ### What this PR does / why we need it? Remove codes of dbo. Currently, vLLM has supported dbo with pr: https://github.com/vllm-project/vllm/pull/23693. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-10-25 15:53:01 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is the step 1 of refactoring code to adapt with vllm main, and this pr aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. bunches of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import,caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` improt, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` improt, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured ouput break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
LookAround0301	b54d44e664	support cp&dcp (#3260 ) ### What this PR does / why we need it? This PR adds the Prefill Context Parallelism (PCP) feature, which corresponds to DCP. For specific implementation details, please refer to the RFC https://github.com/vllm-project/vllm/issues/25749. TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage. ### Does this PR introduce _any_ user-facing change? The current implementation primarily includes the following changes: Modified ModelRunner.py for CP partitioning logic for tokens; Modified attention_v1.py and mla_v1.py to adapt the GQA/MLA backend to PCP. Modified block_tables.py to extend the KV cache storage based on DCP&PCP; Added necessary command-line arguments to control parallelism for PCP; ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: chenjie <chenjie137@huawei.com> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com> Signed-off-by: Feng Liu <liufeng248@huawei.com> Signed-off-by: gaojc <1055866782@qq.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Signed-off-by: z50049692 <zhangmingwei11@huawei.com> Co-authored-by: chenjie <chenjie137@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: zhangsicheng5 <zhangsicheng5@huawei.com> Co-authored-by: Feng Liu <liufeng248@huawei.com> Co-authored-by: gaojc <1055866782@qq.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: z50049692 <zhangmingwei11@huawei.com> Co-authored-by: w00896881 <wangzixuan40@huawei.com>	2025-10-24 10:32:01 +08:00
Yizhou	b13d22bf5a	[Fix] Fixes attribute error in MLA implementation (#3618 ) ### What this PR does / why we need it? Corrects the attribute access for retrieving the device from `q_a_proj` to `q_proj`. This prevents an `AttributeError` as `q_a_proj` does not exist on the class instance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Need MLAPO tests. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-23 09:12:50 +08:00
linfeng-yuan	4c9af353ee	Revert "[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 )" (#3586 ) ### What this PR does / why we need it? This reverts commit `bf87606932`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with `enable_shared_expert_dp: true` in eager mode as before. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-21 22:24:30 +08:00
whx	220df60c61	[Model][2/N] Remove deepseek_mtp modeling. (#3561 ) This PR is step 2 of deepseek model refactoring and removes deepseek_mtp. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-21 20:17:09 +08:00
Zhu Yi Lin	ffb42a8daa	[BugFix] Fixed the bug that caused the transposematmul operator to report an error due to the shape being too large (#3578 ) ### What this PR does / why we need it? npu_transpose_batchmatmul has the problem that the shape being too large - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: GDzhu1 <809721801@qq.com>	2025-10-21 20:16:54 +08:00
Chen Chen	6b290acfe1	remove redundant params in mla_preprocess kernel (#3530 ) ### What this PR does / why we need it? This pull request removes the redundant parameters `gamma1` and `beta1` (also named `gamma0`/`beta0` in some places) from the `mla_preprocess` kernel and its calling hierarchy. The changes are consistent across C++ kernel code, bindings, and Python call sites. The parameters were unused in the lower-level functions, so their removal is a good cleanup. ### Does this PR introduce _any_ user-facing change? The python interface of the kernel is affected, and the params of `gamma0` and `beta0` are not needed. ### How was this patch tested? The unit-test of the kernel is adapted accordingly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-10-21 19:20:13 +08:00
lilinsiman	70bef33f13	add new accuracy test case for aclgraph (#3390 ) ### What this PR does / why we need it? Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-20 20:04:04 +08:00
Zhu Yi Lin	34c2996ab8	[main] v_proj combining transpose and matmul (#3545 ) ### What this PR does / why we need it? v_proj combining transpose and matmul ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: GDzhu1 <809721801@qq.com>	2025-10-20 19:53:32 +08:00
whx	f8b52fe950	[Model][1/N] Delete deepseek v2/v3 modeling codes. (#3189 ) This PR deletes model codes of deepseek_v2 and deepseek_v3 to reuse the model file from vLLM. vLLM Ascend now uses custom ops register way instead of model file hard-coding. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-20 15:31:34 +08:00
Angazenn	9547d6f0d9	[Core]Append padding logic for Attention (#3256 ) ### What this PR does / why we need it? This PR aims to add padding logic to seq_lens、block_tables when running in full decode scenario. Before this PR, the number of input tokens with padding might exceeds corresponding seq_lens. For example, when running in full decode scenario: ``` input_ids : [1, 3, 0, 0] seq_lens: [2, 1] query_start_loc: [0, 1, 2] ``` Here, `input_ids` is padded by 2 tokens while `seq_lens`/`query_start_loc` are not. The mismatch between `input_ids` and `seq_lens`/`query_start_loc` might cause some potential bugs. This PR would change it into : ``` input_ids : [1, 3, 0, 0] seq_lens: [2, 1, 1, 1] query_start_loc: [0, 1, 2, 3, 4] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-10-17 21:56:01 +08:00
anon189Ty	248ee7fa11	[Feat]Make full graph mode compalible with MTP (#3276 ) ### What this PR does / why we need it? Make the Full Graph mode can run with MTP. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-17 20:19:56 +08:00
zhaozx-cn	bf87606932	[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 ) ### What this PR does / why we need it? shared expert dp for deepseek and deepseek_mtp, could be combined with sp to improve performance. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhaozx-cn <zhaozx2116@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-10-17 15:06:37 +08:00
realliujiaxu	f69a83b7ba	[Feat] Flash comm allgher ep (#3334 ) Support flash comm v1(Sequence Parallelism) for Allgather EP. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-10-15 19:36:32 +08:00
linfeng-yuan	099255e933	[bugfix] fix pipeline parallel for mla & sfa attention backend (#3459 ) ### What this PR does / why we need it? Fix pipeline parallel break for mla & sfa attention backend caused by a magic number in metadata builder. The error report: `AttributeError: 'PPMissingLayer' object has no attribute 'self_attn'` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR was tested with "mp" backend (PP2TP8 on an A3 node) as well as "ray" backend (PP2TP8 on two A2 nodes). - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-15 17:13:27 +08:00
LeeWenquan	4e720936d8	Fix warning msg print (#3421 ) ### What this PR does / why we need it? Avoid printing some warning msg as below : UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach ... ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-10-15 11:30:30 +08:00
Chen Chen	16cb3cc45d	adapt the mla_v1 with the `mla_preprocess` kernel (#3397 ) ### What this PR does / why we need it? This pull request integrates a new `mla_preprocess` kernel to create an optimized path for MLA (Multi-Layer Attention) decode operations on Ascend hardware, controlled by an environment flag. The changes include new utility functions for weight transformation, a method to prepare weights for the fused kernel, and logic to route decode-only batches to this new path. My review identified a critical bug in the `transdata` utility function where padding dimensions are swapped, which will lead to incorrect tensor shapes and kernel failures. Additionally, I've pointed out a high-severity maintainability issue in the trans_rope_weight function, which modifies its input in-place, and I have provided a pure-function alternative. ### Does this PR introduce _any_ user-facing change? No user-facing changes by default. User can enable the `mla_preprocess` kernel in model by enable the env-var `VLLM_ASCEND_ENABLE_MLAPO`. ### How was this patch tested? Dedicated Ascend kernels are not covered by our CI yet, so no extra automated tests were added. Future MLA-focused regression runs will cover this path. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-10-15 10:34:25 +08:00
anon189Ty	07e39620ea	[Feat] Unquantized Linear to nz and control all nz-cast (#3356 ) ### What this PR does / why we need it? Currently, when executing to the Linear layer of models in vLLM-Ascend, the weights format is ND in unquantized case and skipped ascend case. This PR supplements the execution logic for Linear layer. We use a new global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and CANN version is 8.3, the weights of the Linear layer will be converted to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as w8a8-quantized case. ### Does this PR introduce _any_ user-facing change? Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ format, you should set VLLM_ASCEND_ENABLE_NZ=1. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-14 17:39:26 +08:00
panchao-hub	1756efa5fd	[Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125 ) ### What this PR does / why we need it? Adds support for capturing the Multi-Layer Attention (MLA) decode operation into an ACL graph. This improves performance by compiling the attention kernel for single-token decoding. Key changes include: - Implementing the graph capture logic for the MLA kernel, including workspace management and parameter updates. - Modifying the rotary embedding (RoPE) handling to use pre-allocated tensors, which is a requirement for graph capture. - Adding a `build_for_graph_capture` method to the MLA metadata builder to create dummy metadata during the graph compilation phase. Known issues: - Currently, MTP is not supported in FULL_DECEDE_ONLY mode -- we're working on a fix - We are preparing to remove update_mla_attn_params with auto_dispatch_capture ### Does this PR introduce _any_ user-facing change? compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: panchao-hub <315134829@qq.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-10 16:31:20 +08:00
Ruri	ff37575936	[1/N][Feat] Add weight prefetch feature for Attention layers (#3146 ) ### What this PR does / why we need it? - Refacotr and integrate a unified `WeightPrefetchMethod` - Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators imporves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": false, "prefetch_ratio": { "attn": { "qkv": 1.0, "o": 1.0, }, }, }, } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com> Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Co-authored-by: yuzhup <15705211260@163.com>	2025-10-09 20:38:39 +08:00
zouyida2052	b72e3327a6	bugfix for mtp>1 (#3174 ) ### What this PR does / why we need it? fix bugs when mtp>1, and reorder input batch when mtp is not accepted. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? by ci - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-09-26 09:04:16 +08:00
lidenghui1110	0f3939e5a9	[Feature]cpu offload connector (#1659 ) This PR implements cpu offload connector to enable NPU kv cache offload to host DRAM. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: lidenghui <lidenghui1110@gmail.com> Signed-off-by: AlvisGong <gwly0401@163.com> Signed-off-by: CalvinXKY <kyxiezju@163.com> Co-authored-by: AlvisGong <gwly0401@163.com>	2025-09-23 14:25:05 +08:00
Yizhou	39a85c49fa	[Refactor] Rename cudagraph_support to aclgraph_support (#3104 ) ### What this PR does / why we need it? Updates the `cudagraph_support` attribute to `aclgraph_support` to use terminology appropriate for the Ascend platform (ACL graphs instead of CUDA graphs). This change also explicitly disables graph support for the MLA attention backend. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-23 11:30:31 +08:00
Yizhou	338231acaf	[Feat][Graph] Support `FULL_DECODE_ONLY` mode for GQA/MHA models (#2128 ) Note: This depends on [vLLM #25161](https://github.com/vllm-project/vllm/pull/25161) and the torch\_npu release from September 30. ### What this PR does / why we need it? This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include: * Reduced dispatch latency: By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays. * Stabilized multi-device performance: Captureing the whole model as one static graph also mitigates the dispatch fluctuations across devices. * Stream/resource savings: Consolidating graph captures frees up streams, allowing more graphs to be captured. Known issues: 1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay — we’re working on a fix. There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups. This is essentially a port of #1503 and #1677, but includes two major changes: 1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo. 2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet. ### Does this PR introduce _any_ user-facing change? ```python compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ``` ### How was this patch tested? Tests included. - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-22 17:14:28 +08:00
xuyexiong	2a87b4cecb	[Bugfix] Fix specdecoding in chunkedprefill scenario (#3025 ) ### What this PR does / why we need it? The speculative decode phase of chunkedprefill has taken an incorrect path, should always use TND layout for speculative decoding. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `6d8246aaff` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-19 14:05:08 +08:00
LeeWenquan	f4e3d22432	Remove chunked_prefill_for_mla and fix ring_mla bug (#2781 ) ### What this PR does / why we need it? Remove chunked prefill for mla branch in mla , and change dtype of prefill_mask to avoid accuracy problem ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `ef7eefe17a` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-09-18 19:43:26 +08:00
xuyexiong	6681dde902	[Feat][Graph] Support MTP for ACL Graph (#2932 ) ### What this PR does / why we need it? This PR depends on the merge of #2707 and has adapted the aclgraph functionality to support MTP. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `2b85697031` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-18 14:05:33 +08:00
realliujiaxu	723d460894	[Bugfix] fix kv nz accuracy bug (#2988 ) when `enable_kv_nz` is true, output of Deepseek R1 is invalid. - vLLM version: v0.10.2 - vLLM main: `2b85697031` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-09-17 21:10:25 +08:00

1 2 3

107 Commits