xc-llm-ascend

Author	SHA1	Message	Date
Wang Yixuan	d412565ec9	[Cherry-pick]bmm_transpose to v011dev (#3995 ) ### What this PR does / why we need it? Add a custom op to accelerate the DeepSeek model. The fused op combines bmm and transpose and is applied to the MLA module. Cherry-picked from commit c68ddc11ce53334fc9a17bad58342148cbf14e86 ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-08 19:22:14 +08:00
henryxuxu0716	71acc8ddeb	For nz unset in bf16&fp16 (#4495 ) ### What this PR does / why we need it? Disable NZ for the float weight case. This is only a quick fix for the dev branch. For the main branch, we'll consider more cases to make it more general. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen2.5 32B --------- Signed-off-by: 刘哲续 <liuzhexu1@huawei.com> Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>	2025-11-28 17:32:25 +08:00
Icey	378e92a2a2	[Cherry-pick][0.11.0] Adapted to torch_npu.npu_fused_infer_attention_score (#4202 ) ### What this PR does / why we need it? Fixes a compatibility bug with torch_npu.npu_fused_infer_attention_score, which is described in https://github.com/vllm-project/vllm-ascend/issues/4020. @momo609 suggested this solution. cherry-pick: https://github.com/vllm-project/vllm-ascend/pull/4025 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: Icey <1790571317@qq.com>	2025-11-17 10:56:23 +08:00
zhaozx-cn	cd652acb65	[BugFix] Fix kv_no_split not contiguous (#3711 ) allgather needs contiguous data, but the split operation returns non-contiguous data. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zhaozx-cn <zhaozx2116@163.com>	2025-11-13 11:29:37 +08:00
zhaomingyu13	650ce8ad19	[0.11.0][Bugfix] Fix ngram precision issue and open e2e ngram test (#4092 ) ### What this PR does / why we need it? Fix ngram precision issue and open e2e ngram test --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com> Co-authored-by: Icey <1790571317@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-11-11 09:58:03 +08:00
hucong	7ea17fbee3	[0.11.0][BugFix] Improve the performance of prefixcache features (#4021 ) ### What this PR does / why we need it? cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/4022 The code bug caused an empty bubble. When the npu_paged_cache_load operator was called, it forcibly transferred seq_len2 to the device, which triggered synchronization and interrupted the CPU operator's launch stream. --------- Signed-off-by: underfituu <hzhucong@163.com>	2025-11-10 11:51:34 +08:00
wangxiyuan	7ee0b0b5d8	[cherry-pick]Upgrade CANN to 8.3.rc1 (#3945 ) (#3962 ) This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version check logic. TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base image for UT is still 8.2. We'll fix it later. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-06 09:05:08 +08:00
wangxiyuan	8a7154001e	[0.11.0]Cherry pick pta upgrade change (#3940 ) This PR cherry-picks two commits from main to upgrade torch-npu to the 2.7.1 official release --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-31 22:14:26 +08:00
whx	211d4b9da4	[BugFix] Fix mlapo accuracy problem related to weight processing. (#3857 ) This PR fixes a mlapo accuracy problem related to weight processing. Furthermore, it modifies the mlapo-related e2e test with a quantized deepseek model to make it effective. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-30 00:35:50 +08:00
Yizhou	43276fd822	[v0.11.0][Fix] Prevent memory leak in MLA decode graph (#3743 ) (#3774 ) ### What this PR does / why we need it? The cache for MLA decode graph parameters was holding strong references to tensors, preventing them from being garbage collected and leading to increased memory usage. This change wraps the cached tensors in weak references, allowing them to be deallocated when no longer in use and reducing overall memory pressure. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-27 16:00:20 +08:00
ZYang6263	6975d46627	[v0.11.0][Perf] Eliminating the zerolike operator through patch (#3632 ) ### What this PR does / why we need it? There is a zero-like operator before the attention operation in each decoding stage. After analysis, this operator can be eliminated. The purpose of this PR is to remove this operator and improve performance. --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-23 14:49:28 +08:00
Yizhou	984efdc0d0	[v0.11.0][Fix] Fixes attribute error in MLA implementation (#3617 ) ### What this PR does / why we need it? Corrects the attribute access for retrieving the device from `q_a_proj` to `q_proj`. This prevents an `AttributeError` as `q_a_proj` does not exist on the class instance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Need MLAPO tests. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-22 15:49:18 +08:00
linfeng-yuan	4c9af353ee	Revert "[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 )" (#3586 ) ### What this PR does / why we need it? This reverts commit `bf87606932`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with `enable_shared_expert_dp: true` in eager mode as before. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-21 22:24:30 +08:00
Jade Zheng	0c6349610e	[Feature] Reduce host memory usage for attention mask generation (#3048 ) ### What this PR does / why we need it? Previously, the mask construction process created multiple tensors of size (max_model_len, max_model_len). When max_model_len reached 128k, single GPU host memory usage exceeded hundreds of GB, causing process OOM crashes. This update optimizes the mask generation to significantly reduce memory consumption. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI pass. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-10-21 20:19:04 +08:00
whx	220df60c61	[Model][2/N] Remove deepseek_mtp modeling. (#3561 ) This PR is step 2 of deepseek model refactoring and removes deepseek_mtp. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-21 20:17:09 +08:00
Zhu Yi Lin	ffb42a8daa	[BugFix] Fixed the bug that caused the transposematmul operator to report an error due to the shape being too large (#3578 ) ### What this PR does / why we need it? npu_transpose_batchmatmul reports an error when the shape is too large - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: GDzhu1 <809721801@qq.com>	2025-10-21 20:16:54 +08:00
Chen Chen	6b290acfe1	remove redundant params in mla_preprocess kernel (#3530 ) ### What this PR does / why we need it? This pull request removes the redundant parameters `gamma1` and `beta1` (also named `gamma0`/`beta0` in some places) from the `mla_preprocess` kernel and its calling hierarchy. The changes are consistent across C++ kernel code, bindings, and Python call sites. The parameters were unused in the lower-level functions, so their removal is a good cleanup. ### Does this PR introduce _any_ user-facing change? The python interface of the kernel is affected, and the params of `gamma0` and `beta0` are not needed. ### How was this patch tested? The unit-test of the kernel is adapted accordingly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-10-21 19:20:13 +08:00
lilinsiman	70bef33f13	add new accuracy test case for aclgraph (#3390 ) ### What this PR does / why we need it? Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-20 20:04:04 +08:00
ZYang6263	b9e2896eb1	Revert "[Perf] Add FIA interface in FA case" (#3553 ) Reverts vllm-project/vllm-ascend#3321 Reason: output dimension mismatch and accuracy issue - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-20 19:56:10 +08:00
Zhu Yi Lin	34c2996ab8	[main] v_proj combining transpose and matmul (#3545 ) ### What this PR does / why we need it? v_proj combining transpose and matmul ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: GDzhu1 <809721801@qq.com>	2025-10-20 19:53:32 +08:00
whx	f8b52fe950	[Model][1/N] Delete deepseek v2/v3 modeling codes. (#3189 ) This PR deletes model codes of deepseek_v2 and deepseek_v3 to reuse the model file from vLLM. vLLM Ascend now uses custom ops register way instead of model file hard-coding. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-20 15:31:34 +08:00
ZYang6263	1e78ecbad6	[Perf] Add FIA interface in FA case (#3321 ) ### What this PR does / why we need it? Add the new npu_fused_infer_attention_score op to improve performance in the flash attention case. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-19 12:45:33 +08:00
Angazenn	9547d6f0d9	[Core]Append padding logic for Attention (#3256 ) ### What this PR does / why we need it? This PR aims to add padding logic to seq_lens and block_tables when running in the full decode scenario. Before this PR, the number of input tokens with padding might exceed the corresponding seq_lens. For example, when running in the full decode scenario: ``` input_ids : [1, 3, 0, 0] seq_lens: [2, 1] query_start_loc: [0, 1, 2] ``` Here, `input_ids` is padded by 2 tokens while `seq_lens`/`query_start_loc` are not. The mismatch between `input_ids` and `seq_lens`/`query_start_loc` might cause some potential bugs. This PR would change it into : ``` input_ids : [1, 3, 0, 0] seq_lens: [2, 1, 1, 1] query_start_loc: [0, 1, 2, 3, 4] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-10-17 21:56:01 +08:00
anon189Ty	248ee7fa11	[Feat]Make full graph mode compatible with MTP (#3276 ) ### What this PR does / why we need it? Make Full Graph mode work with MTP. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-17 20:19:56 +08:00
zhaozx-cn	bf87606932	[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 ) ### What this PR does / why we need it? shared expert dp for deepseek and deepseek_mtp, could be combined with sp to improve performance. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhaozx-cn <zhaozx2116@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-10-17 15:06:37 +08:00
Angazenn	d9ee491f70	[BugFix]Move to_list in foward_v1 with FIA earlier to build (#3185 ) ### What this PR does / why we need it? The current implementation of FIA introduces a `to_list` operation for actual_seq_lengths_q and seq_lens, which consumes extra time. These operations can be moved earlier into the `build` operation of attention metadata. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-10-17 11:19:41 +08:00
huangdong2022	3a53bbc508	[Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with bias, resolve conflict with weight prefetch (#3465 ) ### What this PR does / why we need it? 1. Qwen3 MoE uses the add_rms_norm_quant op instead of separate add_rms_norm and quant ops in quantized scenarios. 2. Fixes the accuracy of the torch_npu.add_rms_norm_quant op when model weights are quantized with anti_method m4. m4 quantization is an asymmetric outlier suppression method that generates a non-zero norm bias, so the add_rms_norm_quant op is updated to include this parameter in the calculation. 3. Add a torch-npu check. ### Does this PR introduce _any_ user-facing change? The new feature works if torch_npu version >= torch_npu-2.7.1.dev20250919 ### How was this patch tested? 1. No special parameters or new environment variables need to be set. The new feature works if torch_npu version >= torch_npu-2.7.1.dev20250919 2. Tested with Qwen3 MoE quantized models, such as Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4) - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: h30027576 <huangdong51@huawei.com>	2025-10-17 09:30:51 +08:00
realliujiaxu	f69a83b7ba	[Feat] Flash comm allgather ep (#3334 ) Support flash comm v1 (Sequence Parallelism) for Allgather EP. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-10-15 19:36:32 +08:00
Mengqing Cao	8abe517870	[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432 ) ### What this PR does / why we need it? Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch. The final goal is to remove all the patches and align the code arch to vllm, thus we need to do the following work in next prs. TODO: - [x] remove patch on attention spec - [ ] refactor the kvcache creation logic ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? 1. CI passed with existing test. 2. Test pass with deepseek-v3.2-exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-15 17:48:58 +08:00
linfeng-yuan	099255e933	[bugfix] fix pipeline parallel for mla & sfa attention backend (#3459 ) ### What this PR does / why we need it? Fix pipeline parallel break for mla & sfa attention backend caused by a magic number in metadata builder. The error report: `AttributeError: 'PPMissingLayer' object has no attribute 'self_attn'` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR was tested with "mp" backend (PP2TP8 on an A3 node) as well as "ray" backend (PP2TP8 on two A2 nodes). - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-15 17:13:27 +08:00
LeeWenquan	4e720936d8	Fix warning msg print (#3421 ) ### What this PR does / why we need it? Avoid printing some warning msg as below : UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach ... ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-10-15 11:30:30 +08:00
Chen Chen	16cb3cc45d	adapt the mla_v1 with the `mla_preprocess` kernel (#3397 ) ### What this PR does / why we need it? This pull request integrates a new `mla_preprocess` kernel to create an optimized path for MLA (Multi-head Latent Attention) decode operations on Ascend hardware, controlled by an environment flag. The changes include new utility functions for weight transformation, a method to prepare weights for the fused kernel, and logic to route decode-only batches to this new path. My review identified a critical bug in the `transdata` utility function where padding dimensions are swapped, which will lead to incorrect tensor shapes and kernel failures. Additionally, I've pointed out a high-severity maintainability issue in the trans_rope_weight function, which modifies its input in-place, and I have provided a pure-function alternative. ### Does this PR introduce _any_ user-facing change? No user-facing changes by default. Users can enable the `mla_preprocess` kernel in the model by setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO`. ### How was this patch tested? Dedicated Ascend kernels are not covered by our CI yet, so no extra automated tests were added. Future MLA-focused regression runs will cover this path. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-10-15 10:34:25 +08:00
zouyida2052	3642b64afc	bugfix for mtp with multistream_moe (#3419 ) ### What this PR does / why we need it? When running inference on the DeepSeek MTP layer with multistream_moe, we should pass a boolean to evaluate this feature, and fix the bugs that occur in the MTP layer - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-15 08:59:58 +08:00
xuyexiong	02c26dcfc7	[Feat] Supports Aclgraph for bge-m3 (#3171 ) ### What this PR does / why we need it? [Feat] Supports Aclgraph for bge-m3 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` pytest -s tests/e2e/singlecard/test_embedding.py pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py ``` to start an online server with bs 10, each batch's seq length=8192, we set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked: ``` vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}' ``` For bs10, each batch's seq length=8192, QPS is improved from 85 to 104, which is a 22% improvement, lots of host bound is reduced. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com> Co-authored-by: wangyongjun <1104133197@qq.com>	2025-10-14 23:07:45 +08:00
anon189Ty	07e39620ea	[Feat] Unquantized Linear to nz and control all nz-cast (#3356 ) ### What this PR does / why we need it? Currently, when executing to the Linear layer of models in vLLM-Ascend, the weights format is ND in unquantized case and skipped ascend case. This PR supplements the execution logic for Linear layer. We use a new global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and CANN version is 8.3, the weights of the Linear layer will be converted to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as w8a8-quantized case. ### Does this PR introduce _any_ user-facing change? Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ format, you should set VLLM_ASCEND_ENABLE_NZ=1. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-14 17:39:26 +08:00
XiaoxinWang	9eb62935b8	fix pagedattention to support fullgraph. (#3436 ) ### What this PR does / why we need it? Calculate in advance the workspace memory size needed for the PagedAttention operator to avoid deadlocks during resource cleanup. This PR requires torch_npu version 0920 or newer. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-10-14 16:10:09 +08:00
无脸男	ace300a549	[Bugfix] Fix the abnormal NPU memory usage in full graph mode. (#3331 ) ### What this PR does / why we need it? In full graph mode, since the paged attention operator must be updated, the parameters of this operator need to be retained. However, tensors such as the query, key cache, and value cache do not need to be saved persistently, and we can manually release this space with `weak_ref_tensor` to save memory. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: WithHades <244036962@qq.com>	2025-10-11 10:20:10 +08:00
panchao-hub	1756efa5fd	[Feat][Graph]Support FULL_DECODE_ONLY mode for MLA models (#3125 ) ### What this PR does / why we need it? Adds support for capturing the Multi-head Latent Attention (MLA) decode operation into an ACL graph. This improves performance by compiling the attention kernel for single-token decoding. Key changes include: - Implementing the graph capture logic for the MLA kernel, including workspace management and parameter updates. - Modifying the rotary embedding (RoPE) handling to use pre-allocated tensors, which is a requirement for graph capture. - Adding a `build_for_graph_capture` method to the MLA metadata builder to create dummy metadata during the graph compilation phase. Known issues: - Currently, MTP is not supported in FULL_DECODE_ONLY mode; we're working on a fix - We are preparing to remove update_mla_attn_params with auto_dispatch_capture ### Does this PR introduce _any_ user-facing change? compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: panchao-hub <315134829@qq.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-10 16:31:20 +08:00
wangxiyuan	ba19dd3183	Revert PTA upgrade PR (#3352 ) We noticed that torch npu 0919 doesn't work. This PR reverts the related changes that rely on the 0919 version. Revert PR: #3295 #3205 #3102 Related: #3353 - vLLM version: v0.11.0	2025-10-10 14:09:53 +08:00
XiaoxinWang	579b7e5f21	add pagedattention to support FULL_DECODE_ONLY. (#3102 ) ### What this PR does / why we need it? Calculate in advance the workspace memory size needed for the PagedAttention operator to avoid deadlocks during resource cleanup. This PR requires torch_npu version 0920 or newer. ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-10-10 08:50:33 +08:00
Ruri	ff37575936	[1/N][Feat] Add weight prefetch feature for Attention layers (#3146 ) ### What this PR does / why we need it? - Refactor and integrate a unified `WeightPrefetchMethod` - Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators improves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": false, "prefetch_ratio": { "attn": { "qkv": 1.0, "o": 1.0, }, }, }, } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com> Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Co-authored-by: yuzhup <15705211260@163.com>	2025-10-09 20:38:39 +08:00
zouyida2052	81aff9c555	bugfix for mtp (#3300 ) ### What this PR does / why we need it? When mtp>1, we need to refresh cos and sin in each step. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.11.0 Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-09 19:22:46 +08:00
wangxiyuan	81bd6e4c99	Add DeepSeek V3.2 support (#3270 ) ### What this PR does / why we need it? This PR added the initial DeepSeek V3.2 support with [vLLM v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0) (not released yet). We will complete vLLM adaptation as soon as possible. This feature will be ready in the next 1-2 days. Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 . ### Does this PR introduce _any_ user-facing change? Yes! ### How was this patch tested? CI passed and Run deepseek doc soon. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wxsIcey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-30 03:25:58 +08:00
zouyida2052	b72e3327a6	bugfix for mtp>1 (#3174 ) ### What this PR does / why we need it? fix bugs when mtp>1, and reorder input batch when mtp is not accepted. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? by ci - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-09-26 09:04:16 +08:00
wangxiyuan	ac1c2cd9ac	[CI] Upgrade vllm version - 0925 (#3167 ) Upgrade vLLM to newest commit. 1. Remove the useless func get_state_cls, it has been removed from vLLM already. `e6750d0b18` 2. Fix ut broken by `6160ba4151` - vLLM version: v0.10.2 - vLLM main: `b1068903fd` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-25 14:20:10 +08:00
weijinqian0	6aa4253798	[Refactor] [SP]The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085 ) What this PR does / why we need it? There are two sets of SP implementations for MoE and dense models. One is called sequence_parallelism, and the other is flashcomm_v1. We did the following: Merge the two sets of code with the same implementation into one. Remove the implementation of sequence_parallelism, as this solution cannot support aclgraph. Does this PR introduce any user-facing change? No How was this patch tested? e2e&ut - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-09-24 11:29:59 +08:00
lidenghui1110	0f3939e5a9	[Feature]cpu offload connector (#1659 ) This PR implements cpu offload connector to enable NPU kv cache offload to host DRAM. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: lidenghui <lidenghui1110@gmail.com> Signed-off-by: AlvisGong <gwly0401@163.com> Signed-off-by: CalvinXKY <kyxiezju@163.com> Co-authored-by: AlvisGong <gwly0401@163.com>	2025-09-23 14:25:05 +08:00
Yizhou	39a85c49fa	[Refactor] Rename cudagraph_support to aclgraph_support (#3104 ) ### What this PR does / why we need it? Updates the `cudagraph_support` attribute to `aclgraph_support` to use terminology appropriate for the Ascend platform (ACL graphs instead of CUDA graphs). This change also explicitly disables graph support for the MLA attention backend. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-23 11:30:31 +08:00
Yizhou	3fa7cf6345	[Refactor][Graph] Move graph parameter logic to acl_graph module (#3101 ) ### What this PR does / why we need it? This is the follow-up PR of #2128 . Moves graph parameter management components, including `GraphParams`, `get_graph_params`, and `set_graph_params`, from the generic `utils.py` to the more specific `compilation/acl_graph.py`. Additionally, extracts the `update_attn_params` logic from the `NPUModelRunner` class into a standalone function within the `acl_graph` module. This refactoring improves code organization by centralizing ACL graph-related logic into its own dedicated module, enhancing modularity and clarity. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-22 22:23:14 +08:00
Yizhou	338231acaf	[Feat][Graph] Support `FULL_DECODE_ONLY` mode for GQA/MHA models (#2128 ) Note: This depends on [vLLM #25161](https://github.com/vllm-project/vllm/pull/25161) and the torch_npu release from September 30. ### What this PR does / why we need it? This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include: * Reduced dispatch latency: By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays. * Stabilized multi-device performance: Capturing the whole model as one static graph also mitigates the dispatch fluctuations across devices. * Stream/resource savings: Consolidating graph captures frees up streams, allowing more graphs to be captured. Known issues: 1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay; we’re working on a fix. There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups. This is essentially a port of #1503 and #1677, but includes two major changes: 1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo. 2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet. ### Does this PR introduce _any_ user-facing change? ```python compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ``` ### How was this patch tested? Tests included. - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-22 17:14:28 +08:00
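
Note on commit 9547d6f0d9 (#3256): the padding of `seq_lens` and `query_start_loc` described there can be illustrated with a minimal Python sketch. The helper name `pad_decode_metadata` and its signature are hypothetical and are not taken from vllm-ascend. The sketch assumes, as in the commit's own example, that each padded token is treated as a dummy request with a sequence length of 1 and a single query token.

```python
from typing import List, Tuple


def pad_decode_metadata(
    seq_lens: List[int],
    query_start_loc: List[int],
    num_padded_tokens: int,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: extend decode metadata to cover padded tokens.

    Assumption (from the commit's example): every padded token becomes a
    dummy request with seq_len 1 and exactly one query token.
    """
    # The last entry of query_start_loc is the number of real input tokens.
    num_actual_tokens = query_start_loc[-1]
    num_pad = num_padded_tokens - num_actual_tokens
    if num_pad < 0:
        raise ValueError("padded length is smaller than the actual token count")

    # One dummy sequence of length 1 per padded token.
    padded_seq_lens = list(seq_lens) + [1] * num_pad
    # Each dummy request starts one token after the previous one.
    padded_query_start_loc = list(query_start_loc) + [
        num_actual_tokens + i for i in range(1, num_pad + 1)
    ]
    return padded_seq_lens, padded_query_start_loc


# Example from the commit message: input_ids [1, 3, 0, 0] is padded to 4 tokens.
seq_lens, query_start_loc = pad_decode_metadata([2, 1], [0, 1, 2], 4)
print(seq_lens)         # [2, 1, 1, 1]
print(query_start_loc)  # [0, 1, 2, 3, 4]
```

After padding, the metadata describes as many query tokens as `input_ids` holds, which removes the mismatch the commit message mentions.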


155 Commits