xc-llm-ascend

Author	SHA1	Message	Date
linfeng-yuan	e0757dc376	[0.11.0]fix the configuration conflicts in documentation (#4824 ) ### What this PR does / why we need it? Fix configuration errors in our documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? NA. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-09 15:37:06 +08:00
zhangxinyuehfad	033e3557cc	[cherry-pick]fix qwen3vl mrope op (#4484 ) (#4811 ) ### What this PR does / why we need it? The Qwen2.5-VL mrope precision problem will be solved once this PR is merged. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test on G8600 with textVQA dataset - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: shaopeng-666 <lishaopeng21@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-09 11:07:32 +08:00
Levi	9862a23985	【0.11.0-dev】optimization of kimi-k2 in cann8.3 (#4555 ) ### What this PR does / why we need it? In CANN 8.3, the npu_moe_gating_top_k operator supports an expert count of 384, so kimi can use the operator to get better performance. --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-09 08:49:15 +08:00
zhangxinyuehfad	0d094531b4	[bugfix] Fixed the bug in retrieving the quantization method for mlp.… (#4797 ) When retrieving the quantization method for MOE (e.g., the quantization file of DeepSeek v3.2 exp does not match the model's naming convention in eager mode), a KeyError is raised: "model.layers.3.mlp.experts.weight not in self.quant_description". However, the quantization file looks like this: ```bash "model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.gate_proj.weight_scale": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.gate_proj.weight_offset": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.down_proj.weight_scale": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.down_proj.weight_offset": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.up_proj.weight_scale": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.up_proj.weight_offset": "W8A8_DYNAMIC", ``` Co-Authored-By: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-12-09 08:47:19 +08:00
Levi	4e728f1f40	[Bugfix] fix qwen3-vl-moe shape ERROR during the _prepare_inputs phase under high concurrency. (#4658 ) ### What this PR does / why we need it? Earlier we fixed a similar issue for qwen2.5-vl 【 https://github.com/vllm-project/vllm-ascend/issues/4430 】, and then the multimodal models in vllm v0.11.0 should all have this problem. Here, we have specifically proposed a fix for qwen3-vl-moe. --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-08 19:30:16 +08:00
Wang Yixuan	d412565ec9	[Cherry-pick]bmm_transpose to v011dev (#3995 ) ### What this PR does / why we need it? Add a custom op to accelerate the deepseek model. The fused op combines bmm and transpose and is applied to the mla module. Cherry-picked from commit `c68ddc11ce` ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-08 19:22:14 +08:00
Angazenn	6391f0625f	[v0.11.0-dev][bugfix] Add branch for stream up-lifting in `update_attn_params` (#4437 ) ### What this PR does / why we need it? #3985 moved stream context initialization before the for-loops to improve performance. However, we found that this might cause an accuracy drop when used with pd disaggregation. Thus we partly revert this change when using pd disaggregation, and we shall fix this bug in the future. ### Does this PR introduce _any_ user-facing change? No. --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-08 08:54:46 +08:00
Li Wang	2598124e67	[Image] Correcting the vllm tag of the openeuler image on the A2 device. (#4745 ) ### What this PR does / why we need it? Corrected the vllm tag, which should have been in v0.11.0 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-06 10:55:22 +08:00
offline893	350999c4ef	[Bugfix]Fix eplb enable when using mtp float weights. (#4576 ) ### What this PR does / why we need it? Fix eplb enable when using mtp float weights. It will be removed once eplb supports mtp and float weights. ### How was this patch tested? Deepseek-V3 + MTP + EPLB in A3. --------- Signed-off-by: offline0806 <3337230449@qq.com> Signed-off-by: offline893 <158537145+offline893@users.noreply.github.com> Co-authored-by: offline0806 <3337230449@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-05 21:15:32 +08:00
1092626063	c4a11a745a	[refactor]support gatingtopk operator generalization (#4356 ) ### What this PR does / why we need it? This PR is cherry-picked from: https://github.com/vllm-project/vllm-ascend/pull/2958 and https://github.com/vllm-project/vllm-ascend/pull/4340 Past: npu_moe_gating_top_k could only support the 'group_count=256' pattern Now: 1. npu_moe_gating_top_k supports all sizes of group_count 2. the functionality of `torch_npu.npu_moe_gating_top_k_softmax` is included in `torch_npu.npu_moe_gating_top_k` CANN: depends on 8.3.RC1 Performance: 1. GLM4.5-w8a8, TPS improves by 6% 2. Qwen3, the same as before --------- Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-04 20:10:13 +08:00
LI SHENGYONG	593a96056c	【EPLB】Eplb Redundant Experts Bugfix (#4232 ) ### What this PR does / why we need it? Redundant experts bugfix The calculation logic for redundant experts has been fixed, allowing the correct number of redundant experts to be calculated using the map. Therefore, there is no longer a need to set the redundant expert parameter when passing the map. ### Does this PR introduce _any_ user-facing change? After configuring the path for experts_map, users do not need to configure iinit_redundancy_expert. ### How was this patch tested? The accuracy of EPLB was tested with and without the use of redundant experts. --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-03 12:00:05 +08:00
Mengqing Cao	b6d63bbd52	[v0.11.0-dev][CI] Fix ngram lacking of input arg `dummy_compute_logits` error (#4648 ) ### What this PR does / why we need it? Fix ngram lacking of input arg `dummy_compute_logits` error ### How was this patch tested? CI passed with existing test. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-03 09:22:07 +08:00
Levi	865f1f7fc8	[Bugfix] Resolve the interface compatibility issue of get_input_embeddings in MM (#4638 ) ### What this PR does / why we need it? Resolve the interface compatibility issue of get_input_embeddings in MM, because the get_input_embeddings func of other models does not have the is_multimodal parameter --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-02 22:21:47 +08:00
Levi	3b4cb23616	[Bugfix] fix qwen2.5-vl-72b shape ERROR during the _prepare_inputs phase under high concurrency. (#4553 ) ### What this PR does / why we need it? qwen2.5-vl-72b reports a shape ERROR during the _prepare_inputs phase under high concurrency【 issue https://github.com/vllm-project/vllm-ascend/issues/4430 】 This PR fixes it. The related PR in main branch: https://github.com/vllm-project/vllm-ascend/pull/3612 The related commit in vllm : `17c540a993/vllm/model_executor/models/interfaces.py` 【The _get_text_embeddings function has been refactored into interfaces.py in vllm.】 Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-02 14:20:45 +08:00
Zetong Li	52abd47f8c	[Bugfix][SHM] Use writer lock by default and remove redundant env (#4117 ) ### What this PR does / why we need it? This PR aims to remove env introduced by #3988 and use lock by default. As described in https://github.com/vllm-project/vllm/issues/27858, we have tested the writer lock method in various scenarios and the performance is almost unaffected. Therefore, we believe that it would be safe to enable the lock by default and remove the redundant env `SHM_BARRIER` now. After discussion, we decide to preserve env and set it as true by default. ### Does this PR introduce _any_ user-facing change? `SHM_BARRIER` is set as true by default. ### How was this patch tested? by ci --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-12-01 22:27:01 +08:00
Li Wang	76d0ba4342	[Image][Build] Cherry pick #4062 from main (#4506 ) ### What this PR does / why we need it? This patch aims to integrate the mooncake [v0.3.7.2.post2](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2) to vllm-ascend images Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-01 11:39:40 +08:00
zouyida2052	2b4f7a5016	[cherry-pick pr-4254] bugfix for mtp>1 when lm_head_tp>1 (#4360 ) ### What this PR does / why we need it? Previously, the dummy run executed compute_logits only once, regardless of num_speculative_tokens. This caused execute_model to hang on compute_logits when lm head tensor parallelism exceeded 1. The fix ensures compute_logits executes correctly during dummy run, matching num_speculative_tokens. Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-12-01 11:11:15 +08:00
LI SHENGYONG	cd9f5c0611	[bugfix] dep ineffective (#4416 ) ### What this PR does / why we need it? The expert mapping table and weights of the dynamic EPLB were not updated, so accuracy stayed correct but EPLB had no effect. This bug has now been fixed. Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-11-29 15:19:11 +08:00
henryxuxu0716	71acc8ddeb	For nz unset in bf16&fp16 (#4495 ) ### What this PR does / why we need it? Disable NZ for the float weight case. This is only a quick fix for the dev branch. For the main branch, we'll consider more cases to make it more general. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen2.5 32B <img width="441" height="221" alt="image" src="https://github.com/user-attachments/assets/7ae18ffd-1ce2-43d9-9960-be45250ad0da" /> --------- Signed-off-by: 刘哲续 <liuzhexu1@huawei.com> Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>	2025-11-28 17:32:25 +08:00
Zhu Yi Lin	96c362361e	[0.11.0][TEST] Delete Comment (#4428 ) ### What this PR does / why we need it? delete chinese comment pick from https://github.com/vllm-project/vllm-ascend/pull/4427 ### Does this PR introduce _any_ user-facing change? no Signed-off-by: GDzhu01 <809721801@qq.com>	2025-11-25 21:39:36 +08:00
zhangxinyuehfad	a686f2962a	[0.11.0][Bugfix] fix e2e full test (#4424 ) ### What this PR does / why we need it? pin Transformer version to 4.57.1 fix 'dict' object has no attribute 'model_type' https://github.com/vllm-project/vllm-ascend/actions/runs/19660859460/job/56306822464 picked from https://github.com/vllm-project/vllm-ascend/pull/4423 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-25 21:21:42 +08:00
Shanshan Shen	cdaf7f4a51	[MM][Bugfix] Minor fix for VL model verification (#4385 ) ### What this PR does / why we need it? To fix the ops test, where `model_config` has been set to `None` and doesn't have the `hf_config` attribute, we have added a check to guarantee that `model_config` is not `None`. cherry-pick from main: https://github.com/vllm-project/vllm-ascend/pull/4384. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-25 20:36:32 +08:00
wujinyuan1	386a85eccc	[Bugfix]Fix the hang issue of multimodal model when running with DP>1 (#4393 ) ### What this PR does / why we need it? When cudagraph_mode is set to FULL_DECODE_ONLY, if dp > 1, the dummy-run process will be triggered. When calling the update_attn_params function, the num_tokens parameter needs to be passed, and this value is obtained through positions.shape[0]. However, the multimodal model uses mRope (multi-dimensional rotary positional embeddings), which causes positions to be 2-dimensional. As a result, the value obtained from positions.shape[0] is incorrect. We solve this problem by replacing positions.shape[0] with num_tokens. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: v0.11.0rc3 vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2025-11-25 09:32:22 +08:00
weichen	a3164ac372	[v0.11.0][Bugfix][MoE] enable force_load_balance in aclgraph (#4367 ) ### What this PR does / why we need it? Enable force_load_balance in aclgraph, solving OOM issues. pick from https://github.com/vllm-project/vllm-ascend/pull/4366 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-11-25 09:16:57 +08:00
mazhixin000	75452abe1e	[Doc][v11.0-dev][cherry-pick]Add single node PD disaggregation instructions (#4370 ) ### What this PR does / why we need it? add single node PD disaggregation instructions for Qwen 2.5VL model. ### Does this PR introduce _any_ user-facing change? no --------- Signed-off-by: mazhixin <mazhixin7@huawei.com> Signed-off-by: mazhixin000 <mazhixinkorea@163.com> Co-authored-by: mazhixin <mazhixin7@huawei.com>	2025-11-24 17:23:11 +08:00
wangxiyuan	a2e4c3fe78	Revert "[cherry-pick][refactor]support gatingtopk operator generalization (#4050 )" (#4352 ) This reverts commit `c87a77e8b4`. it breaks ops e2e test Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-21 23:03:20 +08:00
SILONG ZENG	5ad0ccdc31	[v0.11.0]Upgrade cann to 8.3.rc2 (#4332 ) ### What this PR does / why we need it? Upgrade CANN to 8.3.rc2 Signed-off-by: MrZ20 <2609716663@qq.com>	2025-11-21 22:48:57 +08:00
LI SHENGYONG	0f9025cceb	[EPLB] Eplb Verify Fix (#4334 ) ### What this PR does / why we need it? Eplb Verify Fix --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Signed-off-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-21 18:18:15 +08:00
Ting FU	97ffb9120f	[CI] Compile vllm with multimodal audio feature by default in dockerfile (#4324 ) (#4341 ) ### What this PR does / why we need it? For better usability, add multimodal audio to the vllm compilation in the dockerfile by default. Image size will increase by only 2.xM. Signed-off-by: Ting FU <futing10@huawei.com>	2025-11-21 17:53:00 +08:00
Li Wang	218bc70f6f	[CI] Remove redundant workflows (#4335 ) ### What this PR does / why we need it? Remove redundant workflows and maintain a single workflow, set up on the main branch, that controls the execution of each branch instead of running each branch simultaneously, thus reducing resource waste. Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-21 16:48:35 +08:00
Shanshan Shen	70f076331f	[MM][Bugfix] Add error log for VL models when enabling FLASHCOMM (#4222 ) ### What this PR does / why we need it? Add error log for VL models when enabling `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` or `VLLM_ASCEND_ENABLE_FLASHCOMM=1` (for backward compatibility). This is a temporary fix for https://github.com/vllm-project/vllm-ascend/issues/4132. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-21 15:04:35 +08:00
LI SHENGYONG	c94b38c82e	[Readme] EPLB Support Scenarios (#4315 ) ### What this PR does / why we need it? Add information on the scope of EPLB support. --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-11-21 14:25:39 +08:00
Angazenn	9c6d0b422c	[v0.11.0-dev][misc]change default capture size for Qwen3-MoE when using full dp (#4205 ) ### What this PR does / why we need it? This is the dev version of #4199 . Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]`. However, this is not always the best choice in different situations. This PR aims to change the default setting when running Qwen3-MoE in the full dp (`dp_size > 1` && `tp_size == 1`) setting, which is usually applied in Large-Scale EP. old : `[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]` new: `[1, 2, 5 ,10 ,15, 16 ,24 ,... , max_capture_size]` This is mainly because the performance of the `_npu_paged_attention` op degrades dramatically with the old settings. We hope to provide better performance if users do not set a specific `cudagraph_capture_size`. ### Does this PR introduce _any_ user-facing change? The default `cudagraph_capture_size` is modified in the above cases. However, if `cudagraph_capture_size` has already been set by users, this PR has no effect. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-11-21 11:19:11 +08:00
shaopeng-666	b6d59bdea2	cherry pick from pr 4270 (#4285 ) ### What this PR does / why we need it? avoid mrope fusion op when running qwen25vl on x86 machine --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-11-19 22:32:02 +08:00
MengLong Chen	277670730c	[Bugfix][Aclgraph] failed to update graph task (#4282 ) ### What this PR does / why we need it? bugfix the error of full graph aclgraph Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2025-11-19 21:30:48 +08:00
1092626063	c87a77e8b4	[cherry-pick][refactor]support gatingtopk operator generalization (#4050 ) ### What this PR does / why we need it? Picked from: https://github.com/vllm-project/vllm-ascend/pull/2958 Past: npu_moe_gating_top_k could only support the 'group_count=256' pattern Now: 1. npu_moe_gating_top_k supports all sizes of group_count 2. the functionality of `torch_npu.npu_moe_gating_top_k_softmax` is included in `torch_npu.npu_moe_gating_top_k` CANN: depends on 8.3.RC1 Performance: 1. GLM4.5-w8a8, TPS improves by 6% 2. Qwen3, the same as before Signed-off-by: 1092626063 <1092626063@qq.com>	2025-11-19 10:39:28 +08:00
liziyu	ddf3e75800	[Cherry-pick] [0.11.0] pd proxy support ipv6 and fix proxy (#4242 ) ### What this PR does / why we need it? pd proxy support ipv6, mooncake connector check whether the IPv6 address is used and notify the user. --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-11-18 16:33:00 +08:00
Icey	378e92a2a2	[Cherry-pick][0.11.0] Adapted to torch_npu.npu_fused_infer_attention_score (#4202 ) ### What this PR does / why we need it? Fixes a compatibility bug with torch_npu.npu_fused_infer_attention_score, which is described in https://github.com/vllm-project/vllm-ascend/issues/4020. @momo609 suggested this solution. cherry-pick: https://github.com/vllm-project/vllm-ascend/pull/4025 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: Icey <1790571317@qq.com>	2025-11-17 10:56:23 +08:00
zhangyiming	a7eb42cf0a	[v0.11.0-dev][Bugfix][cherry-pick]bugfix for weight load of kimi-k2 (#4190 ) ### What this PR does / why we need it? This is cherry-pick from #3798 Fix kimi-k2 start bug, weight load ERROR：https://github.com/vllm-project/vllm-ascend/issues/3785 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: Levi <54832289+Levi-JQ@users.noreply.github.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-11-14 15:43:22 +08:00
weichen	51e5806d76	[0.11.0-dev][Bugfix][EPLB] Quick fix for missing log2phy conversion (#4150 ) ### What this PR does / why we need it? Quick fix for missing log2phy conversion in MC2 token_dispatcher, which has been already fixed in main branch https://github.com/vllm-project/vllm-ascend/pull/3512. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-11-13 14:32:40 +08:00
zhaozx-cn	cd652acb65	[BugFix] Fix kv_no_split not contiguous (#3711 ) allgather needs contiguous data, but the split operation returns non-contiguous data. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zhaozx-cn <zhaozx2116@163.com>	2025-11-13 11:29:37 +08:00
Angazenn	28a15299ea	[cherry-pick][v0.11.0-dev][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4099 ) ### What this PR does / why we need it? This is cherry-picked from #4097 . Currently, we set `seq_lens` in dummy attn_metadata to `max_model_len` to get the max workspace for attention during capturing. However, always setting it to `max_model_len` causes dummy_run to execute a long attention when running actual inference. For example, if there is a single req with `seqs_lens` as [8] but `max_model_len` is 131072, the whole process will be slowed down by dummy_run as it executes a fake long-seq attention. Therefore, we instead set it to max_query_len, which is also consistent with the vLLM gpu implementation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-11-12 20:32:50 +08:00
zhangxinyuehfad	7732a89fd9	[v0.11.0][UT][Fixbug] Fix UT test (#4151 ) ### What this PR does / why we need it? Fix UT test Backport: https://github.com/vllm-project/vllm-ascend/pull/4116 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-12 16:55:18 +08:00
zhaomingyu13	650ce8ad19	[0.11.0][Bugfix] Fix ngram precision issue and open e2e ngram test (#4092 ) ### What this PR does / why we need it? Fix ngram precision issue and open e2e ngram test --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com> Co-authored-by: Icey <1790571317@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-11-11 09:58:03 +08:00
Angazenn	2069bef449	[v0.11.0-dev][bugfix] Fix a bug in wrongly set npu_stream (#4106 ) ### What this PR does / why we need it? This pr fixes a bug introduced in #3985, which set wrong npu_stream (possibly by mistakes in cherry-pick). I correct it and make `update_attn_params` consistent to main branch. ### Does this PR introduce _any_ user-facing change? No. Signed-off-by: Angazenn <supperccell@163.com>	2025-11-11 09:16:41 +08:00
Icey	c5fe179cef	[0.11.0] [Cherry-pick #4058 ] Fixes Qwen3-Next enable nz accuracy problem (#4056 ) ### What this PR does / why we need it? - Fixes Qwen3-Next enable nz accuracy problem --------- Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com>	2025-11-10 20:56:39 +08:00
rjg-lyh	ebd45b6596	[V0.11.0][Core] Restore scheduling logic under default configuration (#4094 ) ### What this PR does / why we need it? Cherry-pick #3967 from main branch. This PR reverts the changes introduced in PR #2894 Initially, due to performance issues with the older version of the chunked prefill ops, the default behavior was to use the Ascend scheduler to disable the chunked prefill feature. However, with the improvements in the performance of the new chunked prefill ops, this interception strategy has been removed. This change also aligns with the community's default configuration behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-11-10 20:02:23 +08:00
XiaoxinWang	c3c9138719	[Perf] Move attention update stream out of loop to optimize performance (#3985 ) ### What this PR does / why we need it? In the `update_*attn_params` functions, the `torch.npu.stream(update_stream)` context manager was previously located inside the for-loop that updates parameters for each layer. This resulted in redundant stream initiations for every layer, adding unnecessary overhead. This commit refactors the code by moving the stream context manager to wrap the entire for-loop. This ensures that the update stream is initiated only once per function call, rather than for each layer. This change reduces 90us in each decode model. update stream in every layer: <img width="1720" height="383" alt="image" src="https://github.com/user-attachments/assets/70e4cb69-5bc1-4180-a67d-c99132134be6" /> remove update stream in every layer: <img width="1269" height="175" alt="image" src="https://github.com/user-attachments/assets/0e290edb-b0ce-48fe-b032-1b924ade6ae5" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-10 17:18:45 +08:00
zhangxinyuehfad	d913f9474b	[0.11.0][Fix] Fix Qwen2-Audio-7B-Instruct accuracy test (#4018 ) ### What this PR does / why we need it? Fix Qwen2-Audio-7B-Instruct accuracy test Backport:https://github.com/vllm-project/vllm-ascend/pull/4017 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-10 11:54:30 +08:00
hucong	7ea17fbee3	[0.11.0][BugFix] Improve the performance of prefixcache features (#4021 ) ### What this PR does / why we need it? cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/4022 The code bug caused an empty bubble. When the npu_paged_cache_load operator was called, it forcibly transferred seq_len2 to the device, which triggered synchronization and interrupted the CPU operator's launch stream. --------- Signed-off-by: underfituu <hzhucong@163.com>	2025-11-10 11:51:34 +08:00
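
The refactor described in #3985 above (entering the update stream context once around the loop instead of once per layer) can be illustrated with a minimal, hardware-free sketch. Everything here is an assumption for illustration: `fake_stream_context` is a hypothetical stand-in for `torch.npu.stream(update_stream)` that only counts how often it is entered, and `update_per_layer` / `update_hoisted` are not functions from vllm-ascend.

```python
from contextlib import contextmanager

# Hypothetical stand-in for torch.npu.stream(update_stream):
# it does no device work and only counts how many times it is entered.
entries = {"count": 0}


@contextmanager
def fake_stream_context():
    entries["count"] += 1
    yield


def update_per_layer(layers):
    """Old pattern: the context is entered once for every layer."""
    updated = []
    for layer in layers:
        with fake_stream_context():
            updated.append(f"{layer}:updated")
    return updated


def update_hoisted(layers):
    """New pattern: the context wraps the whole loop and is entered once."""
    updated = []
    with fake_stream_context():
        for layer in layers:
            updated.append(f"{layer}:updated")
    return updated


def count_entries(fn, layers):
    """Run fn and report (number of context entries, result)."""
    entries["count"] = 0
    result = fn(layers)
    return entries["count"], result
```

Both variants produce the same per-layer updates; only the number of context entries differs, which is the overhead the commit message refers to. As #4437 notes, the hoisted form was partly reverted for pd disaggregation, so the sketch says nothing about correctness on real hardware.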

1251 Commits