Commit Graph

1230 Commits

Author SHA1 Message Date
Shanshan Shen
cdaf7f4a51 [MM][Bugfix] Minor fix for VL model verification (#4385)
### What this PR does / why we need it?

To fix the ops test: `model_config` was set to `None` and therefore has no
`hf_config` attribute, so we add a check to guarantee `model_config` is not
`None`.

cherry-pick from main:
https://github.com/vllm-project/vllm-ascend/pull/4384.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-25 20:36:32 +08:00
wujinyuan1
386a85eccc [Bugfix]Fix the hang issue of multimodal model when running with DP>1 (#4393)
### What this PR does / why we need it?
When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run
process is triggered. Calling the update_attn_params function requires the
num_tokens parameter, which was obtained from positions.shape[0]. However,
multimodal models use mRoPE (multi-dimensional rotary position embeddings),
which makes positions a 2-D tensor, so positions.shape[0] no longer equals
the token count. We solve this by passing num_tokens directly instead of
positions.shape[0].
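The shape mismatch can be sketched as follows (illustrative tensors, not the actual model-runner code):

```python
import torch

num_tokens = 8

# Text-only models: positions is 1-D, one entry per token.
positions = torch.arange(num_tokens)
assert positions.shape[0] == num_tokens  # token count recovered correctly

# mRoPE (Qwen-VL-style) builds a 2-D positions tensor, with one row per
# rotary section, so shape[0] is the number of sections, not the token count.
mrope_positions = torch.zeros(3, num_tokens, dtype=torch.long)
assert mrope_positions.shape[0] != num_tokens
assert mrope_positions.shape[1] == num_tokens  # token count lives in dim 1

# The PR's fix: pass num_tokens explicitly instead of positions.shape[0].
```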

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
2025-11-25 09:32:22 +08:00
weichen
a3164ac372 [v0.11.0][Bugfix][MoE] enable force_load_balance in aclgraph (#4367)
### What this PR does / why we need it?
Enable force_load_balance in aclgraph, solving OOM issues.
pick from https://github.com/vllm-project/vllm-ascend/pull/4366
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-11-25 09:16:57 +08:00
mazhixin000
75452abe1e [Doc][v11.0-dev][cherry-pick]Add single node PD disaggregation instructions (#4370)
### What this PR does / why we need it?

Add single-node PD disaggregation instructions for the Qwen2.5-VL model.


### Does this PR introduce _any_ user-facing change?
no


---------

Signed-off-by: mazhixin <mazhixin7@huawei.com>
Signed-off-by: mazhixin000 <mazhixinkorea@163.com>
Co-authored-by: mazhixin <mazhixin7@huawei.com>
2025-11-24 17:23:11 +08:00
wangxiyuan
a2e4c3fe78 Revert "[cherry-pick][refactor]support gatingtopk operator generalization (#4050)" (#4352)
This reverts commit c87a77e8b4.

It breaks the ops e2e test.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-21 23:03:20 +08:00
SILONG ZENG
5ad0ccdc31 [v0.11.0]Upgrade cann to 8.3.rc2 (#4332)
### What this PR does / why we need it?
Upgrade CANN to 8.3.rc2

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-11-21 22:48:57 +08:00
LI SHENGYONG
0f9025cceb [EPLB] Eplb Verify Fix (#4334)
### What this PR does / why we need it?
Eplb Verify Fix
---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-21 18:18:15 +08:00
Ting FU
97ffb9120f [CI] Defaultly compile vllm with multimodal audio feature in dockerfile (#4324) (#4341)
### What this PR does / why we need it?
For better usability, compile vLLM with the multimodal audio feature
enabled by default in the Dockerfile.

The image size increases by only about 2 MB.

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-21 17:53:00 +08:00
Li Wang
218bc70f6f [CI] Remove redundant workflows (#4335)
### What this PR does / why we need it?
Remove redundant workflows: maintain a single workflow, set up on the main
branch, that controls execution for each branch instead of running each
branch's workflow simultaneously, thereby reducing resource waste.


Signed-off-by: wangli <wangli858794774@gmail.com>
2025-11-21 16:48:35 +08:00
Shanshan Shen
70f076331f [MM][Bugfix] Add error log for VL models when enabling FLASHCOMM (#4222)
### What this PR does / why we need it?

Add error log for VL models when enabling
`VLLM_ASCEND_ENABLE_FLASHCOMM1=1` or `VLLM_ASCEND_ENABLE_FLASHCOMM=1`
(for backward compatibility).

This is a temporary fix for
https://github.com/vllm-project/vllm-ascend/issues/4132.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-21 15:04:35 +08:00
LI SHENGYONG
c94b38c82e [Readme] EPLB Support Scenarios (#4315)
### What this PR does / why we need it?
Add information on the scope of EPLB support.

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:25:39 +08:00
Angazenn
9c6d0b422c [v0.11.0-dev][misc]change default capture size for Qwen3-MoE when using full dp (#4205)
### What this PR does / why we need it?
This is the dev-branch version of #4199.
Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4, 8,
16, 24, ..., max_capture_size]`. However, this is not always the best
choice in every situation. This PR changes the default when running
Qwen3-MoE in a full-DP setting (`dp_size > 1` && `tp_size == 1`), which is
usually applied in large-scale EP.
old:
`[1, 2, 4, 8, 16, 24, ..., max_capture_size]`
new:
`[1, 2, 5, 10, 15, 16, 24, ..., max_capture_size]`
This is mainly because the performance of the `_npu_paged_attention` op
degrades dramatically with the old settings. We hope to provide better
performance when users do not set a specific `cudagraph_capture_size`.
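The two schedules can be sketched as follows (a reconstruction from the lists above; the actual vllm-ascend code may differ in detail):

```python
def default_capture_sizes(max_capture_size, full_dp=False):
    """Illustrative reconstruction of the capture-size schedules described
    above; not the real vllm-ascend implementation."""
    # The head of the schedule differs between the old default and the
    # new full-DP Qwen3-MoE default.
    head = [1, 2, 5, 10, 15] if full_dp else [1, 2, 4, 8]
    sizes = [s for s in head if s < 16]
    # Tail: 16, 24, 32, ... up to max_capture_size, in steps of 8.
    sizes += list(range(16, max_capture_size + 1, 8))
    return sizes

assert default_capture_sizes(32) == [1, 2, 4, 8, 16, 24, 32]
assert default_capture_sizes(32, full_dp=True) == [1, 2, 5, 10, 15, 16, 24, 32]
```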
### Does this PR introduce _any_ user-facing change?
The default `cudagraph_capture_size` is modified in above cases.
However, if `cudagraph_capture_size` has already set by users, this PR
won't have any influence on this.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-21 11:19:11 +08:00
shaopeng-666
b6d59bdea2 cherry pick from pr 4270 (#4285)
### What this PR does / why we need it?
Avoid the mrope fusion op when running Qwen2.5-VL on x86 machines.

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-11-19 22:32:02 +08:00
MengLong Chen
277670730c [Bugfix][Aclgraph] failed to update graph task (#4282)
### What this PR does / why we need it?
Fix an error in full-graph aclgraph mode.


Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-11-19 21:30:48 +08:00
1092626063
c87a77e8b4 [cherry-pick][refactor]support gatingtopk operator generalization (#4050)
### What this PR does / why we need it?
pick from : https://github.com/vllm-project/vllm-ascend/pull/2958
Past:
npu_moe_gating_top_k could only support the `group_count=256` pattern.

Now:
1. npu_moe_gating_top_k supports all sizes of group_count.
2. The functionality of `torch_npu.npu_moe_gating_top_k_softmax` is
included in `torch_npu.npu_moe_gating_top_k`.

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8: TPS improves by 6%.
2. Qwen3: the same as before.


Signed-off-by: 1092626063 <1092626063@qq.com>
2025-11-19 10:39:28 +08:00
liziyu
ddf3e75800 [Cherry-pick] [0.11.0] pd proxy support ipv6 and fix proxy (#4242)
### What this PR does / why we need it?
The PD proxy now supports IPv6; the Mooncake connector checks whether an
IPv6 address is used and notifies the user.

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-18 16:33:00 +08:00
Icey
378e92a2a2 [Cherry-pick][0.11.0] Adapted to torch_npu.npu_fused_infer_attention_score (#4202)
### What this PR does / why we need it?
Fixes a compatibility bug with torch_npu.npu_fused_infer_attention_score,
which is described in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 suggested this solution.
cherry-pick: https://github.com/vllm-project/vllm-ascend/pull/4025

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: Icey <1790571317@qq.com>
2025-11-17 10:56:23 +08:00
zhangyiming
a7eb42cf0a [v0.11.0-dev][Bugfix][cherry-pick]bugfix for weight load of kimi-k2 (#4190)
### What this PR does / why we need it?
This is cherry-pick from #3798 

Fix kimi-k2 start bug, weight load
ERROR:https://github.com/vllm-project/vllm-ascend/issues/3785

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: Levi <54832289+Levi-JQ@users.noreply.github.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: zhaozx-cn <zhaozx2116@163.com>
2025-11-14 15:43:22 +08:00
weichen
51e5806d76 [0.11.0-dev][Bugfix][EPLB] Quick fix for missing log2phy conversion (#4150)
### What this PR does / why we need it?
Quick fix for the missing log2phy conversion in the MC2 token_dispatcher,
which has already been fixed in the main branch:
https://github.com/vllm-project/vllm-ascend/pull/3512.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-11-13 14:32:40 +08:00
zhaozx-cn
cd652acb65 [BugFix] Fix kv_no_split not contiguous (#3711)
allgather requires contiguous data, but the split operation returns
non-contiguous views.
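A minimal torch sketch of the issue (illustrative shapes, not the actual kv-cache code):

```python
import torch

x = torch.arange(12).reshape(3, 4)

# Splitting along the last dim returns views that are not contiguous in
# memory: each chunk skips over the other chunk's columns.
a, b = x.split(2, dim=-1)
assert not a.is_contiguous()

# Collectives such as all_gather require contiguous input, so the fix is
# to call .contiguous() on the split result before handing it over.
a = a.contiguous()
assert a.is_contiguous()
```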

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zhaozx-cn <zhaozx2116@163.com>
2025-11-13 11:29:37 +08:00
Angazenn
28a15299ea [cherry-pick][v0.11.0-dev][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4099)
### What this PR does / why we need it?
This is cherry-picked from #4097.
Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len`
to get the maximum workspace for attention during capture. However,
always setting it to `max_model_len` causes dummy_run to execute a long
attention during actual inference. For example, if there is a single
request with `seq_lens` of [8] but `max_model_len` is 131072, the whole
process is slowed down by dummy_run executing a fake long-sequence
attention. Therefore, we instead set it to max_query_len, which is also
consistent with the vLLM GPU implementation.
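The change can be sketched with a toy helper (names are hypothetical, not the actual vllm-ascend code):

```python
def build_dummy_seq_lens(num_reqs, max_query_len, max_model_len, use_fix=True):
    # Before the fix: every dummy request claimed a max_model_len-long
    # sequence, so dummy_run executed a fake long-sequence attention.
    # After the fix: dummy requests claim only max_query_len, matching
    # vLLM's GPU implementation.
    fill = max_query_len if use_fix else max_model_len
    return [fill] * num_reqs

assert build_dummy_seq_lens(4, 8, 131072, use_fix=False) == [131072] * 4
assert build_dummy_seq_lens(4, 8, 131072) == [8] * 4
```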

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-12 20:32:50 +08:00
zhangxinyuehfad
7732a89fd9 [v0.11.0][UT][Fixbug] Fix UT test (#4151)
### What this PR does / why we need it?
Fix UT test
Backport: https://github.com/vllm-project/vllm-ascend/pull/4116

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-12 16:55:18 +08:00
zhaomingyu13
650ce8ad19 [0.11.0][Bugfix] Fix ngram precision issue and open e2e ngram test (#4092)
### What this PR does / why we need it?
Fix ngram precision issue and open e2e ngram test
---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-11-11 09:58:03 +08:00
Angazenn
2069bef449 [v0.11.0-dev][bugfix] Fix a bug in wrongly set npu_stream (#4106)
### What this PR does / why we need it?
This PR fixes a bug introduced in #3985, which set the wrong npu_stream
(possibly a cherry-pick mistake). This corrects it and makes
`update_attn_params` consistent with the main branch.

### Does this PR introduce _any_ user-facing change?
No.

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-11 09:16:41 +08:00
Icey
c5fe179cef [0.11.0] [Cherry-pick #4058] Fixes Qwen3-Next enable nz accuracy problem (#4056)
### What this PR does / why we need it?
- Fixes Qwen3-Next enable nz accuracy problem

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
2025-11-10 20:56:39 +08:00
rjg-lyh
ebd45b6596 [V0.11.0][Core] Restore scheduling logic under default configuration (#4094)
### What this PR does / why we need it?
Cherry-pick #3967 from the main branch. This PR reverts the changes
introduced in PR #2894. Initially, due to performance issues with the
older chunked-prefill ops, the default behavior was to use the Ascend
scheduler to disable chunked prefill. However, with the performance
improvements in the new chunked-prefill ops, this interception strategy
has been removed. This change also aligns with the community's default
configuration behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-11-10 20:02:23 +08:00
XiaoxinWang
c3c9138719 [Perf] Move attention update stream out of loop to optimize performance (#3985)
### What this PR does / why we need it?
In the `update_*attn_params` functions, the
`torch.npu.stream(update_stream)` context manager was previously located
inside the for-loop that updates parameters for each layer. This
resulted in redundant stream initiations for every layer, adding
unnecessary overhead.

This commit refactors the code by moving the stream context manager to
wrap the entire for-loop, so the update stream is initiated only once per
function call rather than once per layer. This change saves roughly 90 µs
per decode.
update stream in every layer:
<img width="1720" height="383" alt="image"
src="https://github.com/user-attachments/assets/70e4cb69-5bc1-4180-a67d-c99132134be6"
/>

remove update stream in every layer:
<img width="1269" height="175" alt="image"
src="https://github.com/user-attachments/assets/0e290edb-b0ce-48fe-b032-1b924ade6ae5"
/>
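The refactor's effect on how often the stream context is entered can be sketched framework-free (the counting context manager is a stand-in for the real `torch.npu.stream(update_stream)`):

```python
from contextlib import contextmanager

entered = {"count": 0}

@contextmanager
def stream_ctx():
    # Stand-in for torch.npu.stream(update_stream); we only count how
    # often the context is entered.
    entered["count"] += 1
    yield

num_layers = 32

# Before: the stream context was entered once per layer.
entered["count"] = 0
for _ in range(num_layers):
    with stream_ctx():
        pass  # update this layer's params
before_entries = entered["count"]

# After: the context wraps the whole loop and is entered only once.
entered["count"] = 0
with stream_ctx():
    for _ in range(num_layers):
        pass  # update this layer's params
after_entries = entered["count"]

assert before_entries == num_layers and after_entries == 1
```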

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-10 17:18:45 +08:00
zhangxinyuehfad
d913f9474b [0.11.0][Fix] Fix Qwen2-Audio-7B-Instruct accuracy test (#4018)
### What this PR does / why we need it?

Fix Qwen2-Audio-7B-Instruct accuracy test

Backport:https://github.com/vllm-project/vllm-ascend/pull/4017

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-10 11:54:30 +08:00
hucong
7ea17fbee3 [0.11.0][BugFix] Improve the performance of prefixcache features (#4021)
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/4022

The code bug caused a pipeline bubble: when the npu_paged_cache_load
operator was called, it forcibly transferred seq_len2 to the device,
which triggered a synchronization and interrupted the CPU's
operator-launch stream.


---------

Signed-off-by: underfituu <hzhucong@163.com>
2025-11-10 11:51:34 +08:00
wangxiaoteng888
c2d58c0655 [P/D][BugFix][v0.11.0-dev]Fix proxy format processing errors & Layerwise connector performance optimization (#4069)
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-11-09 09:55:10 +08:00
wangx700
55e37f5041 [v0.11.0][Bugfix] fix sleepmode level2 e2e test (#4023)
### What this PR does / why we need it?
fix sleepmode level2 e2e test

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
use e2e tests

Signed-off-by: wangx700 <wangxin700@huawei.com>
2025-11-08 14:11:15 +08:00
tingfu
f9842560cb [0.11.0][Perf] Add padding vision tower for Qwen2_5_Omni (#4041)
### What this PR does / why we need it?
This PR replaces the vision tower in the Qwen2.5-Omni-Thinker model,
Qwen2_5_VisionTransformer, with AscendQwen2_5_VisionTransformer, which
uses QKV padding for better performance.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-08 13:56:05 +08:00
zxr2333
d4e2a44307 [Cherry Pick from pr#3981][0.11.0][P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3983)
### What this PR does / why we need it?
Make kv-transfer env variable take effect & Fix load-balance proxy.
Cherry Pick from #3981

---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-11-08 13:52:33 +08:00
offline893
8e72758645 [BugFix]Fix grouplist type of mc2. (#4049)
### What this PR does / why we need it?
Fix an EPLB accuracy problem caused by the PTA upgrade. This is a
backport of #4047.

### How was this patch tested?
Main:

baseline:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |

EPLB:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-11-07 17:43:23 +08:00
lilinsiman
016337eaec [v0.11.0][UT] Add new ut case for aclgraph enable (#4038)
### What this PR does / why we need it?
add new ut case for aclgraph enable

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-11-07 11:35:24 +08:00
Angazenn
f9494d978a [cherry-pick][v0.11.0-dev][bugfix] Fix a rare bug triggered by _npu_paged_attention in FULL_DECODE_ONLY mode (#3987)
### What this PR does / why we need it?
This is cherry-picked from #3986.

This PR fixes a bug where the workspace of `_npu_paged_attention`
computed at setup is smaller than what execution requires. In the current
FULL_DECODE_ONLY implementation with `_npu_paged_attention`, we call
`_npu_paged_attention_get_workspace` when capturing, with `max_model_len`
as `seq_lens`. This assumes that PA with larger `seq_lens` inputs needs a
larger workspace than with smaller `seq_lens`. However, there are rare
cases where PA with smaller `seq_lens` requires a larger workspace, so I
add `get_workspace` directly into `update_attn_params`.
This change might introduce a slight (≈1%) performance degradation for
small num_tokens (such as 1) in the decode phase, and there are no other
known memory issues, so I think this change is acceptable. We can remove
it if a new attention op (such as `npu_fused_infer_attention_score`) does
not have this problem.


Signed-off-by: Angazenn <supperccell@163.com>
2025-11-06 23:08:57 +08:00
Shanshan Shen
27547a10e6 [MM][Bugfix] Add MoE verification for multi-modal models (#3897) (#4027)
### What this PR does / why we need it?

Fix #3891.

The empty `moe_comm_method` in the above issue is due to an incorrect
check for MoE models. To be specific, the method `is_moe_model` only
checks whether a text-only model is a MoE model, without considering
multi-modal models, e.g., `VL` and `Omni`.

Fix: check the config dict recursively for a key containing "expert",
without checking the model architecture.

It is worth noting that we can't verify a model by whether it contains a
`FusedMoE` module, because `is_moe_model` is called before model loading,
e.g., when updating the ACLGraph config during platform initialization.
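The recursive key search can be sketched as follows (a hedged reconstruction; the actual `is_moe_model` helper may differ):

```python
def config_has_expert_key(config) -> bool:
    """Recursively search a (possibly nested) config structure for any
    dict key containing the substring 'expert'."""
    if isinstance(config, dict):
        for key, value in config.items():
            if isinstance(key, str) and "expert" in key:
                return True
            if config_has_expert_key(value):
                return True
    elif isinstance(config, (list, tuple)):
        return any(config_has_expert_key(v) for v in config)
    return False

# A VL model whose MoE fields sit under a nested text_config is now found.
vl_cfg = {"text_config": {"num_experts": 64, "hidden_size": 4096}}
assert config_has_expert_key(vl_cfg)
assert not config_has_expert_key({"hidden_size": 4096})
```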

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-06 20:30:40 +08:00
zzzzwwjj
3db53d117e [0.11.0][doc] add aclgraph developer guide (#3947)
### What this PR does / why we need it?
Add aclgraph developer guide.

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-06 09:54:38 +08:00
wangxiyuan
7ee0b0b5d8 [cherry-pick]Upgrade CANN to 8.3.rc1 (#3945) (#3962)
This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version
check logic.

TODO: we notice that UT runs fail with the CANN 8.3 image, so the base
image for UT is still 8.2. We'll fix this later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-06 09:05:08 +08:00
Zetong Li
66b67f9cf2 [Bugfix][SHM] Fix weak memory ordering problem in share memory (#3988)
### What this PR does / why we need it?
This PR aims to fix a weak-memory-ordering problem in shared memory by
patching the message queue with an additional lock. The detailed issue
can be found at https://github.com/vllm-project/vllm/issues/27858. The
key point is to use the writer lock to enforce a memory fence before the
ready flag `metadata_buffer[0] = 1` is set.

This is a temporary solution; you can enable it by setting the env
`SHM_BARRIER=true`. By default, this modification is disabled.
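The idea behind the fix can be sketched with an ordinary threading lock (names are illustrative; the real patch guards a cross-process shared-memory queue):

```python
import threading

writer_lock = threading.Lock()
metadata_buffer = bytearray(8)
payload_buffer = bytearray(64)

def write_message(payload: bytes) -> None:
    # Holding the lock across both the payload write and the ready-flag
    # store means the lock release acts as a memory fence: readers that
    # observe the flag are guaranteed to see the completed payload.
    with writer_lock:
        payload_buffer[: len(payload)] = payload
        metadata_buffer[0] = 1  # ready flag set last, inside the lock

write_message(b"hello")
assert metadata_buffer[0] == 1
assert bytes(payload_buffer[:5]) == b"hello"
```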

### Does this PR introduce _any_ user-facing change?
`SHM_BARRIER=true` enables this change while `SHM_BARRIER=false`
disables this change. The latter is the default choice.

### How was this patch tested?
by ci

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-11-04 23:07:23 +08:00
zxr2333
954dab64fb [v0.11.0][P/D]Set adxl as default backend and update readme (#3771)
### What this PR does / why we need it?
Set the adxl engine as the default Mooncake backend, because Ascend
Transport is no longer maintained.
Update the README to include instructions for installing Mooncake with
the adxl backend.

### Does this PR introduce _any_ user-facing change?
Users need to compile and install the mooncake backend for adxl
according to the revised README instructions.

### How was this patch tested?
By CI.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-11-04 16:06:58 +08:00
leo-pony
0cead5c1ee Quality enhancement: Immediately interrupt execution when allocate NPU memory OOM (#3944)
### What this PR does / why we need it?
Fail fast at the point where the problem first occurs: interrupt
execution as soon as an NPU memory allocation fails, rather than waiting
until an illegal address is accessed.


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-04 08:55:22 +08:00
Mengqing Cao
7cc6208029 [0.11.0][MTP][Aclgraph] Fix the support aclgraph with MTP (#3912)
### What this PR does / why we need it?
Fix 2 breaks of aclgraph with MTP:
1. deepseekmtp in vLLM 0.11.0 does not support aclgraph and lacks the
`support_torch_compile` decorator.
2. There is a d2h synchronization in the original forward of the MTP
predictor. The fix PR in vLLM:
https://github.com/vllm-project/vllm/pull/27643

As we'll fix it in vLLM main, this fix PR is only needed in branch
v0.11.0-dev.

Profiling shows that MTP now replays in aclgraph:
<img width="1612" height="1866" alt="a7d7f04155df4ed454b7eb20a92b2e2a"
src="https://github.com/user-attachments/assets/eaa4b9ff-aeb0-416d-964f-5a06e497f155"
/>

### How was this patch tested?

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-11-03 14:25:37 +08:00
wangxiyuan
8a7154001e [0.11.0]Chery pick pta upgrade change (#3940)
This PR cherry-picks two commits from main to upgrade torch-npu to the
2.7.1 official release.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-31 22:14:26 +08:00
rjg-lyh
3d81ea03ed [v0.11.0-dev][bugfix] fix valueError in static_forward_context when prefix is empty (#3929)
### What this PR does / why we need it?
This PR temporarily bypasses a scenario where some models in vLLM
trigger a `ValueError` while storing values in `static_forward_context`
when no `prefix` is specified for the linear layers, which is a bug in
those models. The official fix will be a PR to the vLLM community that
specifies a prefix for the linear layers in each model.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-31 15:45:06 +08:00
Nagisa125
9f7de45b75 [Bugfix] fix MTP support for lmhead_tensor_parallel_size (#3921)
### What this PR does / why we need it?
Fix the issue where enabling MTP with lmhead_tensor_parallel_size=16
causes inference to hang.


Signed-off-by: wyh145 <1987244901@qq.com>
2025-10-31 14:34:28 +08:00
lilinsiman
ee2e55e602 [v0.11.0][Test] Add new test model for aclgraph single_request v0.11.0 (#3889)
### What this PR does / why we need it?
add new test model for aclgraph single_request v0.11.0

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 11:23:55 +08:00
zouyida2052
90aca84e60 fix bug when max_seqs=14 in mtp=2 scenario and raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len (#3909)
### What this PR does / why we need it?
1. Revert [bugfix for mtp in
fullgraph](0948483642)
and re-introduce it once vLLM supports it.
2. Raise an error when cudagraph_capture_sizes is not an integer multiple
of uniform_decode_query_len.
3. Bugfix for max_num_seqs=14 in the mtp=2 scenario.
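The new validation from item 2 can be sketched as follows (a hypothetical version; the real vllm-ascend error message and call site may differ):

```python
def check_capture_sizes(capture_sizes, uniform_decode_query_len):
    # Reject any capture size that is not an integer multiple of the
    # uniform decode query length.
    for size in capture_sizes:
        if size % uniform_decode_query_len != 0:
            raise ValueError(
                f"cudagraph capture size {size} must be an integer "
                f"multiple of uniform_decode_query_len="
                f"{uniform_decode_query_len}")

# With mtp=2, each decode step queries 3 tokens per request
# (1 verified token + 2 speculative tokens), so capture sizes must be
# multiples of 3.
check_capture_sizes([3, 6, 12], 3)  # passes silently

try:
    check_capture_sizes([3, 14], 3)  # 14 % 3 != 0
    raised = False
except ValueError:
    raised = True
assert raised
```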

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-31 09:25:06 +08:00
lilinsiman
387ce1cc5b add new e2e tests case for aclgraph memory to v0.11.0 (#3880)
### What this PR does / why we need it?
add new e2e tests case for aclgraph memory to v0.11.0

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 09:17:09 +08:00
wangxiaoteng888
38afd2c9cb [bugfix_v0.11.0]cancel tokenize for layerwise_proxy (#3913)
### What this PR does / why we need it?
cancel tokenize for layerwise_proxy
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-10-30 23:55:04 +08:00