xc-llm-ascend

Author	SHA1	Message	Date
Icey	e7b623b363	[BugFix][Fusion] Fix graph fusion failure problem (#5253 ) Currently, the vllm pull request (https://github.com/vllm-project/vllm/pull/24252) is causing operator fusion to fail. This issue was previously fixed by patching the backend. The root cause has been identified, and the problem can be resolved with this pull request. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-05 17:49:09 +08:00
lilinsiman	52863c4165	[Refactor][EAGLE] 2/N: load model and generate token (#5437 ) ### What this PR does / why we need it? 1. Refactor the eagle and mtp functions load_model and generate_token_ids 2. Remove redundant code in the mtp and eagle files 3. Refactor the UTs of these files. This is part 2/N of refactoring and merging mtp and eagle. Related RFC: https://github.com/vllm-project/vllm-ascend/issues/5467 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? UT and tests - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-01-05 14:07:54 +08:00
pichangping	50e7934415	MLA prefill preformance optimization (#5456 ) ### What this PR does / why we need it? Since the performance of the _npu_ring_mla operator deteriorates in long-sequence scenarios, long sequences are split into shorter sequences before being passed to the operator. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: pichangping <1337510399@qq.com>	2026-01-05 11:41:59 +08:00
panchao-hub	42774df744	[Bugfix] Fix weight transpose in RL scenarios (#5567 ) ### What this PR does / why we need it? In the training-inference switching scenario, there is no need to resume the model weights during KV cache resumption, as this would lead to format mismatch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com>	2026-01-05 09:17:26 +08:00
LookAround0301	d25a2c20c5	[Bugfix] Fix chunk prefill bug for long_sequence feature (#5444 ) ### What this PR does / why we need it? Fix chunk prefill bug for long_sequence feature When there are two requests with chunk prefill enabled in the long-sequence scenario, if one request has only 1 token during scheduling, it will be identified as a decode request and trigger an error. This PR fixes the issue. Closes: https://github.com/vllm-project/vllm-ascend/issues/5445 - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2026-01-05 09:16:36 +08:00
Qiu	f15dc3fa02	[bugfix](pcp) expand max_num_tokens for pcp pad (#5478 ) ### What this PR does / why we need it? Since the [PR](https://github.com/vllm-project/vllm/pull/28988) for PCP modifications to `GPUModelRunner` has not yet been merged into vLLM, this PR temporarily requires adjustments to certain buffer sizes. These changes can be reverted once the original [PR](https://github.com/vllm-project/vllm/pull/28988) is merged. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-04 17:25:40 +08:00
Qiu	7c210225a2	[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382 ) ### What this PR does / why we need it? This PR adds multi-stream for GQA to enable computation-communication overlap. For chunked prefill, we reduce TTFT by approximately 4%. ### Does this PR introduce _any_ user-facing change? No - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-04 16:33:18 +08:00
drslark	363ac1b80f	[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477 ) ### What this PR does / why we need it? Support full-graph mode with Qwen3-Next-MTP. Specifically, we adapted `AscendAttentionState.ChunkedPrefill` in the main model and in the mtp model. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We changed the Qwen3-Next-MTP test in `tests/e2e/multicard/test_qwen3_next.py` to make it a test of `FULL_DECODE_ONLY`. We then ran `pytest -s tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`, and the test passed. ```text . ================================================================================================================================= warnings summary ================================================================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) ===================================================================================================================== ``` - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-04 12:03:21 +08:00
Chu Yuelin	d07d8a4535	[Model] Add LongCat-Flash (#3833 ) ### What this PR does / why we need it? Add LongCat-Flash support. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed - vLLM version: v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chuyuelin <923822139@qq.com> Co-authored-by: chuyuelin <chuyuelin1@huawei.com>	2025-12-31 17:06:55 +08:00
zhenwenqi2024	5d9fde9819	[Feature] Refactor PCP &DCP related code (#5214 ) ### What this PR does / why we need it? Refactor PCP- and DCP-related code. A pcp_manager class now manages PCP and DCP in one place. This lets a large amount of code be deleted from model_runner and reduces the risk of other development breaking PCP and DCP. RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-31 09:29:57 +08:00
weiguihua2	15d73f248e	[refactor] refactor model runner capture model (#5230 ) ### What this PR does / why we need it? Refactor the `capture_model` method in model_runner to directly reuse the method from vLLM. Currently, most of the logic in the capture_model method is similar to that in the vLLM code. Using the vLLM method directly reduces the maintenance cost of the vllm-ascend code. Changes: 1. Refactor the capture_model function to directly inherit the community method. 2. Refactor the initialize_aclgraph_capture function and move it to initialize_attn_backend. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-30 08:32:14 +08:00
Nengjun Ma	5e96f94d2a	Update corresponding vllm commit ID to 12 29 (#5475 ) ### What this PR does / why we need it? - Fixes vllm break: 1. [[BugFix] register quant scale tensors as buffer #31395] (https://github.com/vllm-project/vllm/pull/31395) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-12-29 22:48:05 +08:00
Ronald	e7e1a7dc05	[Feature] support eager mode in model runner v2 (#5210 ) ### What this PR does / why we need it? #5051 only implemented a basic framework for model runner v2, and e2e functionality still has some bugs. This PR aims to enable basic functionality. Model runner v2 plans: https://github.com/vllm-project/vllm-ascend/issues/5208 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-29 15:28:34 +08:00
yeyifan	4da46da9bf	[feature] fia support sliding windows (#5239 ) Enable fia to support sliding window function and adapt to the Gemma3 model. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: nsdie <yeyifan@huawei.com>	2025-12-29 14:56:25 +08:00
anon189Ty	3e67e8276c	[Feature] Support to use fullgraph with eagle (#5118 ) ### What this PR does / why we need it? Support full-graph mode with eagle. Change list: 1. Distinguish between processing graph_params and draft_graph_params in attention_v1. 2. Adapt full-graph mode in eagle_proposer, including: 1) When full graph is used, create the full-graph wrapper when loading the model. 2) Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode. 3) Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping. 4) Add a descriptor. 5) Set the running mode and trigger the metadata update. 3. Change is_mtp_model to is_draft_model, and add the workspace update. NOTE: When async_scheduling=True is set, the draft model is forced to run in eager mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>	2025-12-29 09:54:51 +08:00
weijinqian0	dbe4c338f2	[Refactor] cache cos/sin in mla & remove parameter model in builder. (#5277 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Cache cos/sin in mla 2. AttentionBuilder inherits from the original class of vllm. version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-28 10:35:07 +08:00
jiangkuaixue123	e91e11d3b0	[bugfix] fix typo of _skip_all_reduce_across_dp_group (#5435 ) ### What this PR does / why we need it? fix typo of _skip_all_reduce_across_dp_group ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>	2025-12-27 17:50:04 +08:00
hwhaokun	cb2fbf7df2	[bugfix] solve dp scenario Host-Device sync (#5298 ) ### What this PR does / why we need it? In the speculative decoding scenario, the original code performs Host-Device synchronization, which slows down the main model's execution speed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hwhaokun <haokun0405@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-27 10:36:59 +08:00
wangxiyuan	d1f0df7b4b	Revert "MLA prefill preformance optimization (#5275 )" (#5410 ) We'll release 0.13.0 soon. The main branch is frozen. Let's revert the newest change and redo it once 0.13.0 is released - vLLM version: release/v0.13.0 - vLLM main: `81786c8774`	2025-12-27 09:48:56 +08:00
pichangping	711f1861e4	MLA prefill preformance optimization (#5275 ) ### What this PR does / why we need it? Since the performance of the _npu_ring_mla operator deteriorates in long-sequence scenarios, long sequences are split into shorter sequences before being passed to the operator. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: pichangping <1337510399@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-27 09:19:45 +08:00
Jade Zheng	0dfdfa9526	[Feature] Enhance all-reduce skipping logic for MoE models in NPUModelRunner (#5329 ) Besides enabling `recompute_scheduler_enable`, we can skip all_reduce when max_num_batched_tokens is below mc2's requirement. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-26 17:39:44 +08:00
wangxiyuan	29d2fe653d	cleanup ascend config (#5296 ) 1. refresh additional config doc 2. move kv config logic to platform. 3. improve `dump_config` init logic and rename it to `dump_config_path` this change is user impacted. dump_config is changed from dict to string. 4. correct `enable_async_exponential` type 5. remove useless `chunked_prefill_for_mla` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-26 14:07:37 +08:00
XiaoxinWang	320877d488	move contiguous in fused_sigmoid_gating_delta_rule_update to model_runner_v1 (#5274 ) ### What this PR does / why we need it? The contiguous() operation temporarily increases memory usage, leading to higher peak GPU memory, which necessitates reducing gpu_memory_utilization. However, making tensors contiguous in modelrunnerv1 significantly enhances operator performance, resulting in greater end-to-end model benefits despite the memory overhead. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-26 09:19:47 +08:00
weiguihua2	d752c030e9	[Bugfix] fix pcp 128K break (#5266 ) ### What this PR does / why we need it? [Bugfix] Fixing the issue where 128K context does not work in long sequence scenarios. This issue is caused by not splitting num_token according to pcp_size during profile_run. During `profile_run`, a warm-up is performed based on `self.max_num_tokens`. When PCP is enabled, each PCP group will only schedule up to `self.max_num_tokens / pcp_size`. After `profile_run` is completed, the original scheduling size needs to be restored. This is a temporary workaround; once https://github.com/vllm-project/vllm/pull/28988/files is implemented, this part can be removed. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-25 11:58:52 +08:00
dsxsteven	30778f371b	[BugFix] Fix num_pcp_pads Assignment Issues (#5273 ) ### What this PR does / why we need it? The variable `self.num_pcp_pads` was incorrectly truncated during assignment, causing errors in certain scenarios such as PD disaggregated. This issue has now been resolved. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Co-author by: QiuChunshuo <qiuchunshuo@huawei.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-25 10:38:09 +08:00
Mengqing Cao	e54630e01c	Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) (#5317 ) ### What this PR does / why we need it? Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as it causes deepseek v3.2 hang error - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-24 22:24:17 +08:00
Slightwind	22138e2727	[main][Refactor] Remove `with_prefill` parameter from `set_ascend_forward_context` (#5094 ) Removes the redundant `with_prefill` parameter from `set_ascend_forward_context` to align the interface with vLLM's `set_forward_context` for future refactoring. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: Slightwind <slightwindsec@gmail.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-23 14:30:50 +08:00
Mengqing Cao	449f8f65a7	[KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) ### What this PR does / why we need it? Support the KV-Sharing feature in CLA (cross layer attention) models, which share the KV cache across some layers. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-23 10:48:31 +08:00
Li Wang	9a79cbaecb	[ModelRunner] Add hunyuan-vl basic support (#5151 ) ### What this PR does / why we need it? This patch adds handling of `XDRotaryEmbedding` in the model runner to support `hunyuan-vl`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with added and existing tests. Closes: https://github.com/vllm-project/vllm-ascend/issues/4992 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-23 10:46:54 +08:00
Wang Kunpeng	c3a8d13ca7	[refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204 ) ### What this PR does / why we need it? Remove unnecessary attributes from set_ascend_forward_context 1.prefetch_stream 2.weight_prefetch_method ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-23 08:49:52 +08:00
weijinqian0	95e8a52156	[Refactor] move the metadata from attention_v1 to util(ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Remove the pcp-related code from attention_v1. 2. Establish the inheritance relationship of CommonAttentionMetadata. TODO 1. extract common_cp 2. move cp metadata to common_cp. 3. remove commonAttentionMetadata for aclgraph. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-23 00:10:52 +08:00
zhangxinyuehfad	61efaffcaf	[Bugfix] Implement multimodal_cpu_fields in model runner (#5196 ) ### What this PR does / why we need it? Related to https://github.com/vllm-project/vllm-ascend/issues/4084 Implement multimodal_cpu_fields in model runner - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-22 18:39:45 +08:00
zhangsicheng5	78aa7f2693	[feature] support pcp + mtp in full graph (#4572 ) 1. support pcp + mtp in full graph 2. pcp/dcp related mtp bugfix 3. support pcp + mtpx - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-12-22 16:13:39 +08:00
Yizhou	60d9398f6d	[1/N][Eagle3] Aligns auxiliary hidden state usage for eagle3 models (#5162 ) ### What this PR does / why we need it? This is to prepare for the migration to vLLM's `EagleProposer`, it does not have `name` attribution. Also it's a breakdown of #5100 . Introduces logic to determine whether eagle3 heads require auxiliary hidden states based on configuration, ensuring consistent handling across related components. Prevents incorrect assumptions for eagle3 variants that do not use auxiliary outputs, improving compatibility and correctness. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-22 15:24:54 +08:00
YuhanBai	5d02eed16f	[Performance] Add async exponential while model executing (#4501 ) ### What this PR does / why we need it? Add a control that lets the exponential distribution operator overlap with model execution (OFF by default, because this feature might not perform well on MoE models, e.g. Qwen3-30B). Enabling async exponential overlapping provides a performance improvement. Overlapping the exponential operator with model execution can also hide the performance drop introduced by the AICPU version of the exponential operator. UPDATE (12/12): The overlap now uses the same stream that was introduced in #4908 . `do_async_exponential` has moved from `model_runner_v1.py` to `sampler.py`. Async exponential is now enabled through `additional_config`: add `"enable_async_exponential": 1` to `additional_config`. Only the default exponential/AI-CPU exponential is supported now; the old `"enable_async_exponential": 2` option has been removed for consistency. ### Does this PR introduce _any_ user-facing change? YES, this adds a new `additional_config` option: `"enable_async_exponential": 1`. When `enable_async_exponential` is set to 1, async exponential is enabled and overlaps with the model runner. When `enable_async_exponential` is set to 0 (the default), async exponential is disabled, but the exponential operator still runs on a different stream, the one introduced in #4908. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com> Signed-off-by: YuhanBai yuhan.bai0830@gmail.com	2025-12-20 21:23:21 +08:00
lianyibo	58773af708	[Fix] Delete pooling redundant code (#4940 ) ### What this PR does / why we need it? Remove redundant code in #3122. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lianyibo <lianyibo1@kunlunit.com>	2025-12-20 20:47:30 +08:00
wangxiyuan	758d81dcb1	Drop 0.12.0 support (#5146 ) We decided to release v0.13.0 soon. So no need to support 0.12.0 now. Let's drop it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-20 09:38:53 +08:00
weijinqian0	35ad11b637	[Refactor] remove some metadata variables in attention_v1. (#5160 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The metadata data class contains an excessive number of variables. We will inherit the metadata of the community and simultaneously remove some variables that are no longer needed at present. Todo: 1. remove attn_state partly. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-19 14:57:09 +08:00
zzzzwwjj	cc23067f1e	[refactor] refactor weight trans nz and transpose (#4878 ) ### What this PR does / why we need it? `VLLM_ASCEND_ENABLE_NZ` now has three options: 0: disable nz; 1: enable nz only in quantized cases; 2: enable nz whenever possible. The default is `VLLM_ASCEND_ENABLE_NZ`=1. All cases are shown in the table below: | | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 | |---|---|---|---|---|---| | trans nz | can't support nz | trans nz by default | trans nz by default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz | | transpose | only support not transpose case | only support transpose case | only support transpose case | linear: only support not transpose case<br>gmm: only support transpose case | same as fp16/bf16 | Some exceptional cases: 1. The MLAPO op needs additional processing on the weights, including the nz transformation. If the MLAPO op is used, some weights are forcibly transformed to nz; 2. The MLA/SFA weight `W_UV` is used by the op `torch.ops._C_ascend.batch_matmul_transpose`, which does not support nz currently; ### Does this PR introduce _any_ user-facing change? fp16/bf16 weights are no longer transformed to nz by default. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-19 14:27:24 +08:00
weichen	ca6f631cba	[2/N][Pangu][MoE] Remove Pangu Related Code (#5130 ) ### What this PR does / why we need it? Remove Pangu Related Code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weichen <calvin_zhu0210@outlook.com>	2025-12-19 09:00:07 +08:00
Chen Chen	1b47fca0e8	[bugfix] Use FUSED_MC2 MoE comm path for the op `dispatch_ffn_combine` (#5156 ) ### What this PR does / why we need it? - Renames the MoE comm enum value `MoECommType.FUSED_ALLTOALL` to `MoECommType.FUSED_MC2` and updates all call sites. - Updates `select_moe_comm_method` to optionally select `FUSED_MC2` on Ascend A3 when: - `enable_expert_parallel=True` - quantization is `w8a8_dynamic` - `EP <= 16` - `dynamic_eplb` is disabled - `is_mtp_model = False` - Replaces the old “fused all-to-all” comm implementation with `FusedMC2CommImpl`, using `TokenDispatcherWithMC2` / `PrepareAndFinalizeWithMC2` and `dispatch_ffn_combine`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-12-18 23:34:31 +08:00
Zetong Li	2304218f90	[Bugfix] Fix in_profile_run in mtp_proposer dummy_run (#5165 ) ### What this PR does / why we need it? This PR aims to fix failure of `enable_force_load_balance` caused by missing `in_profile_run` in `dummy_run` of mtp_proposer. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-12-18 22:27:47 +08:00
Angazenn	632eab28b7	[BugFix]Fix incorrect get_current_vllm_config (#5121 ) ### What this PR does / why we need it? This PR fixes some incorrect `get_current_vllm_config` calling, which creates empty vllm_config instead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-18 22:21:36 +08:00
Yizhou	ff3914e31a	[Fix] Refines decode mode padding condition for uniform queries (#5164 ) ### What this PR does / why we need it? The reason why we cannot use `self.cudagraph_batch_sizes[-1]` is that it's actually not the max number of tokens to be padded in `FULL_DECODE_ONLY` mode, much larger instead. And it's trimmed only before capturing to `compilation_cases`, this really caused us lots of trouble. Updates the logic to ensure padding occurs only when the number of input tokens falls within a valid uniform decode query range, improving consistency and avoiding unnecessary padding in specific decode modes. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-18 21:09:23 +08:00
Ronald	b69b04d3a9	implement model runner v2 basic framework (#5051 ) ### What this PR does / why we need it? This PR aims to implement the basic framework of model runner v2 in vllm-ascend; e2e functionality is not guaranteed by this PR. ### Does this PR introduce _any_ user-facing change? `envs.VLLM_USE_V2_MODEL_RUNNER` decides whether model_runner_v2 is used. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-18 15:51:54 +08:00
lidenghui1110	1c8c23de58	[Bugfix] fix pipeline parallelism bug introduced by async-scheduling refactor work (#4973 ) ### What this PR does / why we need it? Currently, when using pipeline parallel and pd disaggregate, model_runner will return None on non-last-pp-rank stages in `sample_tokens`, which will cause assert error in vllm KVOutputAggregator on [this line](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/utils.py#L84). In fact, all pp workers should return a model_runner_output which contains kv_connector_output to do aggregate in Enginecore scheduler process to ensure all kv transfer is finished for kv cache releasing later. To fix this issue, this PR follows gpu_model_runner in vllm, passing kv_connector_output in `sample_tokens` to make sure all ranks will return a ModelRunnerOutput, in non-last-pp-rank workers, it will return EMPTY_MODEL_RUNNER_OUTPUT with kv_connector_output. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2025-12-18 15:27:55 +08:00
shaopeng-666	39bdd4cfaa	fix profile run for vl model (#5136 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-12-17 23:51:31 +08:00
Yizhou	43d974c6f7	[Fix] Synchronize the host query_start_loc with device values to prevent shape mismatches (#5134 ) ### What this PR does / why we need it? Synchronize the host query_start_loc with device values to prevent shape mismatches when not enable async scheduling. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-17 23:50:12 +08:00
zhenwenqi2024	950570f8d1	[Bugfix]delele profile_run in model_runner (#5122 ) ### What this PR does / why we need it? Delete self.in_profile_run in model_runner so that EPLB works as expected ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-17 23:48:34 +08:00
Yuzhou Tong	7671ce1bf1	Fix a data conversion bug introduced by commit `3b7eb51` in main#4655 (#5115 ) ### What this PR does / why we need it? Fix a data conversion bug introduced by [main#4655](`3b7eb5179f`) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com> Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-17 20:19:02 +08:00
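
The `VLLM_ASCEND_ENABLE_NZ` table in commit cc23067f1e (#4878) can be read as a small decision function. The sketch below is an illustration of that table only, not the vllm-ascend implementation: `should_trans_nz` and `read_enable_nz` are hypothetical names, and the exceptional cases listed in the commit (MLAPO weights, `W_UV`) are deliberately not modeled. It also assumes that "trans nz by default" means the quantized cases use nz for both values 1 and 2.

```python
import os

# Hypothetical helper names; only the rows of the commit's table are encoded.
_NEVER_NZ = {"W4A4", "fp32"}   # table: "can't support nz"
_QUANT_NZ = {"W4A8", "W8A8"}   # table: "trans nz by default"
_FLOAT_NZ = {"fp16", "bf16"}   # table: "trans nz when VLLM_ASCEND_ENABLE_NZ is 2"


def should_trans_nz(weight_kind: str, enable_nz: int) -> bool:
    """Return True if a weight of this kind would be transformed to nz."""
    if enable_nz not in (0, 1, 2):
        raise ValueError(f"VLLM_ASCEND_ENABLE_NZ must be 0, 1 or 2, got {enable_nz}")
    if weight_kind in _NEVER_NZ:
        return False
    if weight_kind in _QUANT_NZ:
        # Assumption: quantized weights use nz for option 1 and option 2.
        return enable_nz >= 1
    if weight_kind in _FLOAT_NZ:
        return enable_nz == 2
    raise ValueError(f"unknown weight kind: {weight_kind}")


def read_enable_nz(environ=os.environ) -> int:
    """Read the option from the environment; the commit states the default is 1."""
    return int(environ.get("VLLM_ASCEND_ENABLE_NZ", "1"))
```

With the default value of 1, `should_trans_nz("bf16", read_enable_nz({}))` is `False`, which matches the user-facing change noted in that commit: fp16/bf16 weights are no longer transformed to nz by default.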


562 Commits