xc-llm-ascend

Author	SHA1	Message	Date
Zetong Li	fe3f2c7702	[Refactor][EAGLE] 3/N delete redundant methods in mtp_proposer (#5420 ) ### What this PR does / why we need it? This PR aims to delete redundant methods in mtp_proposer. All the deleted methods now can be found in eagle_proposer. We also remove some methods in eagle_proposer since they are identical to those in vllm-eagle. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-06 16:47:39 +08:00
lilinsiman	52863c4165	[Refactor][EAGLE] 2/N: load model and generate token (#5437 ) ### What this PR does / why we need it? 1. Refactor eagle and mtp function: load_model and generate_token_ids 2. Remove redundant code in mtp and eagle file 3. Refactor the UT of file 2/N of Refactor and merge mtp and eagle Relational RFC: https://github.com/vllm-project/vllm-ascend/issues/5467 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut and tests - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-01-05 14:07:54 +08:00
drslark	363ac1b80f	[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477 ) ### What this PR does / why we need it? Support full-graph mode with Qwen3-Next-MTP. In detail, we adapted `AscendAttentionState.ChunkedPrefill` in the main model and also in the mtp model. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We changed the test of Qwen3-Next-MTP in `tests/e2e/multicard/test_qwen3_next.py` to make it a test of `FULL_DECODE_ONLY`. Then we ran `pytest -s tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`, and this test passed. ```text . ===== warnings summary ===== <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ===== 1 passed, 2 warnings in 271.89s (0:04:31) ===== ``` - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-04 12:03:21 +08:00
zhenwenqi2024	5d9fde9819	[Feature] Refactor PCP &DCP related code (#5214 ) ### What this PR does / why we need it? Refactor PCP & DCP related code. We use a pcp_manager class to manage PCP & DCP in one place. With this change, a lot of code can be deleted from model_runner, and other developments are less likely to break PCP & DCP. RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-31 09:29:57 +08:00
Zetong Li	92353c0643	[Refactor][EAGLE] 1/N delete __init__ in mtp_proposer (#5176 ) ### What this PR does / why we need it? This PR aims to refactor eagle-related modules in vllm-ascend. This is the starting PR of the eagle refactoring. Given vllm-eagle, ascend-eagle and ascend-mtp, we first let ascend-mtp inherit from ascend-eagle and let ascend-eagle inherit from vllm-eagle. As an initial step, we just delete `__init__` in mtp_proposer and simplify the corresponding logic in eagle_proposer. Based on "vllm-eagle <----- ascend-eagle <----- ascend-mtp", our target is to gradually delete ascend-mtp and enable ascend-eagle to converge to vllm-eagle. So the main workspace is eagle_proposer. In this way, we hope that contributors can refactor eagle concurrently. Incoming changes: 1. delete common methods in vllm-eagle & ascend-eagle & ascend-mtp 2. delete `load_model` in mtp_proposer 3. delete `dummy_run` and `propose` in mtp_proposer 4. ...... RFC: #5467 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-12-29 16:25:52 +08:00
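
The inheritance chain described in the commit above ("vllm-eagle <----- ascend-eagle <----- ascend-mtp") can be sketched as follows. This is only an illustration: the class names and method bodies are hypothetical stand-ins, not the actual classes in vLLM or vllm-ascend.

```python
# Hypothetical stand-ins for the three proposers named in the commit.
class VllmEagleProposer:
    """Plays the role of vllm-eagle (the upstream base class)."""

    def load_model(self):
        return "vllm-eagle load_model"

    def propose(self):
        return "vllm-eagle propose"


class AscendEagleProposer(VllmEagleProposer):
    """Plays the role of ascend-eagle: overrides only what must differ."""

    def propose(self):
        return "ascend-eagle propose"


class AscendMtpProposer(AscendEagleProposer):
    """Plays the role of ascend-mtp: once redundant methods are deleted,
    everything is inherited from the classes above."""


mtp = AscendMtpProposer()
print(mtp.propose())     # resolved on the ascend-eagle stand-in
print(mtp.load_model())  # resolved on the vllm-eagle stand-in
```

With this layout, deleting a method from the most derived class makes the call fall through to the next class in the chain, which is the mechanism the refactoring series relies on.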
anon189Ty	3e67e8276c	[Feature] Support to use fullgraph with eagle (#5118 ) ### What this PR does / why we need it? This PR adds support for full-graph mode with eagle. Change list: 1. Distinguish between processing graph_params and draft_graph_params in attention_v1. 2. Adapt eagle_proposer to full-graph mode, including: 1). When full graph is used, wrap the model in a Fullgraph Wrapper at load time. 2). Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode. 3). Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping. 4). Add a descriptor. 5). Set the running mode and trigger the metadata update. 3. Change is_mtp_model to is_draft_model, and add the workspace update. NOTE: When async_scheduling=True is set, the draft model is forced to run in eager mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>	2025-12-29 09:54:51 +08:00
weijinqian0	dbe4c338f2	[Refactor] cache cos/sin in mla & remove parameter model in builder. (#5277 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Cache cos/sin in mla 2. AttentionBuilder inherits from the original class of vllm. version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-28 10:35:07 +08:00
Slightwind	22138e2727	[main][Refactor] Remove `with_prefill` parameter from `set_ascend_forward_context` (#5094 ) Removes the redundant `with_prefill` parameter from `set_ascend_forward_context` to align the interface with vLLM's `set_forward_context` for future refactoring. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: Slightwind <slightwindsec@gmail.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-23 14:30:50 +08:00
weijinqian0	95e8a52156	[Refactor] move the metadata from attention_v1 to util(ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Remove the pcp-related code from attention_v1. 2. Establish the inheritance relationship of CommonAttentionMetadata. TODO 1. extract common_cp 2. move cp metadata to common_cp. 3. remove commonAttentionMetadata for aclgraph. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-23 00:10:52 +08:00
zhangsicheng5	78aa7f2693	[feature] support pcp + mtp in full graph (#4572 ) 1. support pcp + mtp in full graph 2. pcp/dcp related mtp bugfix 3. support pcp + mtpx - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-12-22 16:13:39 +08:00
Qiu	ea6206bb18	[bugfix][ACLGraph][MTP] deletes `cudagraph_batch_sizes` in `MtpProposer` (#5183 ) ### What this PR does / why we need it? This PR deletes `cudagraph_batch_sizes` in `MtpProposer` and reuses the one in `NPUModelRunner`. During our deployment of DeepSeek-V3.2 with MTP across machines 2P2D and conducting AISBench stress testing, an error occurred (see below). After investigation, we found that `compilation_config.cudagraph_capture_sizes` is modified by `adjust_cudagraph_sizes_for_spec_decode` in `NPUModelRunner`. This modification only updates `cudagraph_batch_sizes` in `NPUModelRunner` but is not synchronized to `MtpProposer`. After discussion (CC @yiz-liu) , we believe it is unnecessary to maintain `cudagraph_batch_sizes` in `MtpProposer`; it should directly use the variable from `NPUModelRunner`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2025-12-22 14:08:27 +08:00
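
The desynchronization described in the commit above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the classes are hypothetical, and the adjustment method is a made-up placeholder for `adjust_cudagraph_sizes_for_spec_decode`, whose real logic is not shown in this log.

```python
class Runner:
    """Hypothetical stand-in for NPUModelRunner."""

    def __init__(self, capture_sizes):
        self.cudagraph_batch_sizes = list(capture_sizes)

    def adjust_sizes_for_spec_decode(self, tokens_per_request):
        # Placeholder adjustment (an assumption): the attribute is rebound
        # to a new list, so earlier copies never see the change.
        self.cudagraph_batch_sizes = [
            size * tokens_per_request for size in self.cudagraph_batch_sizes
        ]


class CopyingProposer:
    """Keeps its own copy, like the behavior the commit removes."""

    def __init__(self, runner):
        self.cudagraph_batch_sizes = list(runner.cudagraph_batch_sizes)


class DelegatingProposer:
    """Reads the runner's value on every access, so it cannot go stale."""

    def __init__(self, runner):
        self.runner = runner

    @property
    def cudagraph_batch_sizes(self):
        return self.runner.cudagraph_batch_sizes


runner = Runner([1, 2, 4])
stale = CopyingProposer(runner)
fresh = DelegatingProposer(runner)
runner.adjust_sizes_for_spec_decode(tokens_per_request=2)

print(stale.cudagraph_batch_sizes)  # still the pre-adjustment sizes
print(fresh.cudagraph_batch_sizes)  # follows the runner
```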
Zetong Li	2304218f90	[Bugfix] Fix in_profile_run in mtp_proposer dummy_run (#5165 ) ### What this PR does / why we need it? This PR aims to fix failure of `enable_force_load_balance` caused by missing `in_profile_run` in `dummy_run` of mtp_proposer. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-12-18 22:27:47 +08:00
Yizhou	543f122101	[Fix] Fix DeepSeek V3.2 "no attr" error (#5147 ) ### What this PR does / why we need it? Extracts repeated `attn_metadata[layer_name].decode` access into a single variable to improve code readability and reduce redundancy. Uses `getattr` with a default value to safely access the decode attribute, making the code more defensive against potential attribute errors. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-18 14:46:41 +08:00
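
The defensive access pattern described in the commit above (read `decode` once, through `getattr` with a default) can be sketched like this. The metadata objects and layer names here are hypothetical stand-ins; the real attention metadata in vllm-ascend carries many more fields.

```python
from types import SimpleNamespace

# Hypothetical per-layer attention metadata keyed by layer name.
attn_metadata = {
    "layer.0": SimpleNamespace(decode=SimpleNamespace(seq_lens=[3, 5])),
    "layer.1": SimpleNamespace(),  # this one has no `decode` attribute
}


def get_decode(metadata, layer_name):
    """Look up the decode metadata once; return None when it is absent."""
    return getattr(metadata[layer_name], "decode", None)


decode_0 = get_decode(attn_metadata, "layer.0")
decode_1 = get_decode(attn_metadata, "layer.1")

# Callers branch on None instead of hitting an AttributeError.
for name, decode in (("layer.0", decode_0), ("layer.1", decode_1)):
    if decode is None:
        print(f"{name}: no decode metadata, skipping")
    else:
        print(f"{name}: seq_lens={decode.seq_lens}")
```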
Yizhou	43d974c6f7	[Fix] Synchronize the host query_start_loc with device values to prevent shape mismatches (#5134 ) ### What this PR does / why we need it? Synchronize the host query_start_loc with device values to prevent shape mismatches when not enable async scheduling. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-17 23:50:12 +08:00
zhenwenqi2024	950570f8d1	[Bugfix] delete profile_run in model_runner (#5122 ) ### What this PR does / why we need it? Delete self.in_profile_run in model_runner to make EPLB work as expected. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-17 23:48:34 +08:00
JeffLee1874	724d04391e	[model] Support PanguUltraMoE (#4615 ) ### What this PR does / why we need it? To support PanguUltraMoE model ### Test result #### Start serving using W8A8 quantized model and ACL graph: Master node: ``` vllm serve $LOCAL_CKPT_DIR \ --host 0.0.0.0 \ --port 8000 \ --data-parallel-size 2 \ --data-parallel-size-local 1 \ --data-parallel-address $MASTER_NODE_IP \ --data-parallel-rpc-port 13389 \ --tensor-parallel-size 16 \ --seed 1024 \ --enable-expert-parallel \ --served-model-name $NAME \ --max-model-len 4096 \ --max-num-batched-tokens 256 \ --max-num-seqs 18 \ --trust-remote-code \ --gpu-memory-utilization 0.90 \ --quantization ascend \ --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \ --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}' \ ``` Other nodes: ``` vllm serve $LOCAL_CKPT_DIR \ --host 0.0.0.0 \ --port 8000 \ --headless \ --data-parallel-size 2 \ --data-parallel-size-local 1 \ --data-parallel-start-rank 1 \ --data-parallel-address $MASTER_NODE_IP \ --data-parallel-rpc-port 13389 \ --tensor-parallel-size 16 \ --seed 1024 \ --enable-expert-parallel \ --served-model-name $NAME \ --max-model-len 4096 \ --max-num-batched-tokens 256 \ --max-num-seqs 18 \ --trust-remote-code \ --gpu-memory-utilization 0.90 \ --quantization ascend \ --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \ --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}' \ ``` Request & Response: - Request ``` curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "messages": [ {"role": "system", "content": ""}, {"role": "user", "content": "你是谁？"} ], "max_tokens": "64", "top_p": "0.95", "top_k": "50", "temperature": "0.6", 
"add_special_tokens" : true }' ``` - Response ``` [unused16] 好的，用户问我是谁，我需要按照之前的设定来回答。首先，我的角色是盘古，由华为开发，属于推理模型。要强调我的主要功能是解答问题和提供信息支持，特别是通过逻辑推理和数据分析处理复杂任务。需要保持回答简洁，用中文，并且符合用户的 ``` - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: lijifu <lijifu4@huawei.com> Co-authored-by: lijifu <lijifu4@huawei.com>	2025-12-17 16:15:29 +08:00
Wang Yixuan	153eeaa621	[Bugfix] Fix DeepSeek FIA error in async_scheduling with mtp (#5046 ) ### What this PR does / why we need it? When async_scheduling is enabled in a large-scale EP scenario, the mtp module falls back to eager mode, which results in a mismatch between seq_lens_list and block_table. This PR adapts the check performed before the draft model forward. fix #4986 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-17 09:20:44 +08:00
zhenwenqi2024	eb4c08f05d	[bugfix] fix mtp accept rate (#5093 ) ### What this PR does / why we need it? 1. now, npu_model_runner reuses gpu_model_runner, this pr deletes some attrs already defined in gpu_model_runner 2. fix mtp accept rate by disabling in_profile_run 3. remove redundant moe method selection logic 4. Reverts vllm-project/vllm-ascend#5082, which broke CI in https://github.com/vllm-project/vllm-ascend/actions/runs/20266314048/job/58190426832?pr=5088 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: v0.12.0 vLLM main: `ad32e3e19c` vLLM version: v0.12.0 vLLM main: `ad32e3e19c` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-17 01:35:26 +08:00
zhenwenqi2024	4ed2951400	【Feature】refactor npu_modelrunner for profile_run (#4993 ) ### What this PR does / why we need it? (1)refactor npu_model_runner for profile_run (2) move _select_moe_comm_method to ascend_forward_context (3) delete _init_model_kwargs in npu_model_runner ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Na - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>	2025-12-16 17:44:04 +08:00
MengLong Chen	5e0ada5395	[Bugfix] Fix the attn_metadata is None (#5038 ) ### What this PR does / why we need it? Fix the bug "TypeError: 'NoneType' object is not iterable" in vllm_ascend/compilation/acl_graph.py. The cause is that attn_metadata is None in the dummy_run of MTP. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2025-12-16 09:14:05 +08:00
Jade Zheng	c064d11fd7	[Cleanup] Remove unused attn_metadata parameter from Proposer classes (#4862 ) The `attn_metadata` is not used by any draft proposer, so we can remove it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 21:21:38 +08:00
zzhxxx	e16444f21f	[Bugfix] Fix the bug in initializing the shared_weight communication domain in sfa-cp, and fix the mtp weight load in pp>1 situation (#4913 ) ### What this PR does / why we need it? In PR #4188, a small bug was introduced that caused sfa-cp to be unable to find the global_pp_size parameter during initialization, and this PR fixed the issue. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 16:21:49 +08:00
Chen Chen	aa02a85e4d	[bugfix] Fix dummy-run and multi-node issues in MoE routing and MTP (#4947 ) ### What this PR does / why we need it? - Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the routing kernel completes correctly instead of exiting early in certain paths. - Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher and prepare/finalize routines, aligning MoE communication with the MC2 algorithm optimized for Ascend devices. - Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back to `ALLTOALL` until the MoE communication type selection logic is fully finalized, avoiding incorrect behavior in dummy-run flows. - Simplify the MoE communication selection for Ascend 910-93 in `NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`, which fixes failures in multi-node / larger-EP configurations while keeping MC2 routing under the configured token capacity. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-12-15 14:18:23 +08:00
Yizhou	0686b32d82	[Fix] Fixes issues in MTP with async scheduling and ACL graph (#4963 ) ### What this PR does / why we need it? Corrects attention metadata size for MTP when both asynchronous scheduling and full ACL graph mode are enabled. This prevents potential size mismatches during execution. Additionally, improves the robustness of calculating token sample indices by explicitly aligning tensor shapes. Finally, prevents padding when the number of input tokens exceeds the maximum ACL graph batch size to avoid out-of-bounds errors. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Need to add corresponding test case ASAP. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Yizhou <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-14 00:10:11 +08:00
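
The padding guard mentioned in the commit above (do not pad when the token count exceeds the largest ACL graph batch size) might look roughly like the sketch below. The function name and the list of graph sizes are assumptions made for illustration; the actual logic in vllm-ascend is not shown in this log.

```python
import bisect


def padded_num_tokens(num_tokens, graph_batch_sizes):
    """Round num_tokens up to the nearest captured graph size.

    If num_tokens is larger than every captured size, return it unchanged:
    padding to a captured size is impossible, and indexing past the end of
    the size list would be an out-of-bounds access.
    """
    sizes = sorted(graph_batch_sizes)
    if not sizes or num_tokens > sizes[-1]:
        return num_tokens
    # bisect_left finds the first captured size >= num_tokens.
    return sizes[bisect.bisect_left(sizes, num_tokens)]


captured = [1, 2, 4, 8]  # hypothetical captured batch sizes
for n in (3, 8, 9):
    print(n, "->", padded_num_tokens(n, captured))
```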
wangxiyuan	fd7c929145	[perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks (#4983 ) pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix the merge conflict ### What this PR does / why we need it? Currently, the all_reduce operation in _sync_metadata_across_dp is performed with gloo backend which is extremely time-consuming when DPEngineCores are in different nodes. This operation cannot be ignored by async scheduling in multi-node-scenarios with speculative decoding (e.g., EAGLE, mtp). This pr eliminates the all_reduce operation for D Nodes and change the input parameter of MoEDispatch & MoeCombine operators to make MC2EP support different num_tokens across all ranks. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios while enabling async scheduling. This pr can remove cross-node all_reduce with gloo backend and further reduce latency with correct accuracy. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>	2025-12-13 18:59:54 +08:00
zhenwenqi2024	f708d919f8	[Feature] model_runner refactor (#4764 ) ### What this PR does / why we need it? Refactor npu_modelrunner so that it stays close to gpu_modelrunner. ### Does this PR introduce _any_ user-facing change? NO - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>	2025-12-12 17:27:09 +08:00
wangxiyuan	bb76f7962c	cleanup useless torchair logic (#4856 ) This PR cleans up useless torchair logic in the model runner. The moge doc is only for torchair, so it can be removed as well. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-11 11:21:13 +08:00
drslark	0fb1dc43a1	[BugFix][main] Adapted Qwen3-Next-MTP to chunked prefill (#4770 ) ### What this PR does / why we need it? The pad `-1` modification is from https://github.com/vllm-project/vllm/pull/25743. It still has bugs for batched chunked prefill. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 22:54:24 +08:00
linfeng-yuan	490ddf536f	[perf][dsv3.2][async_scheduling] improve dsv3.2 performance by eliminating HD synchronization (#4805 ) ### What this PR does / why we need it? This PR eliminates the implicit HD synchronization in the sfa backend, and in _build_dummy_attn_metadata and dummy_run in mtp_proposer, significantly improving dsv3.2 performance in low-latency scenarios. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Performance improvements are observed with E2E performance serving (P: DP4TP8EP32 D: DP8TP4EP32) with `num_speculative_tokens=3`. DSV3.2-W8A8-EXP: TPOT: 41.67ms -> 23.36ms ITL: 85.93ms -> 55.96ms DSV3.2-W8A8 (released in December): TPOT: 18.11ms ITL: 56.13ms - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-10 22:31:47 +08:00
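
For readers who want the DSV3.2-W8A8-EXP figures in the commit above as relative changes, the arithmetic is sketched below. The helper name is made up; the four latency values are the ones quoted in the commit message.

```python
def reduction_pct(before_ms, after_ms):
    """Relative latency reduction, in percent of the original value."""
    return (before_ms - after_ms) / before_ms * 100.0


tpot_drop = reduction_pct(41.67, 23.36)  # TPOT: 41.67ms -> 23.36ms
itl_drop = reduction_pct(85.93, 55.96)   # ITL: 85.93ms -> 55.96ms

print(f"TPOT reduced by {tpot_drop:.1f}%")
print(f"ITL reduced by {itl_drop:.1f}%")
```

This works out to a TPOT reduction of roughly 44% and an ITL reduction of roughly 35%.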
Yizhou	5b179c53f1	[FEAT] Support DeepSeek-V3.2 with `FULL_DECODE_ONLY` mode (#4706 ) ### What this PR does / why we need it? The first commit support `FULL_DECODE_ONLY`: - Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for slicing slots and positions, ensuring fixed tensor shapes. - Implement padding logic for `query_start_loc` in `NPUModelRunner` to support uniform decode in full graph mode, aligning with GPU runner behavior. - Adjust MLA cosine cache allocation to occur independently of graph mode and switch to using device-resident sequence lengths for attention metadata. - Remove redundant slicing of hidden states and outputs in `AscendSFAImpl` and optimize `sin`/`cos` cache updates. The second commit take MTP into account: - Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for slicing slots and positions, ensuring fixed tensor shapes. - Implement padding logic for `query_start_loc` in `NPUModelRunner` to support uniform decode in full graph mode, aligning with GPU runner behavior. - Adjust MLA cosine cache allocation to occur independently of graph mode and switch to using device-resident sequence lengths for attention metadata. - Remove redundant slicing of hidden states and outputs in `AscendSFAImpl` and optimize `sin`/`cos` cache updates. And the rest of them are just bugfix. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases needed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-10 20:11:09 +08:00
Chen Chen	848419d1ba	[Bugfix] Disable the dispatch_ffn_combine kernel in MTP path (#4751 ) ### What this PR does / why we need it? This PR fixes a smoke test failure. It adjusts mtp_proposer and model_runner_v1 to route MTP decoding through the non-fused MoE implementation while keeping the overall inference flow unchanged. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: mojave2 <chenchen145@huawei.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-09 22:14:05 +08:00
zzhxxx	866347a621	Deepseek Mtp model uses the lm_head and embedding from the main model (#2790 ) ### What this PR does / why we need it? In the Deepseek technical report, it is mentioned that the embedding and lmhead layers of the MTP layer are shared with the main model, but the current implementation independently loads the complete embedding and lmhead. In the Deepseek-R1 model, their weight sizes are 129280*7168 in fp16 format, which is 1.72G. This PR fixes the MTP layer to use the lmhead and embedding of the main model, saving 3.45G of GPU memory in the pure DP scenario. The current process will first create temporary spaces for the embedding and lmhead in the mtp layer, then I will call torch.equal to determine if the two matrices are the same. If they are the same, they will be reused, and the previous tensor will be released. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 10:33:29 +08:00
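
The memory figures quoted in the commit above can be checked with a short calculation. This sketch assumes 2 bytes per fp16 element and reports sizes in GiB (2**30 bytes); the constant and function names are chosen here for illustration.

```python
VOCAB_SIZE = 129280    # rows of the embedding / lm_head weight (from the commit)
HIDDEN_SIZE = 7168     # columns of the weight (from the commit)
BYTES_PER_FP16 = 2     # an fp16 element occupies two bytes


def weight_gib(rows, cols, bytes_per_elem=BYTES_PER_FP16):
    """Size of a dense rows x cols weight matrix in GiB."""
    return rows * cols * bytes_per_elem / 2**30


one_matrix = weight_gib(VOCAB_SIZE, HIDDEN_SIZE)
saved = 2 * one_matrix  # embedding + lm_head are both shared with the main model

print(f"one matrix: {one_matrix:.3f} GiB")
print(f"saved by sharing both: {saved:.3f} GiB")
```

The result is about 1.726 GiB per matrix and about 3.45 GiB for the pair, consistent with the 1.72G and 3.45G stated in the commit message.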
Ronald	916a9a1913	fix synchronize error of exceeds_max_model_len d2h copy (#4708 ) ### What this PR does / why we need it? There is a d2h copy in the mtp propose method that blocks CPU operations and makes execution host-bound. This PR refactors it to use a CPU tensor instead. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? vllm main f5d3d93c40417c296c20dc301100e55708a17f3f - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 09:07:59 +08:00
Ronald	3480094d7c	support async mtp (#4511 ) ### What this PR does / why we need it? This PR adds async_scheduling support for mtp, following vllm PR https://github.com/vllm-project/vllm/pull/24799. It also fixes some synchronization problems in vllm-ascend. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:15:57 +08:00
Zhu Yi Lin	f067623afd	[Bugfix] fix mtp and eagle aclgraph bug (#4710 ) ### What this PR does / why we need it? fix mtp and eagle aclgraph bug - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 11:22:57 +08:00
LookAround0301	b32ef53b3b	[long_seq] remove long_seq env (#4660 ) ### What this PR does / why we need it? remove env VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL - vLLM version: v0.12.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 10:31:49 +08:00
ZYang6263	7271f0d536	[Feat] MTP support DeepSeekV3.2 (#4465 ) ### What this PR does / why we need it? Currently, MTP does not support the DeepSeekV3.2 model. In this PR, we have enabled this feature. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-12-03 14:24:33 +08:00
wangxiyuan	7f2673ea2d	upgrade vLLM to main (#4608 ) 1. fix https://github.com/vllm-project/vllm/pull/28542 The model structure modifications we are involved in are: - Qwen2.5-VL (some patches still exist) - Qwen2-VL - Qwen2 - DeepSeek series - Qwen-moe series 2. fix https://github.com/vllm-project/vllm/pull/29121 the output token type has changed from np to `list[list[int]]` 3. fix https://github.com/vllm-project/vllm/pull/29262 the `xformers` backend for multimodal has been deprecated 4. fix https://github.com/vllm-project/vllm/pull/29342 5. fix https://github.com/vllm-project/vllm/pull/28579 6. fix https://github.com/vllm-project/vllm/pull/28718 7. fix https://github.com/vllm-project/vllm/issues/28665 8. fix https://github.com/vllm-project/vllm/pull/26847 vllm introduced the `optimization-level`, some default configs have been changed, and the param `--enforce-eager` has been deprecated 9. fix http://github.com/vllm-project/vllm/pull/29223 it returns a tuple for the sampler. 10. fix https://github.com/vllm-project/vllm/pull/29471 we'll remove the related patch to avoid this kind of error. Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-12-02 22:10:52 +08:00
MengLong Chen	143e1f46d0	[Feat] shared expert dp for deepseek_mtp (#3811 ) ### What this PR does / why we need it? Support shared expert DP for the deepseek_mtp feature. `shared_expert_dp` requires `SP==True`, with corresponding parameter restrictions. Previously, due to the coupling between `shared_expert_dp` and torchair, and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of deepseek_mtp was temporarily removed. Currently, by performing the `reduce_scatter` on the input of deepseek_mtp in `mtp_proposer.py`, we ensure that it matches the dimensions of `input_embedding`, and then perform the `all_gather` on the output of mtp. ### How was this patch tested? baseline: <img width="1184" height="692" alt="image" src="https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3" /> enable shared_expert_dp and multistream_overlap_shared_expert: <img width="1167" height="687" alt="image" src="https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc" /> TPOT: 48ms -> 45.4ms Average TPS per rank: 117.6 -> 126.1 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: chenmenglong <chenmenglong1@huawei.com> Signed-off-by: zengran <zengran2@huawei.com> Co-authored-by: zengran <zengran2@huawei.com>	2025-12-01 20:44:11 +08:00
Jade Zheng	51c8f60eb0	[Bugfix] Resolve MTP > 1 issue when lm head tp > 1 (#4254 ) ### What this PR does / why we need it? Previously, the dummy run executed compute_logits only once, regardless of num_speculative_tokens. This caused execute_model to hang on compute_logits when lm head tensor parallelism exceeded 1. The fix ensures compute_logits executes correctly during dummy run, matching num_speculative_tokens. I set the `non_blocking` argument to False when moving `exceeds_max_model_len` to the CPU. From what I understand, using `non_blocking=True` and immediately accessing the tensor on the CPU can cause accuracy problems. However, this issue doesn't happen when transferring data to a device. ref: https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/18 - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-01 10:22:36 +08:00
wangxiyuan	1eb5295a1b	remove qwen3-next model file (#4573 ) Let's remove the qwen3-next model file for now. We'll support it later by using the original vLLM model file. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 18:37:26 +08:00
wangxiyuan	bc69d7cfe1	upgrade to vllm 0.11.2 (#4400 ) Bump vLLM version to v0.11.2 What's broken and changed by vLLM: 1. structured_output is broken by https://github.com/vllm-project/vllm/pull/26866 2. get_mrope_input_positions is broken by https://github.com/vllm-project/vllm/pull/28399 3. graph mode is broken by https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to 2.8 to fix the problem later 4. embedding is broken by https://github.com/vllm-project/vllm/pull/27583 5. `get_attn_backend_cls` and the attention backend are broken by https://github.com/vllm-project/vllm/pull/28534 6. spec decode is broken by https://github.com/vllm-project/vllm/pull/28771 7. sp feature is broken by https://github.com/vllm-project/vllm/pull/27126 8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922 9. lora is broken by https://github.com/vllm-project/vllm/pull/21068 10. execute_model is broken by https://github.com/vllm-project/vllm/pull/26866 11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by https://github.com/vllm-project/vllm/pull/28159 12. kv cache is broken by https://github.com/vllm-project/vllm/pull/27753 13. dp is broken by https://github.com/vllm-project/vllm/pull/25110 What's broken and changed by ourselves: 1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455 We'll remove model files in the future to avoid this kind of error 2. Engine core is broken by https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch file in the future. 3. Ascend scheduler is broken by https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend scheduler later. 4. qwen3-next is broken by https://github.com/vllm-project/vllm/pull/28083 We'll remove model files in the future to avoid this kind of error 5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764. We'll remove model files in the future Known issue: 1. ray doesn't work 2. the accuracy of qwen3-next is not correct 3. qwen3-vl is broken 4. prefix cache + ascend scheduler + deepseek v2 lite is broken. Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: shen-shanshan <467638484@qq.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-11-26 11:48:58 +08:00
weijinqian0	ae068a3342	[Refactor] remove moe type of multicast. (#4224 ) The main purposes of this PR are as follows: 1. Remove the multicast-related code; Reason: 1. In scenarios such as a2 Dual-System Back-to-Back Networking, the performance is worse than all_gather. Before the modification, in e2e test, it was 3 tps; after the modification, it is 10 tps. 2. At the same time, we usually enable the SP feature, which is consistent with the current logic. 3. The advantage of broadcast communication is that it does not suffer from uneven DP load and does not require the prefill ACL graph to be enabled. But prefill ACL graph is now supported, so we think there is no need to maintain multicast as an option in moe communication. Performance benefits are as follows: When enable_flashcomm1 is not enabled, TTFT remains relatively stable at around 43000ms, which is approximately 15000ms faster than before the modification. When enable_flashcomm1 is enabled, there is no difference; TTFT remains relatively stable at around 29000ms. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: weijinqian0 <1184188277@qq.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-11-24 17:32:37 +08:00
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377 ) There is a lot of hack code for v0.11.0, which makes the code hard to upgrade to a newer vLLM version. Since v0.11.0 will release soon. Let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
Yizhou	97999347c8	[Fix] Remove unnecessary NPU synchronization in MTP proposer (#4325 ) ### What this PR does / why we need it? Remove unnecessary NPU synchronization in MTP proposer to improve performance. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in #4233 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-11-24 14:07:10 +08:00
anon189Ty	5c9f4a40c6	[Feat] Support MTP to running in full graph mode (#3892 ) ### What this PR does / why we need it? Currently, the MTP model still runs in eager mode when full graph mode is enabled. This PR adapts the MTP to full graph capture and execution. When the graph mode is set to "FULL_DECODE_ONLY", the MTP will run in full-graph to improve performance. The changes, which apply to both the disable_padded_drafter_batch True and False cases, include: 1. Add _mtp_graph_params in acl_graph.py to isolate the data of main model and the data of MTP. 2. Padding some metadata in mla_v1.py when in fullgraph mode. 3. Fixed the essential data address that will be used in model.forward. 4. Adapted according to the aclgraph capture framework: 1). Rebuild MTP model with ACLGraphWrapper. 2). Add common attn metadata when start capture in MTP dummy_run. 3). Add common attn metadata update in MTP. 4). Adapted data update when num_speculative_tokens > 1. 5. Add a patch of MTP to adapt vllm v0.11.0. Existing Issues: 1. When disable_padded_drafter_batch=True and running in FullGraph mode, the data of the first-round requests in MTP is abnormal. We need to identify the cause subsequently. 2. When disable_padded_drafter_batch=False and running in FullGraph mode, the acceptance rate of the second and third tokens will decrease (For example, if we set the num_speculative_tokens=3, the acceptance rate of first token is 90%, the second is only 50% lower than 60%, the third is only 20% lower than 30%). The reason is that the data processed after the model runs does not match. This is a problem from another PR. It works fine in eager and PIECEWISE mode, but has problem in FullGraph mode. Once we have a solution, we will submit a bugfix. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-11-20 20:34:54 +08:00
22dimensions	c272747d13	Upgrade to 0.11.1 newest vllm commit (#3982 ) ### What this PR does / why we need it? adapt vllm-ascend main branch with vllm releases/v0.11.1 fix `forward context not set` in test_vlm.py caused by: https://github.com/vllm-project/vllm/pull/23207 fix import `cdiv round` failed caused by: https://github.com/vllm-project/vllm/pull/27188 fix import `init_cached_hf_modules` failed caused by: https://github.com/vllm-project/vllm/pull/27567 adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel` caused by: https://github.com/vllm-project/vllm/pull/27654 - remove unused code in sigmoid_gating.py - `class FusedRecurrentFunction` , `fused_recurrent_gated_delta_rule`, `fused_recurrent_gated_delta_rule_fwd` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-11-12 23:01:19 +08:00
zhangsicheng5	a123f355e9	[feature] support pcp + mtp (in pd co-locate scenario) (#4098 ) 1. support pcp + mtp in pd co-locate scenario 2. llmdatadist connector pcp-related bugfix and code cleanup - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-11-12 17:22:21 +08:00
drslark	23b785fdfb	[Feat] Adapted mtp function to Qwen3-next (#3918 ) ### What this PR does / why we need it? Adapts mtp function to Qwen3-next. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: drslark <slarksblood@qq.com>	2025-11-07 16:39:03 +08:00
whx	e9bb4491ec	[BugFix] Fix deepseek v3.2 mtp bug. (#3900 ) ### What this PR does / why we need it? This PR fixes deepseek v3.2 mtp bug. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? All existing CI tests should pass. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-11-04 14:06:59 +08:00

76 Commits