xc-llm-ascend

Author	SHA1	Message	Date
Nengjun Ma	42c989a437	Update vllm pin to 12.24 (#5307 ) ### What this PR does / why we need it? Fix the breakage caused by this vLLM PR: 1. [Add MiMo-V2-Flash support](https://github.com/vllm-project/vllm/pull/30836) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-24 17:24:31 +08:00
Chen Chen	9227e6af73	[bugfix] remove the EP buffer allocation introduced by fused-op dispatch_ffn_c… (#5284 ) ### What this PR does / why we need it? - This PR removes the Expert Parallel (EP) HCCL buffer allocation that was previously introduced by the fused op `dispatch_ffn_combine` (#3532 ), since the fused op has switched to the MC2 HCCL buffer (#5156 ). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-12-24 11:26:19 +08:00
wujinyuan1	7ff1db4b84	[Refactor]5/N Extract common code of mla_v1.py & extract mla_cp (#5097 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The CP-related functions differ significantly from those of normal MLA attention, yet the two are tightly coupled. Steps: 1) Split the common code in `AscendMLAMetadataBuilder.build` into four functions: `build_prefill_metadata`, `build_decode_metadata`, `build_cp_metadata`, and `build_chunked_metadata`. TODO: 1) refactor the function `_compute_prefill_context`; 2) refactor the functions `_mla_preprocess` and `_mla_decode_preprocess`; 3) extract shared data and processing functions from `attention_cp.py` and `mla_cp.py` into the common_cp file. - vLLM version: 0.13.0rc3 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-24 10:25:19 +08:00
Nengjun Ma	3b59f20a28	update to vllm 12-19 (#5223 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Fix vllm break: 1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](https://github.com/vllm-project/vllm/pull/29558) Fix Solution: Add the now-required `all2all_backend` parameter. In the original `set_splitting_ops_for_v1` implementation, this parameter only disables graph mode in `vllm` when `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. 2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](https://github.com/vllm-project/vllm/pull/30684) Fix Solution: The GPU does not need to convert qkv to 3D because its flash_attention operator accepts both 4D and 3D layouts (b s h d and s b (h d)), whereas the NPU's flash_attention_unpad operator only supports 3D (s b (h d)). Therefore, we need to introduce the reshape_qkv_to_3d operation. 4. Skip the Tencent-Hunyuan/HunyuanOCR test case, because it hits the following issue after the vLLM upgrade: https://github.com/vllm-project/vllm-ascend/issues/5297 ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-23 23:52:11 +08:00
Zhu Yi Lin	e14514e2fd	[Bugfix] quick fix balance scheduling patch (#5281 ) ### What this PR does / why we need it? quick fix balance scheduling patch - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-23 21:23:05 +08:00
weichen	ffe51eedd6	[Refactor][MoE] Reuse vLLM's all_reduce logic (#5189 ) ### What this PR does / why we need it? Move all_reduce logic to AscendFusedMoE.forward, reuse vLLM's logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: weichen <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-23 18:53:48 +08:00
Slightwind	22138e2727	[main][Refactor] Remove `with_prefill` parameter from `set_ascend_forward_context` (#5094 ) Removes the redundant `with_prefill` parameter from `set_ascend_forward_context` to align the interface with vLLM's `set_forward_context` for future refactoring. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: Slightwind <slightwindsec@gmail.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-23 14:30:50 +08:00
Mengqing Cao	449f8f65a7	[KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) ### What this PR does / why we need it? Support the KV-Sharing feature in CLA (cross-layer attention) models, which share the KV cache across some layers. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-23 10:48:31 +08:00
Li Wang	9a79cbaecb	[ModelRunner] Add hunyuan-vl basic support (#5151 ) ### What this PR does / why we need it? This patch adds handling of `XDRotaryEmbedding` in the model runner to support `hunyuan-vl`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with added/existing tests. Closes: https://github.com/vllm-project/vllm-ascend/issues/4992 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-23 10:46:54 +08:00
Shanshan Shen	6c478531f8	[CustomOp] Register AscendApplyRotaryEmb CustomOp and remove related patch (#4667 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with 
“TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-23 10:04:37 +08:00
Zhu Yi Lin	3d04ae8e7d	[Main] [Patch] support balance scheduling patch (#5212 ) ### Motivation. Limitations of the current vLLM v1 scheduling strategy: vLLM v1 scheduling currently enables chunked prefill by default, which processes prefill and decode requests together in a single scheduling step. This can hurt overall system throughput and performance in some scenarios. Balance scheduling addresses this by synchronizing the number of running-queue requests across all schedulers and delaying the scheduling of new requests, thereby improving the system's steady-state decoding time. This achieves: ✅ Adding `balance_gather` to the scheduler synchronizes the number of requests in the running queues between DP ranks. ✅ Balance scheduling improves the decode steady-state time, thereby increasing the overall output throughput of the inference system. ### Proposed Change. 1. Feature Overview In the vLLM scheduler, running requests (i.e., requests that have already started prefill computation) have the highest priority, followed by waiting requests (i.e., requests that have not yet been computed). When the inference system leaves its steady state, the scheduler schedules a batch of new requests for prefill and then synchronizes across the data-parallel (DP) ranks. This can force DP ranks that are doing only decode work to synchronize with ranks that are processing prefill tokens. Frequent prefill scheduling on some DP ranks can therefore degrade the overall output throughput. Balance scheduling synchronizes the number of running-queue requests across DP ranks and schedules new requests for prefill only when every scheduler has fewer running requests than the configured maximum (written `max_nun_requst` in the original description).
2. Implementation Design 3. Experiment Results - Fixed-length input scenario: In the performance test with 3.5K fixed-length input and 1.5K fixed-length output, throughput improved by approximately 18% after adding balance scheduling.

| Method | Model | Input Len | Request Count | Output Len | BatchSize | Average TTFT | Average TPOT | e2e duration | Input Token Throughput | Output Token Throughput | Request Throughput |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 6600 | 86.85 | 591.9s | 3030.5 | 1297.3 | 0.86 |
| Balance scheduling | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 7012 | 70.63 | 501.7s | 3575.7 | 1530.7 | 1.02 |

4. Demo PR: [#29721](https://github.com/vllm-project/vllm/pull/29721) --------- Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-23 09:04:38 +08:00
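The admission rule described in #5212 above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the patch: the function name `may_schedule_new_requests` and both parameter names are hypothetical, and `running_counts` merely stands in for whatever the `balance_gather` step collects from each DP rank.

```python
from typing import Sequence


def may_schedule_new_requests(running_counts: Sequence[int],
                              max_running: int) -> bool:
    """Return True only if every DP rank is below the running-request cap.

    running_counts: running-queue length gathered from each DP rank
                    (hypothetical stand-in for the balance_gather result).
    max_running:    per-scheduler cap on running requests (assumed name).
    """
    return all(count < max_running for count in running_counts)


# One rank is already at the cap, so new prefill work is held back everywhere.
print(may_schedule_new_requests([3, 8, 7], max_running=8))
# Every rank has headroom, so new requests may be admitted.
print(may_schedule_new_requests([3, 5, 7], max_running=8))
```

Under this reading, a single saturated rank delays prefill on all ranks, which is what keeps decode-only ranks from being repeatedly pulled out of their steady state.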
Wang Kunpeng	c3a8d13ca7	[refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204 ) ### What this PR does / why we need it? Remove unnecessary attributes from set_ascend_forward_context 1.prefetch_stream 2.weight_prefetch_method ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-23 08:49:52 +08:00
weijinqian0	95e8a52156	[Refactor] move the metadata from attention_v1 to util(ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Remove the pcp-related code from attention_v1. 2. Establish the inheritance relationship of CommonAttentionMetadata. TODO 1. extract common_cp 2. move cp metadata to common_cp. 3. remove commonAttentionMetadata for aclgraph. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-23 00:10:52 +08:00
ApsarasX	3d9954eff0	[Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205 ) ### What this PR does / why we need it? In code files such as `mooncake_connector.py`, `vllm_config.model_config.hf_config` is used to get the LLM configs. This approach works for LLMs, but not for multi-modal models. For multi-modal models, `vllm_config.model_config.hf_text_config` must be used instead to get the LLM configs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-12-22 20:21:45 +08:00
zhangxinyuehfad	61efaffcaf	[Bugfix] Implement multimodal_cpu_fields in model runner (#5196 ) ### What this PR does / why we need it? Related to https://github.com/vllm-project/vllm-ascend/issues/4084 Implement multimodal_cpu_fields in model runner - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-22 18:39:45 +08:00
zzzzwwjj	052e472453	[bugfix] fix w8a8dynamic fused_moe trans nz (#5199 ) ### What this PR does / why we need it? Currently, `torch_npu.npu_grouped_matmul_swiglu_quant` only supports weights in NZ format, so `w13_weight` and `w2_weight` must be forcibly converted to NZ. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-22 17:45:34 +08:00
lvjunqi	55beac9c91	[Feat]Xlite Qwen3-vl Support (#5228 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-VL model in Xlite. For more details about Xlite, see https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. The latest performance comparison data between Xlite and the default aclgraph mode is as follows: ### Does this PR introduce _any_ user-facing change? Xlite graph mode supports the Qwen3-VL model. ### How was this patch tested? vLLM version: v0.12.0 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: lvjunqi <lvjunqi1@huawei.com> Co-authored-by: lvjunqi <lvjunqi1@huawei.com>	2025-12-22 16:30:52 +08:00
zhangsicheng5	78aa7f2693	[feature] support pcp + mtp in full graph (#4572 ) 1. support pcp + mtp in full graph 2. pcp/dcp related mtp bugfix 3. support pcp + mtpx - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-12-22 16:13:39 +08:00
Zhu Yi Lin	12d581605b	[Triton]support swiglu_quant triton in w4a8 (#5161 ) ### What this PR does / why we need it? support swiglu_quant triton in w4a8 ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-22 16:01:58 +08:00
Yizhou	60d9398f6d	[1/N][Eagle3] Aligns auxiliary hidden state usage for eagle3 models (#5162 ) ### What this PR does / why we need it? This prepares for the migration to vLLM's `EagleProposer`, which does not have a `name` attribute. It is also a part split out of #5100 . Introduces logic to determine whether eagle3 heads require auxiliary hidden states based on configuration, ensuring consistent handling across related components. Prevents incorrect assumptions for eagle3 variants that do not use auxiliary outputs, improving compatibility and correctness. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-22 15:24:54 +08:00
Qiu	64669c4243	[misc][FlashComm1][ACLGraph] Incompatibility between Flashcomm1 and FULL_DECODE_ONLY. (#5200 ) ### What this PR does / why we need it? Currently, Flashcomm1 and FULL_DECODE_ONLY are incompatible. When both features are enabled, graph capture errors occur without clear error messages. After discussion, it has been determined that enabling FULL_DECODE_ONLY with Flashcomm1 in mixed deployment scenarios provides almost no TPOT benefit. Additionally, a reconstruction of the decode phase for flashcomm1 is currently underway. Therefore, related adaptation work is temporarily postponed and will be addressed after the decode phase reconstruction plan is finalized. For now, an assert will be added to provide clear error messages and correct deployment recommendations. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? NO - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2025-12-22 14:33:32 +08:00
Shanshan Shen	b84ad8c5d8	[CustomOp] Register AscendMMEncoderAttention CustomOp and remove related patch (#4750 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/30125, register `AscendMMEncoderAttention` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ✅ Run Qwen2.5-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ✅ Run Qwen3-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is 
written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-22 14:32:53 +08:00
Ascendyh	b2c121637f	[task] Add fused gdn gating triton kernel (#4304 ) ### What this PR does / why we need it? This commit introduces a Triton-based fused GDN gating kernel for Ascend NPU, aimed at improving performance in the Gated Delta Net workflow. ### Does this PR introduce _any_ user-facing change? It only adds and refactors internal Triton kernels and wrappers for Ascend. These are backend implementation details. There are no new APIs, flags, CLI options, or behavior changes visible to end users. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com>	2025-12-22 14:09:19 +08:00
Qiu	ea6206bb18	[bugfix][ACLGraph][MTP] deletes `cudagraph_batch_sizes` in `MtpProposer` (#5183 ) ### What this PR does / why we need it? This PR deletes `cudagraph_batch_sizes` in `MtpProposer` and reuses the one in `NPUModelRunner`. While deploying DeepSeek-V3.2 with MTP across machines (2P2D) and running AISBench stress tests, we hit an error. After investigation, we found that `compilation_config.cudagraph_capture_sizes` is modified by `adjust_cudagraph_sizes_for_spec_decode` in `NPUModelRunner`. This modification only updates `cudagraph_batch_sizes` in `NPUModelRunner` and is not propagated to `MtpProposer`. After discussion (CC @yiz-liu), we believe it is unnecessary to maintain `cudagraph_batch_sizes` in `MtpProposer`; it should directly use the variable from `NPUModelRunner`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2025-12-22 14:08:27 +08:00
wangxiyuan	492173cf89	[Misc] Cleanup useless print and logger (#5220 ) 1. Remove useless print 2. use vLLM logger 3. change useless INFO to DEBUG level - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-22 11:28:26 +08:00
Feng Liu	e117b3d693	[Perf] vectorize PCP/DCP loops in mla_v1.py (#5003 ) ### What this PR does / why we need it? - Replace nested PCP/DCP Python loops with fully vectorized tensor operations - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-22 11:06:30 +08:00
Feng Liu	49838d4bec	[Perf] vectorize PCP/DCP loops in attention_cp.py (#4944 ) ### What this PR does / why we need it? - Add explicit .contiguous() after permute/view to ensure mem-friendly layout - Replace nested PCP/DCP Python loops with fully vectorized tensor operations - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-22 11:06:19 +08:00
wangqiankun13	904c18f929	[Feature]Use DispatchGmmCombineDecode operator to replace MC2(Optional) (#5040 ) ### What this PR does / why we need it? This PR adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding. The operator implementation itself was added in a prior PR[#4139 ](https://github.com/vllm-project/vllm-ascend/pull/4139). This change only adapts the model execution path to optionally use the fused operator. When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) might be replaced by the single fused operator DispatchGmmCombineDecode. By default, the existing multi-operator MC2 implementation is preserved. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2025-12-21 15:23:59 +08:00
Angazenn	67a0325cf2	[BugFix]Fix wrong _cos, _sin instantiation (#5154 ) ### What this PR does / why we need it? This PR adds an additional check when creating the global `_cos` and `_sin`, so that they are not created when using `mrope` or an encoder-decoder model. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Angazenn <supperccell@163.com>	2025-12-20 22:52:50 +08:00
YuhanBai	5d02eed16f	[Performance] Add async exponential while model executing (#4501 ) ### What this PR does / why we need it? Add a switch that lets the exponential distribution operator overlap with model execution (OFF by default, because the feature might not perform well on MoE models such as Qwen3-30B). Enabling the async exponential overlap provides a performance improvement. Overlapping the exponential operator with model execution can also hide the performance drop introduced by the AICPU version of the exponential operator. UPDATE (12/12): The overlap now uses the same stream that was introduced in #4908 . `do_async_exponential` has moved from `model_runner_v1.py` to `sampler.py`. Async exponential is now enabled through `additional_config`: add `"enable_async_exponential": 1` to `additional_config`. Only the default exponential and the AICPU exponential are supported now; the old `"enable_async_exponential": 2` option has been removed for consistency. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new `additional_config` option: `"enable_async_exponential": 1`. When `enable_async_exponential` is set to 1, the async exponential is enabled and overlaps with the model runner. When it is set to 0 (the default), the async exponential is disabled, but the exponential operator still runs on a separate stream, the one introduced in #4908. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>	2025-12-20 21:23:21 +08:00
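As a small illustration of the option described in #4501 above, the following sketch builds the `additional_config` payload and serializes it. Only the key `enable_async_exponential` and its values 0 and 1 come from the commit message; the helper name `build_additional_config` is hypothetical, and how the resulting dictionary or JSON string is handed to the engine is not shown in the commit message, so that step is left out.

```python
import json


def build_additional_config(enable_async_exponential: bool) -> dict:
    """Build an additional_config dict with the async-exponential switch.

    The key name comes from the commit message; this helper itself is a
    hypothetical convenience and not part of vllm-ascend.
    """
    return {"enable_async_exponential": 1 if enable_async_exponential else 0}


config = build_additional_config(True)
# Serialize the dict, e.g. for passing it on as a JSON string.
print(json.dumps(config))  # prints {"enable_async_exponential": 1}
```

Using 0 and 1 rather than a boolean mirrors the values quoted in the commit message, which also notes that the former value 2 is no longer accepted.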
lianyibo	58773af708	[Fix] Delete pooling redundant code (#4940 ) ### What this PR does / why we need it? Remove redundant code in #3122. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lianyibo <lianyibo1@kunlunit.com>	2025-12-20 20:47:30 +08:00
wangxiyuan	bbde0f9743	[CI] fix lint (#5216 ) Fix CI lint error - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-20 17:03:25 +08:00
wangxiyuan	758d81dcb1	Drop 0.12.0 support (#5146 ) We decided to release v0.13.0 soon. So no need to support 0.12.0 now. Let's drop it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-20 09:38:53 +08:00
XiaoxinWang	0cc3fc357f	[pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818 ) ### What this PR does / why we need it? Add the `fused_sigmoid_gating_delta_rule_update` op for qwen3_next, which fuses `fused_gdn_gating` and `fused_recurrent_gated_delta_rule`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-19 16:34:11 +08:00
weijinqian0	35ad11b637	[Refactor] remove some metadata variables in attention_v1. (#5160 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The metadata data class contains an excessive number of variables. We will inherit the metadata of the community and simultaneously remove some variables that are no longer needed at present. Todo: 1. remove attn_state partly. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-19 14:57:09 +08:00
zzzzwwjj	cc23067f1e	[refactor] refactor weight trans nz and transpose (#4878 ) ### What this PR does / why we need it? `VLLM_ASCEND_ENABLE_NZ` now has three options: 0: disable NZ; 1: enable NZ only for quantized weights; 2: enable NZ wherever possible. The default is `VLLM_ASCEND_ENABLE_NZ=1`. All cases are shown in the table below:

| | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | NZ not supported | trans nz by default | trans nz by default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | NZ not supported |
| transpose | only the non-transposed case is supported | only the transposed case is supported | only the transposed case is supported | linear: only the non-transposed case is supported<br>gmm: only the transposed case is supported | same as fp16/bf16 |

Some exceptional cases: 1. The MLAPO op needs additional processing on the weights, including conversion to NZ. If the MLAPO op is used, some weights are forcibly converted to NZ; 2. The MLA/SFA weight `W_UV` is used by the op `torch.ops._C_ascend.batch_matmul_transpose`, which does not currently support NZ. ### Does this PR introduce _any_ user-facing change? fp16/bf16 weights are no longer converted to NZ by default. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-19 14:27:24 +08:00
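The "trans nz" row of the table in #4878 above can be read as a small decision function. The sketch below encodes only that row and the three `VLLM_ASCEND_ENABLE_NZ` values; the function name `should_convert_to_nz` and the string labels for weight kinds are hypothetical, and the exceptional cases listed in the commit message (MLAPO, `W_UV`) are deliberately not modeled.

```python
QUANT_KINDS = {"w4a8", "w8a8"}       # converted to NZ whenever NZ is enabled
FLOAT_KINDS = {"fp16", "bf16"}       # converted only in mode 2
UNSUPPORTED_KINDS = {"w4a4", "fp32"}  # NZ is not supported at all


def should_convert_to_nz(weight_kind: str, enable_nz: int = 1) -> bool:
    """Decide whether a weight would be converted to NZ format.

    weight_kind: hypothetical label such as "w8a8" or "bf16".
    enable_nz:   value of VLLM_ASCEND_ENABLE_NZ (0, 1, or 2; default 1).
    """
    if enable_nz not in (0, 1, 2):
        raise ValueError(f"unsupported VLLM_ASCEND_ENABLE_NZ value: {enable_nz}")
    if weight_kind in UNSUPPORTED_KINDS or enable_nz == 0:
        return False
    if weight_kind in QUANT_KINDS:
        return True  # modes 1 and 2 both cover the quantized cases
    if weight_kind in FLOAT_KINDS:
        return enable_nz == 2
    raise ValueError(f"unknown weight kind: {weight_kind}")


print(should_convert_to_nz("w8a8"))               # default mode 1
print(should_convert_to_nz("bf16"))               # default mode 1
print(should_convert_to_nz("bf16", enable_nz=2))  # "as long as possible" mode
```

With the default mode this reproduces the user-facing change stated in the commit message: quantized weights are still converted, while fp16/bf16 weights are left as they are unless mode 2 is selected.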
hukongyi	ea8f544ce7	[BugFix]Fix precision issue for LoRA feature (#4141 ) vLLM version: v0.11.0 vLLM main: vllm-project/vllm ### What this PR does / why we need it? Fix the precision issue of the LoRA feature in vllm-ascend. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ```bash pytest tests/lora/test_llama_tp.py::test_llama_lora -s ``` (screenshot: lora_test) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: hukongyi <hukongyi@cmbchina.com>	2025-12-19 14:22:06 +08:00
weichen	ca6f631cba	[2/N][Pangu][MoE] Remove Pangu Related Code (#5130 ) ### What this PR does / why we need it? Remove Pangu Related Code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weichen <calvin_zhu0210@outlook.com>	2025-12-19 09:00:07 +08:00
Chen Chen	1b47fca0e8	[bugfix] Use FUSED_MC2 MoE comm path for the op `dispatch_ffn_combine` (#5156 ) ### What this PR does / why we need it? - Renames the MoE comm enum value `MoECommType.FUSED_ALLTOALL` to `MoECommType.FUSED_MC2` and updates all call sites. - Updates `select_moe_comm_method` to optionally select `FUSED_MC2` on Ascend A3 when: - `enable_expert_parallel=True` - quantization is `w8a8_dynamic` - `EP <= 16` - `dynamic_eplb` is disabled - `is_mtp_model = False` - Replaces the old “fused all-to-all” comm implementation with `FusedMC2CommImpl`, using `TokenDispatcherWithMC2` / `PrepareAndFinalizeWithMC2` and `dispatch_ffn_combine`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-12-18 23:34:31 +08:00
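The selection conditions listed in #5156 above can be collected into a single predicate. This is a sketch under stated assumptions rather than the body of `select_moe_comm_method`: the `MoEDeployment` dataclass, its field names, and `fused_mc2_eligible` are all hypothetical, and since the commit message says `FUSED_MC2` is selected "optionally", the predicate only expresses eligibility, not the final choice.

```python
from dataclasses import dataclass


@dataclass
class MoEDeployment:
    """Hypothetical container for the settings named in the commit message."""
    is_ascend_a3: bool
    enable_expert_parallel: bool
    quantization: str
    ep_size: int
    dynamic_eplb: bool
    is_mtp_model: bool


def fused_mc2_eligible(d: MoEDeployment) -> bool:
    """Return True when all listed conditions for FUSED_MC2 hold."""
    return (
        d.is_ascend_a3
        and d.enable_expert_parallel
        and d.quantization == "w8a8_dynamic"
        and d.ep_size <= 16
        and not d.dynamic_eplb
        and not d.is_mtp_model
    )


candidate = MoEDeployment(
    is_ascend_a3=True,
    enable_expert_parallel=True,
    quantization="w8a8_dynamic",
    ep_size=16,
    dynamic_eplb=False,
    is_mtp_model=False,
)
print(fused_mc2_eligible(candidate))  # prints True
```

Keeping the conditions in one conjunction makes it easy to see that any single violation, for example an EP size of 32 or an MTP model, falls back to the other communication paths.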
zhaomingyu13	73e4b4f496	[BugFix] Fix top_p,top_k issue with EAGLE and add top_p,top_k in EAGLE e2e (#5131 ) ### What this PR does / why we need it? Add top_p,top_k in EAGLE e2e - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2025-12-18 23:07:14 +08:00
Zetong Li	2304218f90	[Bugfix] Fix in_profile_run in mtp_proposer dummy_run (#5165 ) ### What this PR does / why we need it? This PR aims to fix failure of `enable_force_load_balance` caused by missing `in_profile_run` in `dummy_run` of mtp_proposer. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-12-18 22:27:47 +08:00
Angazenn	632eab28b7	[BugFix]Fix incorrect get_current_vllm_config (#5121 ) ### What this PR does / why we need it? This PR fixes several incorrect calls to `get_current_vllm_config` that created an empty `vllm_config` instead of returning the current one. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-18 22:21:36 +08:00
shaopeng-666	fd9a47c04d	fix vl pd smoke error (#5103 ) ### What this PR does / why we need it? Fix the Mooncake PD smoke test error for VL models. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-12-18 22:20:45 +08:00
Yizhou	ff3914e31a	[Fix] Refines decode mode padding condition for uniform queries (#5164 ) ### What this PR does / why we need it? We cannot use `self.cudagraph_batch_sizes[-1]` because it is not the maximum number of tokens to pad to in `FULL_DECODE_ONLY` mode; it is much larger. It is trimmed into `compilation_cases` only just before capture, which caused us a lot of trouble. Updates the logic to ensure padding occurs only when the number of input tokens falls within a valid uniform decode query range, improving consistency and avoiding unnecessary padding in specific decode modes. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-18 21:09:23 +08:00
Angazenn	acc3578f58	[Graph][Fusion]Add new pattern for AddRmsnormQuant with SP. (#5077 ) ### What this PR does / why we need it? 1. In addition to [#4168](https://github.com/vllm-project/vllm-ascend/pull/4168) and [#5011](https://github.com/vllm-project/vllm-ascend/pull/5011), this PR adds two more patterns for AddRmsnormQuant with SP enabled. The key difference is an additional `maybe_all_gather_and_maybe_unpad` inserted between `addrmsnorm` and `quantize`. 2. This PR also introduces another API, `torch.ops.vllm.quantize`, so that we pass `input_scale` and `input_scale_reciprocal` at the same time. This is because `npu_add_rms_norm_quant` and `npu_quantize` require different `div_mode` values. To avoid introducing an additional reciprocal calculation at runtime, we have to pass both of them to the quantize API. 3. Removes the redundant `AscendQuantRmsnorm`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-18 20:25:44 +08:00
zzhxxx	a74a1196c5	[Feat] Support MLP_TP feature, exclude MOE layer (#4999 ) #4257 This PR implements dense_ffn TP for the first three layers of the DeepSeek model. I have refactored this PR so that very little code is needed to support the feature. This PR adds a function `is_moe_layer` to mlp_tp, which supports MLP TP in models with both MLP and MoE layers, such as DeepSeek or ChatGLM. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: 子潜 <ziqian@U-DMKXH32D-2015.local> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-18 20:06:53 +08:00
yuxingcyx	5a88e3333b	feat: implement high-performance Triton kernels for rejection sampling (#4830 ) ### What this PR does / why we need it? This PR introduces optimized Triton implementations for the rejection_greedy_sample_kernel and expand_kernel, delivering superior performance compared to the existing Triton implementations. The new Triton kernels maintain full functional accuracy while delivering significant performance improvements across various batch sizes and MTP configurations. ### Does this PR introduce _any_ user-facing change? Yes, this PR modifies rejection_sampler.py to use optimized Triton kernels: - rejection_greedy_sample_kernel is enhanced with rejection_greedy_sample_spec_len_1_triton and rejection_greedy_sample_triton implementations - expand_kernel receives a performance-optimized Triton version These changes provide substantial performance improvements while maintaining backward compatibility - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: yuxingcyx <yuxingchen.math@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-18 19:42:10 +08:00
LICO67373	9fcaf66646	fix: use batch_matmul_transpose operator in MLA _v_up_proj for better performance (#5142 ) ### What this PR does / why we need it? This PR fixes a bug in the `AscendMLAImpl._v_up_proj` method where the optimized `batch_matmul_transpose` operator was not being utilized. Changes: - Modified `_v_up_proj` method to use `torch.ops._C_ascend.batch_matmul_transpose` operator for FP16/BF16 dtypes when available - Added fallback path using the original `torch.bmm` implementation for other cases - This avoids unnecessary transpose operations and improves performance Why needed: - The previous implementation only used `torch.bmm` with multiple transpose operations, which is less efficient - The Ascend backend provides an optimized `batch_matmul_transpose` operator that can handle the computation more efficiently - This fix improves inference performance for MLA (Multi-head Latent Attention) models on Ascend NPU ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization that maintains the same functionality and output. Users will experience faster inference for MLA-based models, but no API or interface changes are introduced. The changes maintain backward compatibility with the fallback path, ensuring correct behavior when the operator is not available or for unsupported dtypes. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: hwhaokun <haokun0405@163.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-18 16:48:55 +08:00
Ronald	b69b04d3a9	implement model runner v2 basic framework (#5051 ) ### What this PR does / why we need it? This PR aims to implement the basic framework of model runner v2 in vllm-ascend; end-to-end functionality is not guaranteed by this PR. ### Does this PR introduce _any_ user-facing change? Use envs.VLLM_USE_V2_MODEL_RUNNER to decide whether to choose model_runner_v2. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-18 15:51:54 +08:00
lidenghui1110	1c8c23de58	[Bugfix] fix pipeline parallelism bug introduced by async-scheduling refactor work (#4973 ) ### What this PR does / why we need it? Currently, when using pipeline parallelism with PD disaggregation, model_runner returns None on non-last-pp-rank stages in `sample_tokens`, which causes an assertion error in the vllm KVOutputAggregator on [this line](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/utils.py#L84). In fact, all pp workers should return a model_runner_output that contains kv_connector_output, so that the EngineCore scheduler process can aggregate them and ensure all kv transfers are finished before the kv cache is released later. To fix this issue, this PR follows gpu_model_runner in vllm and passes kv_connector_output in `sample_tokens` so that all ranks return a ModelRunnerOutput; non-last-pp-rank workers return EMPTY_MODEL_RUNNER_OUTPUT with kv_connector_output. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2025-12-18 15:27:55 +08:00
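
The return contract described in the last entry above (#4973) can be illustrated with a minimal sketch. This is not the code from the PR: the two dataclasses are hypothetical stand-ins for the real vLLM types, and the `sample_tokens` signature here is an assumption made for illustration only. It shows one way every pipeline-parallel rank can return an output object that carries `kv_connector_output`, instead of `None`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class KVConnectorOutput:
    """Hypothetical stand-in: request ids whose KV transfer has finished."""
    finished_sending: frozenset = frozenset()
    finished_recving: frozenset = frozenset()


@dataclass(frozen=True)
class ModelRunnerOutput:
    """Hypothetical stand-in for the runner output seen by the scheduler."""
    sampled_token_ids: tuple = ()
    kv_connector_output: Optional[KVConnectorOutput] = None


# Shared empty output; it is copied, never mutated.
EMPTY_MODEL_RUNNER_OUTPUT = ModelRunnerOutput()


def sample_tokens(is_last_pp_rank, kv_connector_output, sampled=()):
    """Return an output on every rank so KV-transfer status can be aggregated."""
    if not is_last_pp_rank:
        # Non-last ranks have no sampled tokens, but they still report
        # their KV-transfer status rather than returning None.
        return replace(EMPTY_MODEL_RUNNER_OUTPUT,
                       kv_connector_output=kv_connector_output)
    return ModelRunnerOutput(sampled_token_ids=tuple(sampled),
                             kv_connector_output=kv_connector_output)
```

In this sketch the shared empty output is copied with `dataclasses.replace`, so attaching per-step transfer status on one rank cannot leak into later steps.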

1665 Commits