Commit Graph

1741 Commits

Author SHA1 Message Date
Li Wang
a6eaf816f1 [Image] Refactor image build (#5175)
### What this PR does / why we need it?

Previously, we used a hybrid-architecture cross-compilation approach for
image building. This method had a problem: cross-compilation performance
was very poor, leading to extremely long build times (about 4 hours) and
even occasional failures (see
https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186).
Therefore, I recommend building each architecture separately and then
merging the manifests, which significantly reduces image build time (about 20 minutes).

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 14:35:51 +08:00
zzzzwwjj
cc23067f1e [refactor] refactor weight trans nz and transpose (#4878)
### What this PR does / why we need it?

Now `VLLM_ASCEND_ENABLE_NZ` has three options:
0: disable NZ;
1: enable NZ only in the quantized case;
2: enable NZ whenever possible.

`VLLM_ASCEND_ENABLE_NZ` is 1 by default.

All cases are shown in the table below:

|  | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | can't support nz | trans nz by default | trans nz by default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz |
| transpose | only support not transpose case | only support transpose case | only support transpose case | linear: only support not transpose case<br>gmm: only support transpose case | same as fp16/bf16 |
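
A minimal sketch of the default rule (the helper name is hypothetical; only the `VLLM_ASCEND_ENABLE_NZ` values come from this PR):

```python
import os

# 0: never convert weights to NZ
# 1 (default): convert only quantized weights (e.g. W4A8/W8A8)
# 2: convert whenever the weight format supports NZ
ENABLE_NZ = int(os.getenv("VLLM_ASCEND_ENABLE_NZ", "1"))

def should_trans_nz(is_quantized_weight: bool) -> bool:
    if ENABLE_NZ == 0:
        return False
    if ENABLE_NZ == 1:
        return is_quantized_weight
    return True
```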

Some exceptional cases:
1. The MLAPO op needs to do some additional processing on the weights,
including the NZ transform; if the MLAPO op is used, some weights are
forcibly transformed to NZ.
2. The MLA/SFA weight `W_UV` is used by the op
`torch.ops._C_ascend.batch_matmul_transpose`, which currently does not
support NZ.

### Does this PR introduce _any_ user-facing change?
fp16/bf16 weights are no longer transformed to NZ by default.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-19 14:27:24 +08:00
hukongyi
ea8f544ce7 [BugFix]Fix precision issue for LoRA feature (#4141)

### What this PR does / why we need it?
Fix the precision issue of the LoRA feature in vllm-ascend.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
pytest tests/lora/test_llama_tp.py::test_llama_lora -s
```
<img width="1319" height="879" alt="lora_test"
src="https://github.com/user-attachments/assets/2a0b2325-5b05-4bbc-ac03-a7c9f0ad9d4c"
/>


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: hukongyi <hukongyi@cmbchina.com>
2025-12-19 14:22:06 +08:00
1092626063
f952de93df 【Doc】Deepseekv3.1/R1 doc enhancement (#4827)
### What this PR does / why we need it?

DeepSeek V3.1 and DeepSeek R1 doc enhancement.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-19 10:52:33 +08:00
LookAround0301
76e58d66be support basic long_seq feature st (#5140)
### What this PR does / why we need it?
Add a basic ST for the long_seq feature.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-12-19 10:50:01 +08:00
zhangxinyuehfad
cee9b715b5 [Bugfix] install triton for test_custom_op (#5112)
### What this PR does / why we need it?
1. Install triton for test_custom_op.
2. The tests/e2e/nightly/ops test times out; set timeout-minutes so it can
run to completion:
https://github.com/vllm-project/vllm-ascend/actions/runs/20326482497/job/58392757707?pr=5112
3. Ignore test_dispatch_ffn_combine until it is fixed @kiscad

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-19 10:40:46 +08:00
weichen
ca6f631cba [2/N][Pangu][MoE] Remove Pangu Related Code (#5130)
### What this PR does / why we need it?
Remove Pangu Related Code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-19 09:00:07 +08:00
Chen Chen
1b47fca0e8 [bugfix] Use FUSED_MC2 MoE comm path for the op dispatch_ffn_combine (#5156)
### What this PR does / why we need it?

- Renames the MoE comm enum value `MoECommType.FUSED_ALLTOALL` to
`MoECommType.FUSED_MC2` and updates all call sites.
- Updates `select_moe_comm_method` to optionally select `FUSED_MC2` on
Ascend A3 when (see the sketch below):
  - `enable_expert_parallel=True`
  - quantization is `w8a8_dynamic`
  - `EP <= 16`
  - `dynamic_eplb` is disabled
  - `is_mtp_model = False`
- Replaces the old "fused all-to-all" comm implementation with
`FusedMC2CommImpl`, using `TokenDispatcherWithMC2` /
`PrepareAndFinalizeWithMC2` and `dispatch_ffn_combine`.
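
A minimal sketch of that selection rule (function and argument names are assumptions; only the conditions come from this PR):

```python
def use_fused_mc2(is_a3: bool, enable_expert_parallel: bool, quant_type: str,
                  ep_size: int, dynamic_eplb: bool, is_mtp_model: bool) -> bool:
    # FUSED_MC2 is only picked on Ascend A3 and only when all of the
    # conditions listed above hold.
    return (is_a3
            and enable_expert_parallel
            and quant_type == "w8a8_dynamic"
            and ep_size <= 16
            and not dynamic_eplb
            and not is_mtp_model)
```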

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
2025-12-18 23:34:31 +08:00
zhaomingyu13
73e4b4f496 [BugFix] Fix top_p,top_k issue with EAGLE and add top_p,top_k in EAGLE e2e (#5131)
### What this PR does / why we need it?
Add top_p,top_k in EAGLE e2e

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2025-12-18 23:07:14 +08:00
zxr2333
073a3a6e6c [Doc][P/D] Fix MooncakeConnector's name (#5172)
### What this PR does / why we need it?
The vLLM community has integrated its own MooncakeConnector. The original
scripts will now find that MooncakeConnector instead of the one from
vLLM-Ascend, so all scripts that use the MooncakeConnector need to be
updated to use a different name.

### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector.

### How was this patch tested?
By CI.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-12-18 22:29:19 +08:00
Zetong Li
2304218f90 [Bugfix] Fix in_profile_run in mtp_proposer dummy_run (#5165)
### What this PR does / why we need it?
This PR fixes a failure of `enable_force_load_balance` caused by the
missing `in_profile_run` handling in the mtp_proposer's `dummy_run`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-12-18 22:27:47 +08:00
Li Wang
7d32371b7e [Doc] Refactor benchmark doc (#5173)
### What this PR does / why we need it?
Refactor some outdated doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-18 22:26:13 +08:00
ZT-AIA
6cb76ecd02 [Nightly] Avoid max_model_len being smaller than the decoder prompt to prevent single-node-accuray-tests from failing (#5174)
### What this PR does / why we need it?
[Nightly] Avoid max_model_len being smaller than the decoder prompt to
prevent single-node-accuray-tests from failing
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2025-12-18 22:25:45 +08:00
Angazenn
632eab28b7 [BugFix]Fix incorrect get_current_vllm_config (#5121)
### What this PR does / why we need it?
This PR fixes some incorrect `get_current_vllm_config` calls, which
created an empty vllm_config instead.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-12-18 22:21:36 +08:00
shaopeng-666
fd9a47c04d fix vl pd smoke error (#5103)
### What this PR does / why we need it?
Fix the VL model Mooncake PD smoke test error.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-12-18 22:20:45 +08:00
Yizhou
ff3914e31a [Fix] Refines decode mode padding condition for uniform queries (#5164)
### What this PR does / why we need it?
The reason we cannot use `self.cudagraph_batch_sizes[-1]` is that it is
not actually the maximum number of tokens to pad to in `FULL_DECODE_ONLY`
mode; it can be much larger. It is only trimmed down to
`compilation_cases` right before capturing, which has caused us a lot of
trouble.

This PR updates the logic so that padding occurs only when the number of
input tokens falls within a valid uniform decode query range, improving
consistency and avoiding unnecessary padding in specific decode modes.
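
A hedged sketch of the refined condition (all names are illustrative; the uniform-decode check follows vLLM's usual definition):

```python
def should_pad_full_decode(num_input_tokens: int, num_reqs: int,
                           uniform_query_len: int, max_capture_size: int) -> bool:
    # Uniform decode: every request contributes the same (small) number of
    # query tokens, e.g. 1 + num_speculative_tokens.
    is_uniform_decode = num_input_tokens == num_reqs * uniform_query_len
    # Pad only for uniform decode batches that fit a captured graph size.
    return is_uniform_decode and num_input_tokens <= max_capture_size
```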

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-18 21:09:23 +08:00
Angazenn
acc3578f58 [Graph][Fusion]Add new pattern for AddRmsnormQuant with SP. (#5077)
### What this PR does / why we need it?
1. In addition to
[#4168](https://github.com/vllm-project/vllm-ascend/pull/4168) and
[#5011](https://github.com/vllm-project/vllm-ascend/pull/5011), this PR
adds two more patterns for AddRmsnormQuant with SP enabled. The key
difference is that an additional `maybe_all_gather_and_maybe_unpad` is
inserted between `addrmsnorm` and `quantize`.
2. This PR also introduces another API, `torch.ops.vllm.quantize`, so that
we can pass `input_scale` and `input_scale_reciprocal` at the same time.
This is needed because `npu_add_rms_norm_quant` and `npu_quantize` require
different `div_mode` values; to avoid an additional reciprocal calculation
at runtime, we pass both to the quantize API.
3. Removes the redundant `AscendQuantRmsnorm`.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-12-18 20:25:44 +08:00
zzhxxx
a74a1196c5 [Feat] Support MLP_TP feature, exclude MOE layer (#4999)
#4257 This PR implements dense FFN TP for the first three layers of the
DeepSeek model. I have refactored this PR and used very little code to
support this feature. It adds an `is_moe_layer` function to mlp_tp, which
supports MLP TP in models that contain both MLP and MoE layers, such as
DeepSeek or ChatGLM (a sketch of the helper is shown below).
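
A minimal sketch of such a helper, assuming a DeepSeek-style config where `first_k_dense_replace` gives the number of leading dense layers (the attribute name is an assumption here):

```python
def is_moe_layer(layer_idx: int, first_k_dense_replace: int = 3) -> bool:
    """Leading dense-FFN layers use plain MLP TP; later layers are MoE."""
    return layer_idx >= first_k_dense_replace
```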


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: 子潜 <ziqian@U-DMKXH32D-2015.local>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-18 20:06:53 +08:00
yuxingcyx
5a88e3333b feat: implement high-performance Triton kernels for rejection sampling (#4830)
### What this PR does / why we need it?
This PR introduces optimized Triton implementations for the
rejection_greedy_sample_kernel and expand_kernel, delivering superior
performance compared to the existing Triton implementations. The new
Triton kernels maintain full functional accuracy while delivering
significant performance improvements across various batch sizes and MTP
configurations.

### Does this PR introduce _any_ user-facing change?
Yes, this PR modifies rejection_sampler.py to use optimized Triton
kernels:

- rejection_greedy_sample_kernel is enhanced with
rejection_greedy_sample_spec_len_1_triton and
rejection_greedy_sample_triton implementations

- expand_kernel receives a performance-optimized Triton version

These changes provide substantial performance improvements while
maintaining backward compatibility
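
For reference, a minimal Python sketch of the greedy rejection rule these kernels implement (plain tensor logic, not the Triton kernel itself; shapes and names are assumed):

```python
import torch

def rejection_greedy_sample_ref(draft_tokens: torch.Tensor,
                                target_argmax: torch.Tensor,
                                bonus_token: int) -> list[int]:
    """Accept draft tokens while they match the target model's greedy choice."""
    out = []
    for pos in range(draft_tokens.numel()):
        out.append(int(target_argmax[pos]))
        if int(draft_tokens[pos]) != int(target_argmax[pos]):
            # First mismatch: emit the target's token and reject the rest.
            return out
    # Every draft token was accepted, so the bonus token is emitted as well.
    out.append(bonus_token)
    return out
```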


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: yuxingcyx <yuxingchen.math@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 19:42:10 +08:00
wangxiyuan
0f571c347b Nominate new maintainers @zzzzwwjj @realliujiaxu @LCAIZJ (#5152)
I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend
committer team.

@zzzzwwjj
---
- Review Quality:
He has completed 80+ reviews since April 2025, including
https://github.com/vllm-project/vllm-ascend/pull/3232#issuecomment-3506110786,
https://github.com/vllm-project/vllm-ascend/pull/4822#discussion_r2601661204, and
https://github.com/vllm-project/vllm-ascend/pull/4768#issuecomment-3644795995,
all high-quality reviews.

- Sustained Contributions:
15+ valuable bug fixes and refactors:
https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved
Continuous optimization of the code architecture:
https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged

- Quality Contribution:
https://github.com/vllm-project/vllm-ascend/pull/1229
https://github.com/vllm-project/vllm-ascend/pull/1979
https://github.com/vllm-project/vllm-ascend/pull/4359
https://github.com/vllm-project/vllm-ascend/pull/4878

- Community Involvement:
He led https://github.com/vllm-project/vllm-ascend/issues/1147, the first
refactor of AscendFusedMoE.
He shared topics about large-scale distributed inference and reinforcement
learning at the vLLM Ascend meetup on August 2nd.

@realliujiaxu
---
- Review Quality:
He has completed about [40+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+)
since September, including
https://github.com/vllm-project/vllm-ascend/pull/4868#discussion_r2605549015 and
https://github.com/vllm-project/vllm-ascend/pull/2275#discussion_r2268455665.

- Sustained Contributions:
He has completed [17
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged),
continuously optimizing the performance of the MoE model.

- Quality Contribution:

Contributed the Flash Comm1 feature to the community, supporting both
eager and aclgraph execution modes, while staying compatible with multiple
MoE models including DeepSeek and GLM4.5.
  - https://github.com/vllm-project/vllm-ascend/pull/3334
  - https://github.com/vllm-project/vllm-ascend/pull/3420
  - https://github.com/vllm-project/vllm-ascend/pull/3015
  
  co-author:
  - https://github.com/vllm-project/vllm-ascend/pull/3495
  - https://github.com/vllm-project/vllm-ascend/pull/4868

- Community Involvement:
1. Completed two major refactors, enabling vllm-ascend to evolve more
rapidly and robustly: the [Linear
module](https://github.com/vllm-project/vllm-ascend/pull/2867) and the
[rejection sampler](https://github.com/vllm-project/vllm-ascend/pull/4975).
2. [Fixed 8
bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+)
in graph mode, spec decoding, and async scheduling.

@LCAIZJ
---
- Review Quality: He's been the go-to reviewer for virtually all PD
disaggregation and KV Pool related PRs, having completed [30+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+)
since May 2025. Notable examples include
[discussion_r2553887360](https://github.com/vllm-project/vllm-ascend/pull/4345#discussion_r2553887360),
[issuecomment-3540994801](https://github.com/vllm-project/vllm-ascend/pull/4161#issuecomment-3540994801),
and
[discussion_r2492593988](https://github.com/vllm-project/vllm-ascend/pull/3981#discussion_r2492593988),
all demonstrating thorough and insightful feedback.
- Sustained and Quality Contributions: His contributions reflect a
strong grasp of both vLLM and vLLM Ascend codebases, particularly in
prefill-decode disaggregation and KV pool areas ([7 PRs
merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)).
  - Prefill-Decode Disaggregation: delivered KV transfer functionality using
    Mooncake TransferEngine and enabled layerwise KV transfer:
    https://github.com/vllm-project/vllm-ascend/pull/1568
    https://github.com/vllm-project/vllm-ascend/pull/2602
  - KV Pool: developed the foundational KV Pool infrastructure and migrated
    it to the latest ADXL stack:
    https://github.com/vllm-project/vllm-ascend/pull/2913
    https://github.com/vllm-project/vllm-ascend/pull/3350
- Quality Contribution:
https://github.com/vllm-project/vllm-ascend/pull/1568
https://github.com/vllm-project/vllm-ascend/pull/2602
https://github.com/vllm-project/vllm-ascend/pull/2913
https://github.com/vllm-project/vllm-ascend/pull/3350
- Community Involvement:
He actively responds to [community
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ),
continuously monitors functionality and accuracy issues related to PD
disaggregation and KV Pool, and proactively delivers [bug
fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix).
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 18:49:07 +08:00
LICO67373
9fcaf66646 fix: use batch_matmul_transpose operator in MLA _v_up_proj for better performance (#5142)
### What this PR does / why we need it?

This PR fixes a bug in the `AscendMLAImpl._v_up_proj` method where the
optimized `batch_matmul_transpose` operator was not being utilized.

**Changes:**
- Modified `_v_up_proj` method to use
`torch.ops._C_ascend.batch_matmul_transpose` operator for FP16/BF16
dtypes when available
- Added fallback path using the original `torch.bmm` implementation for
other cases
- This avoids unnecessary transpose operations and improves performance

**Why needed:**
- The previous implementation only used `torch.bmm` with multiple
transpose operations, which is less efficient
- The Ascend backend provides an optimized `batch_matmul_transpose`
operator that can handle the computation more efficiently
- This fix improves inference performance for MLA (Multi-head Latent
Attention) models on Ascend NPU
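
A minimal sketch of the dispatch described above, assuming MLA-style shapes; the fused operator's exact signature is an assumption, and the fallback mirrors the plain `torch.bmm` path:

```python
import torch

def v_up_proj_sketch(x: torch.Tensor, w_uv: torch.Tensor) -> torch.Tensor:
    # x:    [num_tokens, num_heads, kv_lora_rank]
    # w_uv: [num_heads, kv_lora_rank, v_head_dim]
    fused = getattr(torch.ops._C_ascend, "batch_matmul_transpose", None)
    if fused is not None and x.dtype in (torch.float16, torch.bfloat16):
        return fused(x, w_uv)  # optimized path (signature assumed)
    # Fallback: batched matmul plus explicit transposes (original path).
    out = torch.bmm(x.transpose(0, 1), w_uv)   # [num_heads, num_tokens, v_head_dim]
    return out.transpose(0, 1).reshape(x.shape[0], -1)
```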

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization that maintains the same
functionality and output. Users will experience faster inference for
MLA-based models, but no API or interface changes are introduced.

The changes maintain backward compatibility with the fallback path,
ensuring correct behavior when the operator is not available or for
unsupported dtypes.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-18 16:48:55 +08:00
Ronald
b69b04d3a9 implement model runner v2 basic framework (#5051)
### What this PR does / why we need it?
This PR aims to implement the basic model runner v2 framework in
vllm-ascend; end-to-end functionality is not guaranteed by this PR.
 
### Does this PR introduce _any_ user-facing change?
Use envs.VLLM_USE_V2_MODEL_RUNNER to decide whether to use model_runner_v2.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-12-18 15:51:54 +08:00
lidenghui1110
1c8c23de58 [Bugfix] fix pipeline parallelism bug introduced by async-scheduling refactor work (#4973)
### What this PR does / why we need it?
Currently, when using pipeline parallelism and PD disaggregation, the
model_runner returns None on non-last-pp-rank stages in `sample_tokens`,
which causes an assert error in vLLM's KVOutputAggregator on [this
line](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/utils.py#L84).

In fact, all PP workers should return a model_runner_output that contains
kv_connector_output, so that the aggregation in the EngineCore scheduler
process can ensure all KV transfers have finished before the KV cache is
released.

To fix this issue, this PR follows vLLM's gpu_model_runner and passes
kv_connector_output in `sample_tokens`, so that all ranks return a
ModelRunnerOutput; non-last-pp-rank workers return
EMPTY_MODEL_RUNNER_OUTPUT with kv_connector_output attached (see the
sketch below).
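
A self-contained sketch of that control flow, using simplified stand-ins for vLLM's ModelRunnerOutput and EMPTY_MODEL_RUNNER_OUTPUT:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ModelRunnerOutputSketch:            # stand-in for vLLM's ModelRunnerOutput
    sampled_token_ids: list = field(default_factory=list)
    kv_connector_output: Optional[Any] = None

def sample_tokens(is_last_pp_rank: bool, sampled: list,
                  kv_connector_output: Any) -> ModelRunnerOutputSketch:
    if not is_last_pp_rank:
        # Previously this returned None; now every PP rank returns an output
        # carrying kv_connector_output so the KVOutputAggregator in the
        # EngineCore scheduler can confirm all KV transfers finished.
        return ModelRunnerOutputSketch(kv_connector_output=kv_connector_output)
    return ModelRunnerOutputSketch(sampled_token_ids=sampled,
                                   kv_connector_output=kv_connector_output)
```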

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2025-12-18 15:27:55 +08:00
ming1212
9268ad11e3 Qwen3-Next:Update the gpu-memory-utilization parameter to 0.7 (#5129)
### What this PR does / why we need it?
Update the gpu-memory-utilization parameter to 0.7

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-18 15:16:33 +08:00
AlvisGong
ef8157a5f2 fixed fused alltoall execute all reduce (#5109)
### What this PR does / why we need it?
Fix the fused all-to-all path so that the shared-expert output is also
all-reduced when moe_comm_type is MoECommType.FUSED_ALLTOALL:

    if moe_comm_type in {MoECommType.ALLTOALL, MoECommType.MC2,
                         MoECommType.FUSED_ALLTOALL} \
            and not shared_expert_dp_enabled():
        shared_out = tensor_model_parallel_all_reduce(shared_out)


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: AlvisGong <gwly0401@163.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-18 15:07:40 +08:00
Yuzhou Tong
78602eab4f [UT] Add mooncake ut test (#5080)
### What this PR does / why we need it?

Add UT for mooncake

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 15:07:14 +08:00
Clorist33
9045843c90 [UT]Ut for function cumsum_group_list in moe_mlp (ref #5025) (#5036)
### What this PR does / why we need it?
Add a UT for the cumsum_group_list function, which is related to the
precision issues stemming from moe_mlp.py.
The related PR is https://github.com/vllm-project/vllm-ascend/pull/5025

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: tanqingshan (A)  <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-18 15:00:16 +08:00
Yizhou
543f122101 [Fix] Fix DeepSeek V3.2 "no attr" error (#5147)
### What this PR does / why we need it?
Extracts repeated `attn_metadata[layer_name].decode` access into a
single variable to improve code readability and reduce redundancy.

Uses `getattr` with a default value to safely access the decode
attribute, making the code more defensive against potential attribute
errors.
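
A minimal sketch of that defensive access (only the `attn_metadata[layer_name].decode` access comes from this PR; the helper name is illustrative):

```python
def get_decode_metadata(attn_metadata: dict, layer_name: str):
    """Fetch the per-layer decode metadata once, defaulting to None."""
    # getattr with a default avoids AttributeError when this layer's
    # metadata object has no `decode` attribute.
    return getattr(attn_metadata[layer_name], "decode", None)
```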

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-18 14:46:41 +08:00
yuxinshan
b0376abd4c [feat] proxy support elastic scaling (#5063)
**[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool:**
https://github.com/vllm-project/vllm-ascend/issues/3380

### What this PR does / why we need it?
Support elastic scaling for P/D instances based on Mooncake connector
deployment.

**Support API routes**
* `/instances/add`: add prefill nodes or decode nodes to the list.
* `/instances/remove`: remove prefill nodes or decode nodes from the
list.

**Support functions**
* Support **adding** prefill nodes or decode nodes.
  - If a prefill or decode server is deployed **after the proxy is deployed**,
it can use the `/instances/add` API to join the proxy. The prefill or
decode server sends a signal to the proxy server, and the proxy checks the
node's status until the node is available.
* Support **removing** prefill nodes or decode nodes:
  - Support using the `/instances/remove` API to **delete the node** from
the proxy server.

### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`:

**Add 2 params**

When adding nodes to the proxy, the proxy will wait for the nodes to
start, retrying up to a configured number of times.

| name | type | default | help |
| ----- | ---- | ---- | ---- |
| max-waiting-retries | int | 3 | Maximum number of retries when waiting for nodes to start |
| waiting-retry-interval | float | 10 | Check interval (seconds) when waiting for nodes to start |

For example:
```shell
python load_balance_proxy_server_example.py \
  --host 0.0.0.0 --port 9000 \
  --prefiller-hosts 127.0.0.1 127.0.0.1 \
  --prefiller-ports 8100 8101 \
  --decoder-hosts 127.0.0.1 127.0.0.1 \
  --decoder-ports 8200 8201 \
  --max-waiting-retries 3 \
  --waiting-retry-interval 10
```
**Add 2 API routings**

* Add instances: `instances/add`

For example, add 2 prefiller instances:
```shell
curl -X POST http://localhost:9000/instances/add \
  -H "Content-Type: application/json" \
  -d '{
        "type": "prefill",
        "instances": ["127.0.0.1:8102", "127.0.0.1:8103"]
      }'
```
Response:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
If the node '127.0.0.1:8103' has not been started:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* Remove instances: `instances/remove`

For example, remove 1 decoder instance:
```shell
curl -X POST http://localhost:9000/instances/remove \
  -H "Content-Type: application/json" \
  -d '{
        "type": "decode",
        "instances": "127.0.0.1:8201"
      }'
```
Response:
```shell
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
Run the proxy and use the `/instances/add` API to add nodes and the
`/instances/remove` API to remove nodes.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
2025-12-18 14:29:53 +08:00
ck-hw-1018
71e544e259 [test] add w4a8 accuracy case (#5110)
### What this PR does / why we need it?

This PR adds a W4A8 accuracy test case to the e2e tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running the test

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: cuikai (C) <c00827167@china.huawei.com>
Co-authored-by: cuikai (C) <c00827167@china.huawei.com>
2025-12-18 14:10:14 +08:00
ZT-AIA
39fb9e7c83 qwen3_next add triton ops : fused_qkvzba_split_reshape (#4788)
### What this PR does / why we need it?
add triton ops fused_qkvzba_split_reshape_cat for qwen3_next
GatedDeltaNet
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT 
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2025-12-18 11:31:04 +08:00
zhangsicheng5
07014e2101 [UT] Add model_runner pcp related UTs (#4951)
1. Add some uts for pcp related functions in NPUModelRunner
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-18 10:54:57 +08:00
TingW09
879ec2d1c4 [Doc] add qwen3 reranker (#5086)
### What this PR does / why we need it?
add qwen3 reranker tutorials
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0

---------

Signed-off-by: TingW09 <944713709@qq.com>
2025-12-18 10:54:07 +08:00
panchao-hub
8069442b41 enable npugraph_ex (#5120)
### What this PR does / why we need it?
We will expose the enabling switch for npugraph_ex to better facilitate
subsequent optimization.

### Does this PR introduce _any_ user-facing change?
Previously, the enable_npugraph_ex switch would trigger an error; now we
have removed the error reporting mechanism to better facilitate
subsequent optimization efforts.
Basic functionalities are available in CANN and torch_npu for Q3, while
advanced optimizations will depend on the Q4 release.

### How was this patch tested?
llm = LLM(
    model=model,
    enforce_eager=False,
    additional_config={
        "enable_npugraph_ex": True,
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [16],
    },
)


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-18 09:08:40 +08:00
shaopeng-666
39bdd4cfaa fix profile run for vl model (#5136)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-12-17 23:51:31 +08:00
Yizhou
43d974c6f7 [Fix] Synchronize the host query_start_loc with device values to prevent shape mismatches (#5134)
### What this PR does / why we need it?
Synchronize the host query_start_loc with device values to prevent shape
mismatches when async scheduling is not enabled.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-17 23:50:12 +08:00
zhenwenqi2024
950570f8d1 [Bugfix] delete profile_run in model_runner (#5122)
### What this PR does / why we need it?
Delete self.in_profile_run in model_runner so that EPLB works as expected.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-17 23:48:34 +08:00
weijinqian0
98e6e57622 [Refactor] 4/N Distinguish the branches based on the applicable scenarios of PA and FIA Ops. (#5081)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629

Reason:

We distinguish the branches based on the applicable scenarios of
pagedAttention and fusedInferAttention, making the code clearer.

At the same time, this makes the subsequent iterations easier:
sliding_window, sinks, and removing the PA ops once FIA is ready.

Todo:

- Remove PA ops after FIA is ready
- Add sliding window and ops for gpt_oss
- Replace FIA with FIA_v2
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-17 23:14:02 +08:00
Yuzhou Tong
7671ce1bf1 Fix a data conversion bug introduced by commit 3b7eb51 in main#4655 (#5115)
### What this PR does / why we need it?

Fix a data conversion bug introduced by [main#4655](3b7eb5179f).
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-17 20:19:02 +08:00
weichen
7f1e93f185 [Bugfix][MoE] Remove All2All in w4a8_dynamic (#4977)
### What this PR does / why we need it?
GatherEP has been fixed in
https://github.com/vllm-project/vllm-ascend/pull/3279, so remove All2All
in the w4a8_dynamic scenario.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-17 17:39:57 +08:00
dsxsteven
97537709ae [BugFix] Fix mooncake bug in PCP scenario (#5055)
### What this PR does / why we need it?
The mooncake_connector.py file was importing the wrong arguments, which
could cause errors when PCP is used; this issue has been corrected.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: daishixun <dsxsteven@sina.com>
2025-12-17 16:32:16 +08:00
Feng Liu
eda3cabf5b [UT] add pcp&dcp UT for mla_cp (#4953)
### What this PR does / why we need it?
Add UTs for mla_cp, including:
- test_compute_prefill_context_with_dcp_pcp
- test_reorg_kvcache_with_dcp_pcp
- test_out_lse_reshape
- test_npu_attention_update_with_dcp_pcp
- test_attention_with_mask_and_nomask_with_dcp_pcp
- test_process_attn_out_lse_with_dcp_pcp
- test_forward_prefill_cp_with_dcp_pcp

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: F.Liu <liufeng248@huawei.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
2025-12-17 16:19:27 +08:00
JeffLee1874
724d04391e [model] Support PanguUltraMoE (#4615)
### What this PR does / why we need it?
To support PanguUltraMoE model

### Test result
#### Start serving using W8A8 quantized model and ACL graph:
Master node:
```
vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}' \
```
Other nodes:
```
vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --headless \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-start-rank 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}' \
```
Request & Response:

- Request
```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
      {"role": "system", "content": ""},
      {"role": "user", "content": "你是谁?"}
    ],
        "max_tokens": "64",
        "top_p": "0.95",
        "top_k": "50",
        "temperature": "0.6",
        "add_special_tokens" : true
    }'
```
- Response
```
[unused16] 好的,用户问我是谁,我需要按照之前的设定来回答。首先,我的角色是盘古,由华为开发,属于推理模型。要强调我的主要功能是解答问题和提供信息支持,特别是通过逻辑推理和数据分析处理复杂任务。需要保持回答简洁,用中文,并且符合用户的
```


- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

Signed-off-by: lijifu <lijifu4@huawei.com>
Co-authored-by: lijifu <lijifu4@huawei.com>
2025-12-17 16:15:29 +08:00
weichen
f0060fc822 [Pangu][MoE] Remove PanguProMoEV1 related code (#5088)
### What this PR does / why we need it?
PanguProMoEV1 is no longer supported in vllm-ascend, so remove the related
code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-17 16:14:42 +08:00
lilinsiman
3f7a2fba70 [main][doc] Instructions for using permissions added to docker (#5092)
### What this PR does / why we need it?
Instructions for using permissions added to docker

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-17 15:26:09 +08:00
zzzzwwjj
06b82e7503 [main] rename device type (#5099)
### What this PR does / why we need it?
Rename `_910B` to `A2`;
Rename `_910_93` to `A3`;
Rename `_910_95` to `A5`;

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-17 14:08:19 +08:00
wangxiyuan
4144376e88 [CI] Fix UT (#5106)
Fix broken ut introduced by #5053 

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-17 09:52:20 +08:00
weiguihua2
bf97048bce [feat]pd disaggregated support cross-machine (#5008)
### What this PR does / why we need it?
PD disaggregation now supports cross-machine deployment.
We send the primary and secondary node information of the P node to the D node.
When the D node pulls the KV data, it retrieves the corresponding primary
or secondary node information from the mapping.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-17 09:28:03 +08:00
Wang Yixuan
153eeaa621 [Bugfix] Fix DeepSeek FIA error in async_scheduling with mtp (#5046)
### What this PR does / why we need it?
When async_scheduling is enabled in a large-scale EP scenario, the MTP
module goes into eagle mode, which results in a mismatch of seq_lens_list
and block_table. So this PR adapts the check before the draft model
forward pass.

fix #4986 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-17 09:20:44 +08:00
pichangping
06f33540c4 [UT]add the UT of pcp and dcp in the attention_cp file (#5054)
### What this PR does / why we need it?
add the UT of pcp and dcp in the attention_cp file
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: pichangping <1337510399@qq.com>
2025-12-17 09:11:33 +08:00