xc-llm-ascend

Author	SHA1	Message	Date
linfeng-yuan	068ed706c8	[feat][torchair] support super kernel feat for quantized dsr1 (#3485 ) ### What this PR does / why we need it? Port #1916 and #2157 to master branch to fuse operators in deepseek moe layers, which can reduce scheduling overhead on devices. Note that this feature is valid only when `tp_size = 1` and `multistream_overlap_shared_expert` is enabled with torchair graph mode. ### Does this PR introduce _any_ user-facing change? Users can enable this feature with `--additional-config '{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true}, "multistream_overlap_shared_expert":true}'`. ### How was this patch tested? E2E deepseek serving with 2P1D disaggregated prefill scenarios. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-20 20:04:37 +08:00
zhangxinyuehfad	fdac146f71	[UT] fix skip ut test and enable ut test run normally (#3410 ) ### What this PR does / why we need it? fix skip ut test and enable ut test run normally ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-10-20 16:30:57 +08:00
whx	f8b52fe950	[Model][1/N] Delete deepseek v2/v3 modeling codes. (#3189 ) This PR deletes model codes of deepseek_v2 and deepseek_v3 to reuse the model file from vLLM. vLLM Ascend now uses custom ops register way instead of model file hard-coding. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-20 15:31:34 +08:00
shaopeng-666	646c1db5d7	Add mrope op fusion (#3509 ) ### What this PR does / why we need it? Add mrope fusion op for qwen2.5-vl. This mrope operator doesn't support Qwen3-VL currently, so it only takes effect for qwen2.5-vl. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com> Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>	2025-10-18 18:08:24 +08:00
yechao237	4750d45d86	[BugFix]Support redundant experts in EPLB (#3473 ) This PR adds support for redundant experts in the EPLB. Key points: - Use global_num_experts = num_experts + num_redundant_experts consistently. - Backward compatible when num_redundant_experts=0. Tested On a 16-rank setup (W8A8) with static EPLB and expert_map_path, verifying router logits shape and successful requests. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: yechao237 <yechao20180411@gmail.com>	2025-10-18 00:09:16 +08:00
Slightwind	07ca1b9b78	[Refactor] Clean up w4a4_flatquant_dynamic implementation (#3440 ) Cleans up the initial implementation of `w4a4_flatquant_dynamic` for better readability and maintainability. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-10-17 23:53:19 +08:00
anon189Ty	248ee7fa11	[Feat]Make full graph mode compatible with MTP (#3276 ) ### What this PR does / why we need it? Make Full Graph mode work with MTP. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-17 20:19:56 +08:00
zhaozx-cn	bf87606932	[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 ) ### What this PR does / why we need it? shared expert dp for deepseek and deepseek_mtp, could be combined with sp to improve performance. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhaozx-cn <zhaozx2116@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-10-17 15:06:37 +08:00
huangdong2022	3a53bbc508	[Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with bias, resolve conflict with weight prefetch (#3465 ) ### What this PR does / why we need it? 1. Qwen3 MoE uses the add_rms_norm_quant op instead of separate add_rms_norm and quant ops in quantization scenarios. 2. The torch_npu.add_rms_norm_quant op fixes accuracy when model weights are quantized with anti_method m4. m4 quantization is an asymmetric outlier suppression method that generates a non-zero norm bias, so the add_rms_norm_quant op is updated to take this bias as a parameter. 3. Add a torch-npu check. ### Does this PR introduce _any_ user-facing change? The new feature works if the torch_npu version is >= torch_npu-2.7.1.dev20250919. ### How was this patch tested? 1. No special parameters or new envs need to be set. The new feature works if the torch_npu version is >= torch_npu-2.7.1.dev20250919. 2. Tested with Qwen3 MoE quantized models such as Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4). - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: h30027576 <huangdong51@huawei.com>	2025-10-17 09:30:51 +08:00
weichen	cec1fab509	Revert "[MoE] [Refactor] Remove manual memory cleanup (#3365 )" (#3483 ) This reverts commit `4f937f561d`. ### What this PR does / why we need it? This reverts commit `4f937f561d`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-15 22:25:46 +08:00
realliujiaxu	f69a83b7ba	[Feat] Flash comm allgather ep (#3334 ) Support flash comm v1 (Sequence Parallelism) for Allgather EP. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-10-15 19:36:32 +08:00
Mengqing Cao	8abe517870	[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432 ) ### What this PR does / why we need it? Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch. The final goal is to remove all the patches and align the code arch to vllm, thus we need to do the following work in next prs. TODO: - [x] remove patch on attention spec - [ ] refactor the kvcache creation logic ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? 1. CI passed with existing test. 2. Test pass with deepseek-v3.2-exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-15 17:48:58 +08:00
weichen	4f937f561d	[MoE] [Refactor] Remove manual memory cleanup (#3365 ) ### What this PR does / why we need it? 1. Replace manual memory cleanup with passing parameter. 2. FusedMoEPrepareAndFinalizeWithMC2 inherits All2All avoid duplicated code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-15 12:36:24 +08:00
CaranLic	15b2e5c995	Remove unused row_idx in token_dispatcher (#3442 ) ### What this PR does / why we need it? The `row_idx` parameter is no longer used since PR[#2689](https://github.com/vllm-project/vllm-ascend/pull/2689), so remove it across multiple files to remove unnecessary calculations and parameter passing. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? accuracy test passed for Qwen3 235B and DeepSeek V3 671B after this PR. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-10-15 09:08:31 +08:00
zouyida2052	3642b64afc	bugfix for mtp with multistream_moe (#3419 ) ### What this PR does / why we need it? when infer deepseek mtp layer with multistream_moe, we should pass a boolean to evaluate this feature and fix bugs when we are in mtp layer - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-15 08:59:58 +08:00
zxr2333	c2c1db78a7	[Bugfix] fix ZeroDivisionError when prefill_tp_size > num_kv_head and fix tp_resharding README (#3437 ) ### What this PR does / why we need it? Fix a ZeroDivisionError when prefill_tp_size > num_kv_head. In this situation, num_head_replica can be 0 and is then used as a divisor; this PR restricts its minimum value to 1. This PR also fixes the tp_resharding README. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-10-15 08:45:44 +08:00
yuzhup	78777237a9	[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203 ) ### What this PR does / why we need it? - Refactor and integrate a unified `WeightPrefetchMethod` - Integrate `gate_up_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators improves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": true, "prefetch_ratio": { "moe": { "gate_up": 0.8 } } } } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com>	2025-10-14 20:16:33 +08:00
anon189Ty	07e39620ea	[Feat] Unquantized Linear to nz and control all nz-cast (#3356 ) ### What this PR does / why we need it? Currently, when executing to the Linear layer of models in vLLM-Ascend, the weights format is ND in unquantized case and skipped ascend case. This PR supplements the execution logic for Linear layer. We use a new global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and CANN version is 8.3, the weights of the Linear layer will be converted to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as w8a8-quantized case. ### Does this PR introduce _any_ user-facing change? Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ format, you should set VLLM_ASCEND_ENABLE_NZ=1. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-14 17:39:26 +08:00
XiaoxinWang	9eb62935b8	fix pagedattention to support fullgraph. (#3436 ) ### What this PR does / why we need it? Calculate in advance the workspace memory size needed for the PagedAttention operator to avoid deadlocks during resource cleanup. This PR requires torch_npu version 0920 or newer. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-10-14 16:10:09 +08:00
menogrey	657c08cfb2	[UT] fix skipped test_utils ut test. (#3422 ) ### What this PR does / why we need it? Fix the test in `tests/ut/torchair/test_utils.py` and enable the UT test in CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: menogrey <1299267905@qq.com>	2025-10-14 08:31:13 +08:00
Slightwind	4f6d60eb06	[Feature] Add W4A4 Flat Quantization support (#3427 ) Introduce W4A4 Flat Quantization for better model compression and inference efficiency on Ascend devices. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-10-13 23:20:16 +08:00
weijinqian0	6972df5951	[Feature] optimize sp & qwen3 next support sp. (#3225 ) This PR will accomplish the following tasks: optimize SP In the old version implementation, the first layer was all_reduce, which used rms to split chunks. We changed it to perform reduce_scatter on the embedding side, replace one all_reduce operation and one chunk with one reduce_scatter operation. Support qwen3 next Since Qwen3 Next includes a linear attention module, the prefix name of this module cannot take effect directly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-10-13 23:02:12 +08:00
realliujiaxu	31682961af	[Feat] enable hierarchical communication for mc2 ops on A2 (#3015 ) Currently, when in A2, setting the environment variables `HCCL_INTRA_PCIE_ENABLE=1` and `HCCL_INTRA_ROCE_ENABLE=0` can reduce cross-machine communication traffic and significantly improve communication performance. For more details, please refer to [document](https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_moe_distribute_dispatch_v2.md) - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-10-13 16:13:17 +08:00
linfeng-yuan	e4acb2dfc7	[feat] support customized and separated hccl_buffer_size for process group initialization (#3073 ) ### What this PR does / why we need it? Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to run mc2 operators (dispatch and combine) when serving moe models with large `ep_size` and `batch_size`. This environment variable not only affects the VRAM allocated for the mc2 group, but also increases VRAM allocation for the dp, tp & ep groups, leading to significant kvcache and free_memory drops. This PR automatically calculates and sets `hccl_buffer_size` for each process group (except the mc2 group) separately when users set `HCCL_BUFFSIZE` for the mc2 group. This can significantly reduce the buffer_size wasted on the dp, tp & ep groups. Note that current mc2 operators can only perform communication space partitioning based on the `HCCL_BUFFSIZE` configuration. Once they support `hccl_buffer_size` configuration with `pg_options` while initializing the process group, we'll calculate the required buffer size and users will no longer need to set `HCCL_BUFFSIZE` themselves. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We performed E2E serving with deepseek_r1 initializing DP/TP/EP/MC2 process groups and observed a significant kv_cache and free_memory increase. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-11 15:55:22 +08:00
offline893	82b6c846ca	[BugFix]Fix eplb problems when using dynamic eplb. (#3364 ) ### What this PR does / why we need it? When using dynamic EPLB, execution is blocked by NZ tensors. We fix this problem by cloning the src tensor and the recv tensor. ### Does this PR introduce any user-facing change? ### How was this patch tested? Qwen3_moe in A3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-11 14:04:02 +08:00
wangxiaoteng888	ca05f7d632	[Bugfix] TP size larger than KV cache head causes accuracy issues (#3366 ) ### What this PR does / why we need it? Resolve the issue where, in the case of unequal TP (Tensor Parallelism), the TP size is larger than the number of model attention kvcache heads, causing the KV cache to generate duplicates, which leads to transmission errors in the original code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-10-11 11:22:23 +08:00
panchao-hub	1756efa5fd	[Feat][Graph]Support FULL_DECODE_ONLY mode for MLA models (#3125 ) ### What this PR does / why we need it? Adds support for capturing the Multi-head Latent Attention (MLA) decode operation into an ACL graph. This improves performance by compiling the attention kernel for single-token decoding. Key changes include: - Implementing the graph capture logic for the MLA kernel, including workspace management and parameter updates. - Modifying the rotary embedding (RoPE) handling to use pre-allocated tensors, which is a requirement for graph capture. - Adding a `build_for_graph_capture` method to the MLA metadata builder to create dummy metadata during the graph compilation phase. Known issues: - Currently, MTP is not supported in FULL_DECODE_ONLY mode -- we're working on a fix - We are preparing to remove update_mla_attn_params with auto_dispatch_capture ### Does this PR introduce _any_ user-facing change? compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: panchao-hub <315134829@qq.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-10 16:31:20 +08:00
wangxiyuan	ba19dd3183	Revert PTA upgrade PR (#3352 ) we notice that torch npu 0919 doesn't work. This PR revert related change which rely on 0919 version. Revert PR: #3295 #3205 #3102 Related: #3353 - vLLM version: v0.11.0	2025-10-10 14:09:53 +08:00
XiaoxinWang	579b7e5f21	add pagedattention to support FULL_DECODE_ONLY. (#3102 ) ### What this PR does / why we need it? Calculate in advance the workspace memory size needed for the PagedAttention operator to avoid deadlocks during resource cleanup. This PR requires torch_npu version 0920 or newer. ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-10-10 08:50:33 +08:00
offline893	1c2c72af8d	[bugfix]change log2phy map to npu (#3339 ) ### What this PR does / why we need it? Resolved the issue of EPLB failure caused by changes in the log2phy map due to device type modifications when using MTP rotation position encoding. ### Does this PR introduce any user-facing change? ### How was this patch tested? https://github.com/vllm-project/vllm/commit/releases/v0.11.0 - vLLM version: v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-10 08:47:55 +08:00
Ruri	ff37575936	[1/N][Feat] Add weight prefetch feature for Attention layers (#3146 ) ### What this PR does / why we need it? - Refactor and integrate a unified `WeightPrefetchMethod` - Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators improves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": false, "prefetch_ratio": { "attn": { "qkv": 1.0, "o": 1.0 } } } } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com> Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Co-authored-by: yuzhup <15705211260@163.com>	2025-10-09 20:38:39 +08:00
huangdong2022	23db56a340	[Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with norm bias (#3205 ) ### What this PR does / why we need it? 1. Qwen3 MoE uses the add_rms_norm_quant op instead of separate add_rms_norm and quant ops in quantization scenarios. 2. The torch_npu.add_rms_norm_quant op fixes accuracy when model weights are quantized with anti_method m4. m4 quantization is an asymmetric outlier suppression method that generates a non-zero norm bias, so the add_rms_norm_quant op is updated to take this bias as a parameter. ### Does this PR introduce _any_ user-facing change? Please use a torch_npu version >= torch_npu-2.7.1.dev20250919. ### How was this patch tested? 1. No special parameters or new envs need to be set. 2. Tested with Qwen3 MoE quantized models such as Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4). - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: huangdong2022 <huangdong51@huawei.com> Signed-off-by: h30027576 <huangdong51@huawei.com>	2025-10-09 20:18:10 +08:00
Wang Yixuan	30c5d947c3	[bugfix]fix multistream moe in torchair (#3164 ) ### What this PR does / why we need it? Multistream MoE in torchair is only valid in decode and cannot be applied to chunked prefill, so this PR adds checks to isolate that scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-09 19:00:32 +08:00
weichen	94dd832815	[MoE] [Refactor] Combine common_fused_moe and fused_moe (#3176 ) ### What this PR does / why we need it? 1. Move additional functionalities from fused_moe.py to common_fused_moe.py and remove fused_moe.py 2. Remove unnecessary custom classes from qwen3_moe.py, and it will be completely removed after we release vllm-ascend v0.11.0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing: 1. Enable/Disable EP 3. Aclgraph & eager 4. SP - vLLM version: v0.11.0 --------- Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-10-09 14:12:46 +08:00
wangxiyuan	1c5b302f0d	[Misc] Clean up useless patch (#3320 ) ### What this PR does / why we need it? 1. Clean up v0.10.2 support in ut and e2e test. 2. Remove the v0.11.0 period job; we're at v0.11.0 now. 3. Remove the useless patch for deepseek v3.2. These changes have already been made in vLLM. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-09 14:07:26 +08:00
weijinqian0	474fa737c8	[bugfix] Fix moe bug: allgather error. (#3279 ) It will crash when deepseek model executed in A2. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-09-30 18:45:09 +08:00
wangxiyuan	4abdcdba4e	upgrade pta to 0919 (#3295 ) ### What this PR does / why we need it? Upgrade torch-npu to the newest POC version ### Does this PR introduce _any_ user-facing change? yes, user need upgrade the pta version as well. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-30 17:14:23 +08:00
Chao Lei	a486ff8c11	KVCache Transfer via Layer-wise Strategy in Disaggregation (#2602 ) ### What this PR does / why we need it? See RFC: https://github.com/vllm-project/vllm-ascend/issues/2470 This PR add a new kv connector for layer-wised kv transfer ### Does this PR introduce _any_ user-facing change? yes, a new kv connector is added. User can use layer wised feature now. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: leichao.lc <leichao139636@163.com> Signed-off-by: CaveNightingale <2859066733@qq.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: hanxinlong <50882499@qq.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: CaveNightingale <2859066733@qq.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: hanxinlong <50882499@qq.com>	2025-09-30 15:10:29 +08:00
Mengqing Cao	f8c93d8d24	[Aclgraph][DP] Fix dp dummy run not in aclgraph error (#3208 ) ### What this PR does / why we need it? When running DP in an unbalanced scenario, where some DP groups execute `dummy_run`, we need to make sure they run in the same mode as the other DP groups, which improves performance in DP scenarios. ### How was this patch tested? Tested by adding log in `_dummy_run` - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-30 11:14:51 +08:00
Mengqing Cao	4ff422c730	[CI][Bugfix] Quickfix for DPMetaData (#3234 ) ### What this PR does / why we need it? Fix `dpmetadata` and `Qwen3MoeSparseMoeBlock` break introduced by `26a7a33b88 (diff-c1550d0a38469d039370567d8981969530cbfffc7302cd1778e7c2c8a9322dea)` NOTE: we maintain a different sp in vllm-ascend with vllm, thus we can just use `cu_tokens_across_sp(1)` as `cu_tokens_across_dp_cpu` close https://github.com/vllm-project/vllm-ascend/issues/3236, https://github.com/vllm-project/vllm-ascend/issues/3239 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-28 21:11:22 +08:00
lilinsiman	1705501ae2	[BugFix] Fix ACLgraph bug in Qwen3_32b_int8 case (#3204 ) ### What this PR does / why we need it? 1. Solved the issue where sizes capture failed for the Qwen3-32b-int8 model when aclgraph, dp1, and tp4 were enabled. 2. Added the exception thrown when sizes capture fails and provided a solution 3. Add this common problem to the FAQ doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-09-28 17:44:04 +08:00
Wang Kunpeng	859e861d92	[main][quantization] Support deepseek w4a8 per-channel quantization (#3011 ) ### What this PR does / why we need it? 1.Support deepseek w4a8 per-channel quantization 2.The eager mode supports converting weights to the NZ format ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? #### How to get weights using Modelslim ##### Installation steps git clone https://gitcode.com/Ascend/msit.git cd msit/msmodelslim bash install.sh ##### Generate w4a8 per-channel weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-09-27 21:01:16 +08:00
florenceCH	14497b748d	Remove qwen3 moe MC2 cumsum & cast (#3126 ) What this PR does / why we need it? The Qwen3 moe MC2 graph currently contains two redundant operators: a cumsum and a cast were added after npu_moe_distribute_dispatch_v2. By using expert_token_nums_type=0 and not converting weight_scale to float32, these two operators can be eliminated, thereby improving inference performance. Does this PR introduce any user-facing change? No How was this patch tested? Not needed - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: florenceCH <gaoxiang120@huawei.com> Co-authored-by: florenceCH <gaoxiang120@huawei.com>	2025-09-26 08:51:30 +08:00
wangxiyuan	0794f64a18	Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist (#3194 ) …to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)" ### What this PR does / why we need it? This reverts commit `6995a7bc5b`. We'll add it back once the issue is fixed. related issue: https://github.com/vllm-project/vllm-ascend/issues/3195 ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458`	2025-09-26 06:17:36 +08:00
wangxiyuan	ac1c2cd9ac	[CI] Upgrade vllm version - 0925 (#3167 ) Upgrade vLLM to newest commit. 1. Remove the useless func get_state_cls, it has been removed from vLLM already. `e6750d0b18` 2. Fix ut broken by `6160ba4151` - vLLM version: v0.10.2 - vLLM main: `b1068903fd` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-25 14:20:10 +08:00
mfyCn-1204	33c118c80e	[core]vllm-ascend support msMonitor tool (#3123 ) ### What this PR does / why we need it? vllm-ascend supports the [msMonitor](https://gitcode.com/Ascend/mstt/tree/master/msmonitor) tool to collect performance data of vllm-ascend. ### Does this PR introduce _any_ user-facing change? 1. Add env MSMONITOR_USE_DAEMON; 2. users can enable the msMonitor tool by setting MSMONITOR_USE_DAEMON=1 before running a vllm-ascend model; 3. MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both be set. ### How was this patch tested? 1. Run a vllm-ascend model with MSMONITOR_USE_DAEMON unset or set to 0: the model runs successfully; 2. run a vllm-ascend model with MSMONITOR_USE_DAEMON=1, then run the msMonitor tool to collect profile data; 3. run a vllm-ascend model with both MSMONITOR_USE_DAEMON=1 and VLLM_TORCH_PROFILER_DIR set: an error is raised. - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: mei-feiyao <1332490378@qq.com>	2025-09-25 14:15:02 +08:00
wangxiyuan	a055183821	[CI] Upgrade vLLM version (#3139 ) Upgrade vLLM version to the newest commit. - Fix the breaking change introduced by `969b4da3a6` - Add a patch to quick fix torchair `de94289a98` - Fix the ut error introduced by `de94289a98` Close: https://github.com/vllm-project/vllm-ascend/issues/3138 - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-25 07:36:51 +08:00
Clorist33	302494c1fe	[EPLB] ut for EPLB (#3035 ) ## UT for EPLB Co-authored-by Skywalker-EP 173723846@qq.com Co-authored-by offline 0806@qq.com Co-authored-by dsxsteven@sina.com ## UT Description ### 1. Module Description - Module: EPLB ### 2. Covered Source Files - vllm_ascend/eplb/adaptor/abstract_adaptor.py - vllm_ascend/eplb/core/eplb_device_transfer_loader.py - vllm_ascend/eplb/core/eplb_utils.py - vllm_ascend/eplb/core/policy/policy_abstract.py - vllm_ascend/eplb/core/policy/policy_dynamic_ep.py - vllm_ascend/eplb/core/policy/policy_dynamic_ep_v2.py - vllm_ascend/eplb/core/policy/policy_factory.py ### 3. Testing Method - Framework: pytest - Test Data: mock data - Test Type: unit test ### 4. Coverage - Statement Coverage: 90% - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Signed-off-by: tanqingshan <50050625@china.huawei.com> Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: tanqingshan (A) <t50050625@china.huawei.com> Co-authored-by: tanqingshan <50050625@china.huawei.com> Co-authored-by: daishixun <dsxsteven@sina.com> Co-authored-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2025-09-24 17:14:38 +08:00
Csrayz	80524f5711	[CORE] concurrent partial prefills (#2372 ) # What this PR does / why we need it? When processing a mix of large and small requests, the TTFT of responses is significantly reduced. Please refer to https://github.com/vllm-project/vllm/pull/10235, which achieves the same effect by simply limiting the number of prompt fills for long requests. This solution can be applied to both AscendScheduler (V0) and vLLM Scheduler (V1). Tests show that TTFT can be significantly improved when handling such mixed requests. However, this capability is currently missing when Ascend Scheduler is enabled. This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card. Regarding dataset selection, the sharegpt_clean dataset is used, with its content concatenated and cropped. Small requests with token=50 and medium requests with token=10240 were constructed (there were also large requests with token=102400, but these were ignored because when using the Prefill First scheduling strategy, max_num_batched_tokens will not be set to such a large value). When loading vLLM, set max_num_batched_tokens=22000. This length can accommodate two medium-sized requests and some short requests, reflecting an extreme scenario where the budget is almost entirely occupied by longer requests. Next, we mix 990 small requests and 100 medium requests into one type of load scenario (hereinafter referred to as 10%), and similarly generate load scenarios with 5% medium requests and 1% load scenarios. Performance tests were conducted separately for enabling vLLMScheduler, AscendScheduler, and AscendScheduler (long prompt concurrency set to 1). - vLLM version: v0.10.2 - vLLM main: `1dfea5f4a9` --------- Signed-off-by: Csrayz <jover@cmbchina.com>	2025-09-24 17:12:55 +08:00
baxingpiaochong	eb205d9f35	[P/D][BugFix]Mooncake timeout release bug fix (#2899 ) ### What this PR does / why we need it? In the P node timeout release mechanism during PD separation, the req_id that requires timeout release is transmitted from the scheduler to the worker. If the KV cache between PDs is transferred too quickly, the P node's req_id may be released twice. The first release is when the D node notifies the P node that the KV cache has been pulled, and the second release is when the scheduler transmits the timeout release to the worker. To address this bug, an intermediate component is introduced to manage the release of req_ids. Pull kv and forward2 may occur one after the other in timing. The previous timeout defaulted to forward2 being before pull_kv. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-09-24 11:22:46 +08:00
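
Note on c2c1db78a7 (#3437): the commit message says that when prefill_tp_size > num_kv_head, num_head_replica can become 0 and is then used as a divisor, and that the fix restricts its minimum value to 1. The following is a minimal sketch of that kind of guard, not the project's code. The function names and the integer-division formula for num_head_replica are assumptions made for illustration; only the clamp to 1 comes from the commit message.

```python
def num_head_replica(num_kv_head: int, prefill_tp_size: int) -> int:
    """Hypothetical helper: derive a replica count that is never zero.

    Assumption: the count comes from an integer division, which yields 0
    whenever prefill_tp_size > num_kv_head. Clamping to 1 mirrors the
    "minimum value of 1" described in the commit message.
    """
    return max(num_kv_head // prefill_tp_size, 1)


def items_per_replica(total: int, num_kv_head: int, prefill_tp_size: int) -> int:
    """Hypothetical consumer that divides by the replica count."""
    # Safe: the divisor is at least 1, so no ZeroDivisionError is raised.
    return total // num_head_replica(num_kv_head, prefill_tp_size)
```

Without the clamp, a call such as `items_per_replica(16, 2, 8)` would divide by `2 // 8 == 0`; with it, the divisor falls back to 1.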


246 Commits