xc-llm-ascend

Author	SHA1	Message	Date
liziyu	3164cb663c	[Bugfix] mooncake connector support external dp & update readme (#3579 ) ### What this PR does / why we need it? mooncake connector support external dp & update readme ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-21 20:15:24 +08:00
Chen Chen	6b290acfe1	remove redundant params in mla_preprocess kernel (#3530 ) ### What this PR does / why we need it? This pull request removes the redundant parameters `gamma1` and `beta1` (also named `gamma0`/`beta0` in some places) from the `mla_preprocess` kernel and its calling hierarchy. The changes are consistent across C++ kernel code, bindings, and Python call sites. The parameters were unused in the lower-level functions, so their removal is a good cleanup. ### Does this PR introduce _any_ user-facing change? The python interface of the kernel is affected, and the params of `gamma0` and `beta0` are not needed. ### How was this patch tested? The unit-test of the kernel is adapted accordingly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-10-21 19:20:13 +08:00
jiangyunfan1	80b8df881f	[TEST] Add Qwen3-32b-w8a8 acc/perf A2/A3 test (#3541 ) ### What this PR does / why we need it? This PR adds 8 Qwen3-32b-w8a8 acc/perf cases on A2 and A3; we need to test them daily. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: root <root@hostname-2pbfv.foreman.pxe> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-10-21 17:34:48 +08:00
Yizhou	ec1d2b5c04	[Test] Temporarily skip flaky ACL graph test (#3577 ) ### What this PR does / why we need it? Disables `FULL_DECODE_ONLY` end-to-end test that fails intermittently. This prevents CI blockages while the root cause of the flakiness is investigated. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-21 17:16:15 +08:00
Li Wang	9830f85c42	[CI] Fix test_mla_v1 (#3570 ) ### What this PR does / why we need it? Remove test cases containing CPU incompatible operators ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-21 10:31:55 +08:00
Zhu Yi Lin	4a849df6fa	[main] support cpu binding (#3546 ) ### What this PR does / why we need it? Currently, in the piecewise mode of aclgraph, attention runs in eager mode, which causes abnormal allreduce latency for the O matrix. The reason is that CPU resources are preempted in eager mode. So I hope to temporarily add CPU binding to vllm-ascend. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: GDzhu1 <809721801@qq.com>	2025-10-21 09:17:03 +08:00
Yizhou	274b708e0c	[Fix] Refactor dummy attention metadata creation (#3497 ) ### What this PR does / why we need it? The `force_attention` parameter is designed for FlashInfer kernel warmup; we don't actually need it on Ascend devices (at least for now), and it tends to make things more complicated. So we replace the `force_attention` parameter with `aclgraph_runtime_mode` in the attention metadata creation logic. This change makes the control flow more explicit by directly using the graph runtime mode to determine how to build attention metadata, rather than relying on an intermediate boolean flag. This simplification removes redundant logic and clarifies the conditions for building attention metadata for full decode graph mode. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? DP + `FULL_DECODE_ONLY` + online serving. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-21 00:00:42 +08:00
likeful	6b6857929d	[Doc] Add --shm-size option to Docker command for qwen3 vl 235B (#3519 ) ### What this PR does / why we need it? Added shared memory size option to Docker run command. If shm-size is not specified, Docker will use 64MB by default. In this case, the vllm:EngineCore process may core dump under high workload. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Done Closes: https://github.com/vllm-project/vllm-ascend/issues/3513 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: likeful <irayki@gmail.com> Signed-off-by: leijie2015 <irayki@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-20 23:37:35 +08:00
wangxiyuan	0bf3f21a98	Revert "Add mrope op fusion (#3509 )" (#3562 ) This reverts commit `646c1db5d7`. This new op may lead to accuracy problems. ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0	2025-10-20 20:19:24 +08:00
linfeng-yuan	068ed706c8	[feat][torchair] support super kernel feat for quantized dsr1 (#3485 ) ### What this PR does / why we need it? Port #1916 and #2157 to master branch to fuse operators in deepseek moe layers, which can reduce scheduling overhead on devices. Note that this feature is valid only when `tp_size = 1` and `multistream_overlap_shared_expert` is enabled with torchair graph mode. ### Does this PR introduce _any_ user-facing change? Users can enable this feature with `--additional-config '{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true}, "multistream_overlap_shared_expert":true}'`. ### How was this patch tested? E2E deepseek serving with 2P1D disaggregated prefill scenarios. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-20 20:04:37 +08:00
lilinsiman	70bef33f13	add new accuracy test case for aclgraph (#3390 ) ### What this PR does / why we need it? Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-20 20:04:04 +08:00
ZYang6263	b9e2896eb1	Revert "[Perf] Add FIA interface in FA case" (#3553 ) Reverts vllm-project/vllm-ascend#3321 due to an output dimension mismatch and accuracy issue - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-20 19:56:10 +08:00
Zhu Yi Lin	34c2996ab8	[main] v_proj combining transpose and matmul (#3545 ) ### What this PR does / why we need it? v_proj combining transpose and matmul ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: GDzhu1 <809721801@qq.com>	2025-10-20 19:53:32 +08:00
Jade Zheng	e04a5e3dd3	[Bugfix] Fix race condition in d2h transfer (#3372 ) ### What this PR does / why we need it? Using non-blocking operations for device-to-host transfers can lead to data corruption in later steps. The CPU tensor is accessed right after the transfer is triggered, but the transfer might not be complete yet. As a result, the data could be wrong. This problem was seen in the A3 environment during `profile_run`. ### How was this patch tested? CI pass. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-10-20 18:24:21 +08:00
zhangxinyuehfad	fdac146f71	[UT] fix skip ut test and enable ut test run normally (#3410 ) ### What this PR does / why we need it? fix skip ut test and enable ut test run normally ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-10-20 16:30:57 +08:00
whx	f8b52fe950	[Model][1/N] Delete deepseek v2/v3 modeling codes. (#3189 ) This PR deletes the model code of deepseek_v2 and deepseek_v3 to reuse the model file from vLLM. vLLM Ascend now registers custom ops instead of hard-coding model files. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-20 15:31:34 +08:00
Mengqing Cao	918ded9155	[BugFix][HybridKV] Update the check logic of reinitializing inputbatch (#3540 ) ### What this PR does / why we need it? Update the check logic for reinitializing `InputBatch`; this is a follow-up PR to #3477. `kernel_block_sizes` is a `list[list[int]]`, and the original logic always updated `InputBatch` when using hybrid blocks; this PR fixes that. ### How was this patch tested? Tested locally with qwen3-next - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-20 15:29:48 +08:00
Mengqing Cao	daa4dd0a57	[DeepSeek] Seperate deepseek v3.2 modeling form deepseek v2 (#3531 ) ### What this PR does / why we need it? Separate deepseek v3.2 modeling from deepseek v2 ### How was this patch tested? - CI passed with existing test. - test deepseek v3.2 locally - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-20 09:50:44 +08:00
Mengqing Cao	6c65dd891f	[ModelRunner][Qwen3-Next] Fix attn_group initialization timing (#3477 ) ### What this PR does / why we need it? Fix attn_group initialization timing to fix the qwen3-next model ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-20 09:39:40 +08:00
jiangyunfan1	9e59fc1510	[TEST] Add initial aisbench support and Qwen3 32B acc/perf test (#3474 ) ### What this PR does / why we need it? This PR adds the first aisbench case for nightly test, it lays a foundation for following performance and accuracy tests in nightly test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-10-20 09:33:17 +08:00
zouyida2052	58a37ce189	bugfix for mooncake (#3535 ) ### What this PR does / why we need it? bugfix for mooncake, remove useless judgement. ### How was this patch tested? by ci - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-19 17:06:05 +08:00
ZYang6263	1e78ecbad6	[Perf] Add FIA interface in FA case (#3321 ) ### What this PR does / why we need it? Add the new npu_fused_infer_attention_score op to improve performance in the flash attention case. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-19 12:45:33 +08:00
Wang Kunpeng	4b3bd4f397	[main][bugfix] bugfix for minicpm models (#3527 ) ### What this PR does / why we need it? bugfix for minicpm-2b and minicpm3-4b - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-10-19 11:00:55 +08:00
offline893	6c9909c861	[Patch]patch of v1 executor when enable eplb. (#3511 ) ### What this PR does / why we need it? When using dynamic EPLB, patch the v1 executor to avoid failures when creating child processes. ### How was this patch tested? deepseek in v3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-19 10:54:26 +08:00
shaopeng-666	646c1db5d7	Add mrope op fusion (#3509 ) ### What this PR does / why we need it? Add mrope fusion op for qwen2.5-vl. This mrope operator doesn't support Qwen3-VL currently, so it only takes effect in qwen2.5-vl - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com> Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>	2025-10-18 18:08:24 +08:00
xuyexiong	0777e2f899	Optimize torchair kv_consumer padding logic (#3526 ) ### What this PR does / why we need it? Optimize torchair kv_consumer padding logic. Only pad when it is spec decoding ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-18 16:42:17 +08:00
Shirley125	b4233a2ec3	[Bugfix] Route requests requiring KVC recomputation from the decode instance to the P instance (#3448 ) ### What this PR does / why we need it? This PR aims to fix the recomputation out-of-memory bug in the decode instance. When recomputation happens in decode, KV cache usage may exceed the pre-allocated memory and cause OOM. So we propose a new scheduling strategy: when the decode instance cannot allocate a new block for running requests, we stop the request that would be preempted. These stopped requests are recognized by the proxy and sent to the prefill instance again to compute the KV cache, and are then directed to the decode instance. This is a temporary plan to fix the bug. The long-term strategy is to use CPU offload in the decode instance. ### Does this PR introduce _any_ user-facing change? An extra ascend configuration option, recompute_scheduler_enable = True, is added to enable this strategy. The default value is False ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>	2025-10-18 15:56:44 +08:00
yechao237	4750d45d86	[BugFix]Support redundant experts in EPLB (#3473 ) This PR adds support for redundant experts in the EPLB. Key points: - Use global_num_experts = num_experts + num_redundant_experts consistently. - Backward compatible when num_redundant_experts=0. Tested On a 16-rank setup (W8A8) with static EPLB and expert_map_path, verifying router logits shape and successful requests. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: yechao237 <yechao20180411@gmail.com>	2025-10-18 00:09:16 +08:00
Slightwind	07ca1b9b78	[Refactor] Clean up w4a4_flatquant_dynamic implementation (#3440 ) Cleans up the initial implementation of `w4a4_flatquant_dynamic` for better readability and maintainability. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-10-17 23:53:19 +08:00
xuyexiong	21769e8f44	[BUGFIX] Mtp torchair pd fix (#3506 ) ### What this PR does / why we need it? In memory of https://github.com/vllm-project/vllm-ascend/pull/2610 and #3449 Fix Mtp torchair pd bug. In the PD disaggregation scenario, the first token of the inference after the D node receives the KV follows the eager mode. Fixes: When running MTP torchair graph mode with prefill-decode disaggregation, if all requests processed by the D node are requests just transmitted from the P node, it will break the torchair graph. Reason: During PD disaggregation, the P node only transmits the KV cache and prompt to the D node, not the actual tokens inferred (neither the main model tokens nor the MTP tokens are transmitted). Therefore, the D node will treat this request as one without MTP tokens for inference (seq_len=1). The community does not have graph mode issues because the community's attention has a seq_len=1 for each batch during the decode phase. We have issues because the graph mode pads according to processing 2 tokens per request. When there are some seq_len=1 and some seq_len=2, padding is done at the end. If all requests received by the D node are seq_len=1, padding cannot be performed normally according to the attention's FIA operator constraints. Solution: The KV consumer uses extra torchair graph padding to avoid breaking FIA graph constraints (the one this PR implemented). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-17 21:57:05 +08:00
Angazenn	9547d6f0d9	[Core]Append padding logic for Attention (#3256 ) ### What this PR does / why we need it? This PR aims to add padding logic to seq_lens and block_tables when running in the full decode scenario. Before this PR, the number of input tokens with padding might exceed the corresponding seq_lens. For example, when running in the full decode scenario: ``` input_ids : [1, 3, 0, 0] seq_lens: [2, 1] query_start_loc: [0, 1, 2] ``` Here, `input_ids` is padded by 2 tokens while `seq_lens`/`query_start_loc` are not. The mismatch between `input_ids` and `seq_lens`/`query_start_loc` might cause some potential bugs. This PR changes it into: ``` input_ids : [1, 3, 0, 0] seq_lens: [2, 1, 1, 1] query_start_loc: [0, 1, 2, 3, 4] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-10-17 21:56:01 +08:00
realliujiaxu	b154a8e22c	[Bugfix] fix logging and d2h bug for flash comm1 (#3505 ) ### What this PR does / why we need it? Fix 3 bugs in flash comm1 of Allgather EP (https://github.com/vllm-project/vllm-ascend/pull/3334): 1. Calling `enable_sp()` with argument `vllm_config` triggers many warning logs; this PR caches its return value. 2. `num_tokens_after_padding` should be a CPU tensor, as it will be used as `num_tokens_across_dp_cpu` in `DPMetadata`; otherwise it causes many d2h copies when running the model. 3. In PD, the model runner will execute `kv_connector_no_forward`, where `num_tokens` is None - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-10-17 21:13:41 +08:00
anon189Ty	248ee7fa11	[Feat]Make full graph mode compalible with MTP (#3276 ) ### What this PR does / why we need it? Make Full Graph mode able to run with MTP. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-17 20:19:56 +08:00
anon189Ty	46e62efd44	[Feat]mtp aclgraph support (#3244 ) ### What this PR does / why we need it? Currently, the MTP model in deepseek cannot be captured in ACLGraph. This PR allows MTP to be captured in ACLGraph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-17 18:14:49 +08:00
lilinsiman	1b424fb7f1	ACLgraph enable: Test cases revisions for all features (#3388 ) ### What this PR does / why we need it? This PR revises the test cases for various features in the repository to enable aclgraph in them. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-17 17:15:19 +08:00
zhaozx-cn	bf87606932	[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 ) ### What this PR does / why we need it? shared expert dp for deepseek and deepseek_mtp, could be combined with sp to improve performance. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhaozx-cn <zhaozx2116@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-10-17 15:06:37 +08:00
Angazenn	d9ee491f70	[BugFix]Move to_list in foward_v1 with FIA earlier to build (#3185 ) ### What this PR does / why we need it? The current implementation of FIA introduces a `to_list` operation for actual_seq_lengths_q and seq_lens, which consumes extra time. These operations can be moved earlier into the `build` operation of attention metadata. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-10-17 11:19:41 +08:00
xuyexiong	30e3d86b0f	Revert "[BUGFIX] Mtp torchair pd fix (#3449 )" (#3500 ) This reverts commit `b0ae203e72`. ### What this PR does / why we need it? The fix is not ready yet, conflict with #3411 need to revert first. Will fix this issue later ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-17 09:42:48 +08:00
huangdong2022	3a53bbc508	[Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with bias, resolve conflict with weight prefetch (#3465 ) ### What this PR does / why we need it? 1. Qwen3 MoE uses the add_rms_norm_quant op instead of separate add_rms_norm and quant ops in quantization scenarios. 2. The torch_npu.add_rms_norm_quant op fixes accuracy when model weights are quantized by anti_method m4. m4 quantization is an asymmetric outlier suppression method that generates a non-zero norm bias, so the add_rms_norm_quant op is updated to include this parameter in the calculation. 3. Add torch-npu check. ### Does this PR introduce _any_ user-facing change? The new feature works if torch_npu version >= torch_npu-2.7.1.dev20250919 ### How was this patch tested? 1. No special parameters or new envs to set. The new feature works if torch_npu version >= torch_npu-2.7.1.dev20250919 2. Tested with Qwen3 MoE quantized models, such as Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4) - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: h30027576 <huangdong51@huawei.com>	2025-10-17 09:30:51 +08:00
Li Wang	4c4a8458a5	[CI] Refator multi-node CI (#3487 ) ### What this PR does / why we need it? Refactor the multi-machine CI use case. The purpose of this PR is to increase the ease of adding multi-machine CI use cases, allowing developers to add multi-machine cluster model testing use cases (including PD separation) by simply adding a new YAML configuration file. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-17 09:04:31 +08:00
Yizhou	ccb6fb9ec1	[Fix] Clears unused slot mappings and fix accuracy issue with MLA models when enabling `FULL_DECODE_ONLY` (#3482 ) ### What this PR does / why we need it? MLA and GQA use different computation logic: MLA slices batches and only computes on the actually valid tokens. That means outer padding must be handled carefully: the accuracy issue this PR fixes was caused by stale data in `slot_mapping` being reused by subsequent inference steps. So we zero out the portion of the slot mapping tensor that is not used by the currently scheduled tokens. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Working on it. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-16 19:43:09 +08:00
elilzhu	f9535cc9e2	[BugFix] fix qwenVL quant assertion error (#3466 ) ### What this PR does / why we need it? This PR fixes two issues: 1. Multimodal scenarios could not do weight prefetching and threw an assertion error. 2. Standardize the grid_thw data type of qwen2VL to torch.int32. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? - ci & e2e - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: elilzhu <2435754260@qq.com> Co-authored-by: zhulei (AK) <z00692222@china.huawei.com>	2025-10-16 17:08:00 +08:00
menogrey	9ff6b0b862	[CI]: Fix doctest ci for main release (#3451 ) ### What this PR does / why we need it? Fix doctest CI for main release. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: menogrey <1299267905@qq.com>	2025-10-16 14:38:11 +08:00
xuyexiong	b0ae203e72	[BUGFIX] Mtp torchair pd fix (#3449 ) ### What this PR does / why we need it? In memory of https://github.com/vllm-project/vllm-ascend/pull/2610 In the PD disaggregation scenario, the first token of the inference after the D node receives the KV follows the eager mode. Fixes: When running MTP torchair graph mode with prefill-decode disaggregation, if all requests processed by the D node are requests just transmitted from the P node, it will break the torchair graph. Reason: During PD disaggregation, the P node only transmits the KV cache and prompt to the D node, not the actual tokens inferred (neither the main model tokens nor the MTP tokens are transmitted). Therefore, the D node will treat this request as one without MTP tokens for inference (seq_len=1). The community does not have graph mode issues because the community's attention has a seq_len=1 for each batch during the decode phase. We have issues because the graph mode pads according to processing 2 tokens per request. When there are some seq_len=1 and some seq_len=2, padding is done at the end. If all requests received by the D node are seq_len=1, padding cannot be performed normally according to the attention's FIA operator constraints. Solution: The KV consumer uses extra torchair graph padding to avoid breaking FIA graph constraints (the one this PR implemented). The KV producer provides the correct tokens to the KV consumer, so that our graph mode constraints are not broken, and all logic is the same as the PD mixed deployment. Since we are using the community scheduler, the modification requires patching the vllm scheduler, but theoretically, performance should be better. (Maybe later) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-16 09:03:49 +08:00
leo-pony	291c00a224	[Doc] pin version that can stable running 310I Duo to vllm-ascend v0.10.0rc1 (#3455 ) Pin the version that runs stably on 310I Duo to vllm-ascend v0.10.0rc1. ### What this PR does / why we need it? Since PR #2614, 310I Duo has been broken. Although we are currently working on fixing it, there is no confirmed timeline for a fix in the short term. To allow users to quickly find a working version instead of going back and forth by trial and error, this PR pins the version in the 310I Duo guide. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-10-16 08:54:09 +08:00
leo-pony	ff91904ee2	[Doc] Clearer corresponding relationship between configurations for multi-node guides (#3441 ) Optimize the multi-node guide: make the correspondence between configuration items and nodes clearer ### What this PR does / why we need it? Some issues were caused by misunderstandings due to unclear guidance content, for example: #3367 ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-10-16 08:54:03 +08:00
DreamerLeader	aa6154703a	[BugFix]GPQA Accuracy Issue Bugfix (#3476 ) ### What this PR does / why we need it? The GPQA dataset accuracy in the PD separation test scenario is 33.2, which does not meet the paper's requirement of 70. This PR resolves this accuracy issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? GPQA has accuracy issues, but modifying the code ensures the accuracy meets the standard - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fjw <2270923832@qq.com>	2025-10-15 23:28:17 +08:00
weichen	cec1fab509	Revert "[MoE] [Refactor] Remove manual memory cleanup (#3365 )" (#3483 ) This reverts commit `4f937f561d`. ### What this PR does / why we need it? This reverts commit `4f937f561d`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-15 22:25:46 +08:00
realliujiaxu	f69a83b7ba	[Feat] Flash comm allgher ep (#3334 ) Support flash comm v1 (Sequence Parallelism) for Allgather EP. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-10-15 19:36:32 +08:00
Mengqing Cao	8abe517870	[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432 ) ### What this PR does / why we need it? Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch. The final goal is to remove all the patches and align the code arch to vllm, thus we need to do the following work in next prs. TODO: - [x] remove patch on attention spec - [ ] refactor the kvcache creation logic ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? 1. CI passed with existing test. 2. Test pass with deepseek-v3.2-exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-15 17:48:58 +08:00
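
The padding change described in commit 9547d6f0d9 (#3256) above can be sketched in plain Python. This is only an illustration of the before/after example given in that commit message, not the code from the PR: the helper name `pad_decode_metadata` and its parameters are hypothetical, and it assumes a decode-only batch where each padded slot is treated as a dummy request with one query token and a sequence length of 1.

```python
from typing import List, Tuple


def pad_decode_metadata(
    seq_lens: List[int],
    query_start_loc: List[int],
    num_padded_tokens: int,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: extend seq_lens and query_start_loc so they
    cover the padded input tokens, one dummy request per padded token."""
    # The last entry of query_start_loc is the number of real tokens.
    num_actual_tokens = query_start_loc[-1]
    num_pad = num_padded_tokens - num_actual_tokens
    if num_pad < 0:
        raise ValueError("padded token count is smaller than the real token count")

    # Assumption: each dummy request gets seq_len 1, as in the commit's example.
    padded_seq_lens = list(seq_lens) + [1] * num_pad
    # Each dummy request contributes exactly one query token.
    padded_query_start_loc = list(query_start_loc) + [
        num_actual_tokens + i for i in range(1, num_pad + 1)
    ]
    return padded_seq_lens, padded_query_start_loc


# Values taken from the commit message: input_ids [1, 3, 0, 0] has 4 slots,
# of which 2 are real tokens.
seq_lens, query_start_loc = pad_decode_metadata([2, 1], [0, 1, 2], 4)
print(seq_lens)         # [2, 1, 1, 1]
print(query_start_loc)  # [0, 1, 2, 3, 4]
```

After padding, `len(seq_lens)` and `query_start_loc[-1]` both match the padded length of `input_ids`, which is the consistency the commit message says was missing.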


1275 Commits