xc-llm-ascend

Author	SHA1	Message	Date
Clorist33	4f0dddc9ee	[Bugfix] bugfix for moe_mlp in vllm-ascend/v0.11.0-dev (#4885 ) ### What this PR does / why we need it? This PR fixes a bug in the moe_mlp module by correcting the arguments passed to the torch_npu.npu_dequant_swiglu_quant function. It properly converts group_list from a cumulative sum to counts for the group_index parameter. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/main --------- Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: Mercykid-bash <ruanche0218@gmail.com>	2025-12-12 14:51:47 +08:00
1092626063	ceadc2788d	Revert "[refactor]support gatingtopk operator generalization (#4356 )" (#4873 ) This reverts commit `c4a11a745a`. ops npu_gating_top_k caused Qwen3-30B precision problem, so revert it. Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-10 15:45:20 +08:00
Levi	9862a23985	【0.11.0-dev】optimization of kimi-k2 in cann8.3 (#4555 ) ### What this PR does / why we need it? In CANN 8.3, the npu_moe_gating_top_k operator supports an expert count of 384, so Kimi can use the operator to get better performance. --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-09 08:49:15 +08:00
Wang Yixuan	d412565ec9	[Cherry-pick]bmm_transpose to v011dev (#3995 ) ### What this PR does / why we need it? Add a custom op to accelerate the DeepSeek model. The fused op combines bmm and transpose and is applied to the MLA module. Cherry-picked from commit c68ddc11ce53334fc9a17bad58342148cbf14e86 ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-08 19:22:14 +08:00
1092626063	c4a11a745a	[refactor]support gatingtopk operator generalization (#4356 ) ### What this PR does / why we need it? This PR is cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/2958 and https://github.com/vllm-project/vllm-ascend/pull/4340 Past: npu_moe_gating_top_k only supported the 'group_count=256' pattern. Now: 1. npu_moe_gating_top_k supports all sizes of group_count. 2. The functionality of `torch_npu.npu_moe_gating_top_k_softmax` is included in `torch_npu.npu_moe_gating_top_k`. CANN: depends on 8.3.RC1. Performance: 1. GLM4.5-w8a8: TPS improves by 6%. 2. Qwen3: the same as before. --------- Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-04 20:10:13 +08:00
LI SHENGYONG	593a96056c	【EPLB】Eplb Redundant Experts Bugfix (#4232 ) ### What this PR does / why we need it? Redundant experts bugfix. The calculation logic for redundant experts has been fixed, so the correct number of redundant experts is now calculated from the map. Therefore, there is no longer a need to set the redundant expert parameter when passing the map. ### Does this PR introduce _any_ user-facing change? After configuring the path for experts_map, users do not need to configure init_redundancy_expert. ### How was this patch tested? The accuracy of EPLB was tested with and without the use of redundant experts. --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-03 12:00:05 +08:00
Zhu Yi Lin	96c362361e	[0.11.0][TEST] Delete Comment (#4428 ) ### What this PR does / why we need it? delete chinese comment pick from https://github.com/vllm-project/vllm-ascend/pull/4427 ### Does this PR introduce _any_ user-facing change? no Signed-off-by: GDzhu01 <809721801@qq.com>	2025-11-25 21:39:36 +08:00
wangxiyuan	a2e4c3fe78	Revert "[cherry-pick][refactor]support gatingtopk operator generalization (#4050 )" (#4352 ) This reverts commit `c87a77e8b4`. it breaks ops e2e test Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-21 23:03:20 +08:00
SILONG ZENG	5ad0ccdc31	[v0.11.0]Upgrade cann to 8.3.rc2 (#4332 ) ### What this PR does / why we need it? Upgrade CANN to 8.3.rc2 Signed-off-by: MrZ20 <2609716663@qq.com>	2025-11-21 22:48:57 +08:00
shaopeng-666	b6d59bdea2	cherry pick from pr 4270 (#4285 ) ### What this PR does / why we need it? avoid mrope fusion op when running qwen25vl on x86 machine --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-11-19 22:32:02 +08:00
1092626063	c87a77e8b4	[cherry-pick][refactor]support gatingtopk operator generalization (#4050 ) ### What this PR does / why we need it? Picked from https://github.com/vllm-project/vllm-ascend/pull/2958 Past: npu_moe_gating_top_k only supported the 'group_count=256' pattern. Now: 1. npu_moe_gating_top_k supports all sizes of group_count. 2. The functionality of `torch_npu.npu_moe_gating_top_k_softmax` is included in `torch_npu.npu_moe_gating_top_k`. CANN: depends on 8.3.RC1. Performance: 1. GLM4.5-w8a8: TPS improves by 6%. 2. Qwen3: the same as before. Signed-off-by: 1092626063 <1092626063@qq.com>	2025-11-19 10:39:28 +08:00
zhaomingyu13	650ce8ad19	[0.11.0][Bugfix] Fix ngram precision issue and open e2e ngram test (#4092 ) ### What this PR does / why we need it? Fix ngram precision issue and open e2e ngram test --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com> Co-authored-by: Icey <1790571317@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-11-11 09:58:03 +08:00
Icey	c5fe179cef	[0.11.0] [Cherry-pick #4058 ] Fixes Qwen3-Next enable nz accuracy problem (#4056 ) ### What this PR does / why we need it? - Fixes Qwen3-Next enable nz accuracy problem --------- Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com>	2025-11-10 20:56:39 +08:00
rjg-lyh	ebd45b6596	[V0.11.0][Core] Restore scheduling logic under default configuration (#4094 ) ### What this PR does / why we need it? Cherry-pick #3967 from main branch. This PR reverts the changes introduced in PR #2894. Initially, due to performance issues with the older version of the chunked prefill ops, the default behavior was to use the Ascend scheduler to disable the chunked prefill feature. However, with the improvements in the performance of the new chunked prefill ops, this interception strategy has been removed. This change also aligns with the community's default configuration behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-11-10 20:02:23 +08:00
zhangxinyuehfad	d913f9474b	[0.11.0][Fix] Fix Qwen2-Audio-7B-Instruct accuracy test (#4018 ) ### What this PR does / why we need it? Fix Qwen2-Audio-7B-Instruct accuracy test Backport:https://github.com/vllm-project/vllm-ascend/pull/4017 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-10 11:54:30 +08:00
hucong	7ea17fbee3	[0.11.0][BugFix] Improve the performance of prefixcache features (#4021 ) ### What this PR does / why we need it? cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/4022 The code bug caused an empty bubble. When the npu_paged_cache_load operator was called, it forcibly transferred seq_len2 to the device, which triggered synchronization and interrupted the CPU operator's launch stream. --------- Signed-off-by: underfituu <hzhucong@163.com>	2025-11-10 11:51:34 +08:00
wangxiaoteng888	c2d58c0655	[P/D][BugFix][v0.11.0-dev]Fix proxy format processing errors & Layerwise connector performance optimization (#4069 ) ### What this PR does / why we need it? 1.Fix proxy format processing errors. 2.Layer-wise connector performance optimization Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-11-09 09:55:10 +08:00
wangx700	55e37f5041	[v0.11.0][Bugfix] fix sleepmode level2 e2e test (#4023 ) ### What this PR does / why we need it? fix sleepmode level2 e2e test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? use e2e tests Signed-off-by: wangx700 <wangxin700@huawei.com>	2025-11-08 14:11:15 +08:00
zxr2333	d4e2a44307	[Cherry Pick from pr#3981][0.11.0][P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3983 ) ### What this PR does / why we need it? Make kv-transfer env variable take effect & Fix load-balance proxy. Cherry Pick from #3981 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-11-08 13:52:33 +08:00
lilinsiman	016337eaec	[v0.11.0][UT] Add new ut case for aclgraph enable (#4038 ) ### What this PR does / why we need it? add new ut case for aclgraph enable ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-11-07 11:35:24 +08:00
wangxiyuan	7ee0b0b5d8	[cherry-pick]Upgrade CANN to 8.3.rc1 (#3945 ) (#3962 ) This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version check logic. TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base image for UT is still 8.2. We'll fix it later. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-06 09:05:08 +08:00
wangxiyuan	8a7154001e	[0.11.0]Chery pick pta upgrade change (#3940 ) This PR cherry-picks two commits from main to upgrade torch-npu to the 2.7.1 official release --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-31 22:14:26 +08:00
lilinsiman	ee2e55e602	[v0.11.0][Test] Add new test model for aclgraph single_request v0.11.0 (#3889 ) ### What this PR does / why we need it? add new test model for aclgraph single_request v0.11.0 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-31 11:23:55 +08:00
lilinsiman	387ce1cc5b	add new e2e tests case for aclgraph memory to v0.11.0 (#3880 ) ### What this PR does / why we need it? add new e2e tests case for aclgraph memory to v0.11.0 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-31 09:17:09 +08:00
wangxiaoteng888	af7a56550b	[bugfix_v0.11.0-dev] layerwise D first plan (#3907 ) ### What this PR does / why we need it? Refactored the layerwise code to send to the D node first, preventing P-node hangs due to communication timeouts when DP > 1. --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-10-30 22:21:11 +08:00
offline893	d5a9aba03f	[BugFix]Fix group list type of mc2. (#3890 ) ### What this PR does / why we need it? Fix the precision issue caused by the inconsistency between the group list type used by mc2 and that of eplb. --------- Signed-off-by: offline0806 <3337230449@qq.com>	2025-10-30 21:44:14 +08:00
weichen	c506ba60fb	[v0.11.0] [Bugfix] [MoE]fix error in deepseek when using allgather (#3827 ) ### What this PR does / why we need it? After refactoring vllm_ascend/models and FusedMoE, we are unable to pass `gate` from deepseekv2.py to `AscendFusedMoE.forward`, which will result in error when running deepseek v3/r1 with allgather. Hence, this pr removes `gate` related computations from FusedMoE module in eager/aclgraph mode. ### Does this PR introduce _any_ user-facing change? `rm_router_logits` is deprecated in eager/aclgraph. ### How was this patch tested? e2e & ut Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-30 14:59:46 +08:00
whx	211d4b9da4	[BugFix] Fix mlapo accuracy problem related with weight processing. (#3857 ) This PR fixes a mlapo accuracy problem related with weight processing. Furthermore, modify mlapo related e2e test with quantized deepseek model to make it effective. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-30 00:35:50 +08:00
fems14	19f49ecb5f	[0.11.0][Bugfix]fix_mulit_connector_bug (#3332 ) (#3882 ) ### What this PR does / why we need it? When using multi connector, the multi connector does not define get_finished_count, which will cause the kv cache to be released ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: baxingpiaochong <771405853@qq.com>	2025-10-29 23:44:52 +08:00
realliujiaxu	29bd9235ed	[v0.11.0][Perf] Delete redundant operations in model_runner and forward_context (#3775 ) cherry pick https://github.com/vllm-project/vllm-ascend/pull/3677 Remove redundant operations from `model_runner` and `forward_context`. This optimization can significantly reduce the idle time (bubble) before decoding when running models with small parameter counts (e.g., Qwen/Qwen2.5-0.5B). Testing on 800I A2, bubble is reduced from 3.8ms to 2.8ms. Before: https://github.com/user-attachments/assets/d7608e52-2438-46dd-8fc9-391fd6274495 After: https://github.com/user-attachments/assets/56daf081-2dba-4d2e-99d4-e055187d9806 ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-10-29 15:58:53 +08:00
Ruri	825fdfb197	[v0.11.0][Feat] Prefetching Attention QKV Linear Weight With `AddRmsNormQuant` Custom Op (#3649 ) ### What this PR does / why we need it? - `qkv_proj.weight` prefetching has been implemented with `Quant` op, when `AddRmsNormQuant` is enabled (#3465) `qkv_proj.weight` prefetching won't work - Implement `qkv_proj.weight` prefetching with `AddRmsNormQuant`, which has been merged on `main` branch (#3517) ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Tested on `Qwen3-235B-A22B-W8A8` https://github.com/user-attachments/assets/0bc28082-0287-4d5c-b8f6-f907c3134d36 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>	2025-10-27 09:42:09 +08:00
whx	a58ff9e92f	[Cherry-pick] Port MoE multi-stream fix to v0.11.0-dev (#3753 ) This PR moves the communication operation of shared experts out of extra stream because I found that this might cause rtMemcpy related errors when running shared experts multistream with aclgraph. Furthermore, I utilize a global variable as extra stream object to avoid allocating streams for each layer in full-graph mode. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-25 15:51:43 +08:00
shaopeng-666	fed8145aea	[cherry-pick][Feat] Add mrope fusion op#3708 (#3735 ) ### What this PR does / why we need it? Add mrope fusion op for qwen2.5-vl. This mrope operator doesn't support Qwen3-VL currently, so it only takes effect in qwen2.5-vl. Cherry-picked from 39b994a987888f7ba78df28b1ccb41a5e8d6eaf5. CI passed with existing test. Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com> Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>	2025-10-25 11:41:23 +08:00
whx	0644113c35	[BugFix] cherry-pick PR 3736 to v0.11.0-dev (#3737 ) This PR comments out the newly added vlm e2e test of the ascend scheduler scenario because it gets stuck when running in multi-batch. It needs to be added back after this issue is dealt with. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-25 10:35:14 +08:00
whx	5a2c5be229	[BugFix][Cherry-pick] Cherry-pick PR 3675 to v0.11.0-dev (#3732 ) This PR cherry-picks the bugfix related with running multi-modal models with AscendScheduler to v0.11.0-dev Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-10-25 09:41:51 +08:00
liziyu	f3ea657e93	[0.11.0][Bugfix] fix delay free prefill req & D node support prefix cache (#3609 ) ### What this PR does / why we need it? Fix mooncake connector. In scenarios where TP is not equal, when the prefill TP size is less than the number of key-value heads, _get_remote_tp_ranks_for_req will return a list of np.arrays. Performing an operation like int in list of np.arrays will cause an error. Converting the list of np.arrays into a single np.array resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen235B: P tp16, D tp1; P tp8, D tp1; P tp4, D tp1; P tp8, D tp2 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-23 20:39:35 +08:00
rjg-lyh	74903af460	[v0.11.0][refactor] refactor SequenceRowParallelOp forward (#3654 ) ### What this PR does / why we need it? This PR refactors SequenceRowParallelOp forward. In order to further expand the operator inclusion scope in dynamic judgment scenarios, this PR customizes the entire matmul computation and communication as a custom operator masking. With this refactor, it will support directly writing code such as common operation fusion into the SequenceRowParallelOp class's member function matmul_and_reduce, without the need to register more redundant custom masking operators. ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-10-23 14:45:49 +08:00
Zetong Li	6e72bfdc50	[v0.11.0] cherry-pick Fix performance degradation when mtp>1 (#3597 ) (#3630 ) ### What this PR does / why we need it? cherry-pick Fix performance degradation when mtp>1 (#3597) This PR aims to fix performance degradation when mtp>1. Since mtp>1 may result in more tokens (i.e. larger batch size) than acl graph maximum batch size, this will cause draft model to run in eager mode. ### How was this patch tested? by ci --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-10-22 22:07:39 +08:00
wangxiyuan	a0c3b8dd2d	[v0.11.0]cherry-pick fix ut (#3608 ) (#3614 ) cherry-pick fix ut (#3608) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-22 14:14:15 +08:00
offline893	e916265b2b	[CI]Add EPLB CI. (#3568 ) ### What this PR does / why we need it? 1.Add eplb ci to check the change of eplb feature. 2.Add param checking of eplb params. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Qwen in A3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-21 22:58:02 +08:00
linfeng-yuan	4c9af353ee	Revert "[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 )" (#3586 ) ### What this PR does / why we need it? This reverts commit `bf87606932`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with `enable_shared_expert_dp: true` in eager mode as before. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-21 22:24:30 +08:00
whx	bd11c0054f	[BugFix] Fix torchair+mtp bug after deleting deepseek_mtp. (#3590 ) This is a missing bug fix introduced by PR #3561 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-21 22:23:52 +08:00
shaopeng-666	0c83eee9b1	fix vl float model not support NZ format weight error (#3533 ) ### What this PR does / why we need it? fix vl float model not support nz mm op ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com> Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>	2025-10-21 22:23:17 +08:00
wangxiyuan	13e8e75143	[Refactor] refactor patch module (#3555 ) ### What this PR does / why we need it? we notice that `patch_main` is never used. Usually the patch is for all version. And if it's for specified version, we can use `vllm_version_is` instead. So let's remove the useless sub folder in patch module to make it clear. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-21 20:19:46 +08:00
Anion	5f8b1699ae	[Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers (#3311 ) ### What this PR does / why we need it? Problem Description: The existing implementation for the w4a8-dynamic linear method only supports the old quantization format from msmodelslim. When attempting to load models quantized with the new version, vLLM encounters errors due to mismatched tensor shapes and unprocessed quantization parameters. Relevant issues: - https://github.com/vllm-project/vllm-ascend/issues/3192 - https://github.com/vllm-project/vllm-ascend/issues/3152 Proposed Changes: 1. Add support for w4a8 dynamic (new format) in AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod 2. Add unit tests and e2e tests for w4a8 dynamic new and old format models <details> <summary><b>details</b></summary> 1. Support for new w4a8-dynamic format: * Detects quantization format by reading the "version" field in quant_description to ensure backward compatibility. * Handles the new pre-packed weight format (`2x int4` in an `int8`), which has a halved dimension. It tells the vLLM loader how to unpack it using `_packed_dim` and `_packed_factor`. * Supports the new `scale_bias` parameter, setting its shape based on the layer type, as required by msmodelslim. For API consistency and future use, the `layer_type` parameter was also added to other quantization methods. * Updates the weight processing logic: new format weights are handled with `.view(torch.int32)` since they're pre-packed, while old ones are processed with `npu_convert_weight_to_int4pack`. 2. New unit and E2E tests: * Added unit tests that verify the logic for both the old and new formats. * Split the distributed E2E test to confirm that both old and new format models work correctly. </details> Theoretically, these changes will provide support for all common new version w4a8 (dynamic) models from msmodelslim. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? I implemented relevant unit tests and e2e tests and tested the changes with the following commands: ```bash # unit tests python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v # e2e tests pytest tests/e2e/singlecard/test_quantization.py -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s ``` I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic format: ``` vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 ``` All tests mentioned passed locally. NOTE: I use a quantized model from my own repo in test_offline_inference_distributed.py. Here is the description: [Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary) (including quantization steps). This should be replaced by a model in the vllm-ascend CI modelscope repo. Thanks for reading! - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Anionex <1005128408@qq.com>	2025-10-21 20:18:39 +08:00
whx	220df60c61	[Model][2/N] Remove deepseek_mtp modeling. (#3561 ) This PR is step 2 of deepseek model refactoring and removes deepseek_mtp. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-21 20:17:09 +08:00
liziyu	3164cb663c	[Bugfix] mooncake connector support external dp & update readme (#3579 ) ### What this PR does / why we need it? mooncake connector support external dp & update readme ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-21 20:15:24 +08:00
Chen Chen	6b290acfe1	remove redundant params in mla_preprocess kernel (#3530 ) ### What this PR does / why we need it? This pull request removes the redundant parameters `gamma1` and `beta1` (also named `gamma0`/`beta0` in some places) from the `mla_preprocess` kernel and its calling hierarchy. The changes are consistent across C++ kernel code, bindings, and Python call sites. The parameters were unused in the lower-level functions, so their removal is a good cleanup. ### Does this PR introduce _any_ user-facing change? The python interface of the kernel is affected, and the params of `gamma0` and `beta0` are not needed. ### How was this patch tested? The unit-test of the kernel is adapted accordingly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-10-21 19:20:13 +08:00
jiangyunfan1	80b8df881f	[TEST] Add Qwen3-32b-w8a8 acc/perf A2/A3 test (#3541 ) ### What this PR does / why we need it? This PR adds 8 Qwen3-32b-w8a8 acc/perf cases on A2 and A3; we need to test them daily. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: root <root@hostname-2pbfv.foreman.pxe> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-10-21 17:34:48 +08:00
Yizhou	ec1d2b5c04	[Test] Temporarily skip flaky ACL graph test (#3577 ) ### What this PR does / why we need it? Disables `FULL_DECODE_ONLY` end-to-end test that fails intermittently. This prevents CI blockages while the root cause of the flakiness is investigated. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-21 17:16:15 +08:00
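
Note on commit 4f0dddc9ee (#4885): the message says group_list is converted from a cumulative sum to counts before being passed as group_index. The sketch below only illustrates that conversion on plain Python lists. The helper name `cumsum_to_counts` and the sample numbers are hypothetical, and it does not call torch_npu or reproduce the actual vllm-ascend code.

```python
from itertools import accumulate


def cumsum_to_counts(cumulative):
    """Turn cumulative group totals into per-group counts.

    Each count is the difference between a cumulative total and the one
    before it; the first group is compared against zero.
    """
    counts = []
    previous = 0
    for total in cumulative:
        counts.append(total - previous)
        previous = total
    return counts


# Hypothetical routing result: tokens assigned to four experts.
counts = [3, 0, 5, 2]
cumulative = list(accumulate(counts))  # running totals per expert
recovered = cumsum_to_counts(cumulative)
print(cumulative, recovered)
```

An operator that expects counts but receives running totals would read the second expert above as having 3 tokens instead of 0, which is the kind of mismatch the commit message describes.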

462 Commits
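
Note on commit 5f8b1699ae (#3311): the message describes a pre-packed weight format with two int4 values stored in one int8, which halves the packed dimension. The following is a minimal pure-Python sketch of that idea. The function names and the low-nibble-first order are assumptions made for illustration; the real layout used by msmodelslim and vllm-ascend may differ.

```python
def pack_int4_pairs(values):
    """Pack signed 4-bit integers (-8..7) two per byte.

    Assumed layout: the first value of each pair goes in the low nibble,
    the second in the high nibble. Bytes are returned as ints in 0..255.
    """
    if len(values) % 2:
        raise ValueError("need an even number of int4 values")
    packed = []
    for low, high in zip(values[0::2], values[1::2]):
        for value in (low, high):
            if not -8 <= value <= 7:
                raise ValueError(f"{value} does not fit in a signed int4")
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return packed


def unpack_int4_pairs(packed):
    """Inverse of pack_int4_pairs: recover the signed 4-bit values."""

    def to_signed(nibble):
        # Nibbles 8..15 represent the negative values -8..-1.
        return nibble - 16 if nibble >= 8 else nibble

    values = []
    for byte in packed:
        values.append(to_signed(byte & 0xF))
        values.append(to_signed((byte >> 4) & 0xF))
    return values


weights = [-8, 7, 0, -1]
packed = pack_int4_pairs(weights)
print(len(weights), len(packed), unpack_int4_pairs(packed))
```

The packed list is half the length of the input, which is why a loader needs to be told which dimension is packed and by what factor before it can match tensor shapes.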