177 Commits

Author SHA1 Message Date
1092626063
ceadc2788d Revert "[refactor]support gatingtopk operator generalization (#4356)" (#4873)
This reverts commit c4a11a745a.

The npu_gating_top_k op caused a Qwen3-30B precision problem, so revert it.

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-10 15:45:20 +08:00
Wang Yixuan
d412565ec9 [Cherry-pick]bmm_transpose to v011dev (#3995)
### What this PR does / why we need it?
Add a custom op to accelerate the DeepSeek model. The fused op combines
bmm and transpose, and is applied to the MLA module.
Cherry-picked from commit c68ddc11ce53334fc9a17bad58342148cbf14e86
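For context, a minimal plain-PyTorch sketch of the unfused pattern this op replaces (the axes swapped here are an illustrative assumption; the Ascend kernel fuses both steps and avoids the intermediate copy):

```python
import torch

def bmm_then_transpose(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Unfused reference: batched matmul followed by a transpose plus copy.
    # (B, M, K) x (B, K, N) -> (B, M, N), then swap the first two axes.
    out = torch.bmm(a, b)
    return out.transpose(0, 1).contiguous()
```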

### Does this PR introduce _any_ user-facing change?
No

---------

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-08 19:22:14 +08:00
1092626063
c4a11a745a [refactor]support gatingtopk operator generalization (#4356)
### What this PR does / why we need it?
This PR is cherry-picked from:
https://github.com/vllm-project/vllm-ascend/pull/2958 and
https://github.com/vllm-project/vllm-ascend/pull/4340

Past:
npu_moe_gating_top_k could only support the 'group_count=256' pattern

Now:
1. npu_moe_gating_top_k supports all sizes of group_count
2. the functionality of `torch_npu.npu_moe_gating_top_k_softmax` is
included in `torch_npu.npu_moe_gating_top_k`

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8: TPS improves by 6%
2. Qwen3: the same as before
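As a rough illustration of what generalized grouped top-k gating computes, here is a plain-PyTorch reference sketch (illustrative only; it does not mirror the `torch_npu.npu_moe_gating_top_k` signature or its bias/scaling options):

```python
import torch

def grouped_topk_gating(router_logits: torch.Tensor, top_k: int,
                        group_count: int, group_top_k: int):
    # router_logits: (num_tokens, num_experts), num_experts divisible by group_count
    num_tokens, num_experts = router_logits.shape
    scores = torch.softmax(router_logits, dim=-1)
    # Score each expert group by its best expert and keep only the top groups.
    grouped = scores.view(num_tokens, group_count, -1)
    group_scores = grouped.max(dim=-1).values
    top_groups = group_scores.topk(group_top_k, dim=-1).indices
    mask = torch.zeros_like(group_scores).scatter_(1, top_groups, 1.0)
    # Pick top_k experts from the surviving groups only.
    masked = (grouped * mask.unsqueeze(-1)).reshape(num_tokens, num_experts)
    topk_weights, topk_ids = masked.topk(top_k, dim=-1)
    return topk_weights, topk_ids
```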

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-04 20:10:13 +08:00
SILONG ZENG
5ad0ccdc31 [v0.11.0] Upgrade CANN to 8.3.rc2 (#4332)
### What this PR does / why we need it?
Upgrade CANN to 8.3.rc2

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-11-21 22:48:57 +08:00
zhaomingyu13
650ce8ad19 [0.11.0][Bugfix] Fix ngram precision issue and open e2e ngram test (#4092)
### What this PR does / why we need it?
Fix ngram precision issue and open e2e ngram test
---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-11-11 09:58:03 +08:00
zhangxinyuehfad
d913f9474b [0.11.0][Fix] Fix Qwen2-Audio-7B-Instruct accuracy test (#4018)
### What this PR does / why we need it?

Fix Qwen2-Audio-7B-Instruct accuracy test

Backport: https://github.com/vllm-project/vllm-ascend/pull/4017

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-10 11:54:30 +08:00
wangx700
55e37f5041 [v0.11.0][Bugfix] Fix sleep mode level 2 e2e test (#4023)
### What this PR does / why we need it?
Fix the sleep mode level 2 e2e test.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Tested with e2e tests.

Signed-off-by: wangx700 <wangxin700@huawei.com>
2025-11-08 14:11:15 +08:00
wangxiyuan
7ee0b0b5d8 [cherry-pick]Upgrade CANN to 8.3.rc1 (#3945) (#3962)
This PR upgrades CANN from 8.2.rc1 to 8.3.rc1 and removes the CANN version
check logic.

TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base
image for UT is still 8.2. We'll fix it later.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-06 09:05:08 +08:00
lilinsiman
ee2e55e602 [v0.11.0][Test] Add new test model for aclgraph single_request v0.11.0 (#3889)
### What this PR does / why we need it?
Add a new test model for aclgraph single_request on v0.11.0.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 11:23:55 +08:00
lilinsiman
387ce1cc5b add new e2e test case for aclgraph memory to v0.11.0 (#3880)
### What this PR does / why we need it?
Add a new e2e test case for aclgraph memory to v0.11.0.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 09:17:09 +08:00
offline893
d5a9aba03f [BugFix]Fix group list type of mc2. (#3890)
### What this PR does / why we need it?
Fix the precision issue caused by the inconsistency between the group
list type used by mc2 and that of eplb.

---------

Signed-off-by: offline0806 <3337230449@qq.com>
2025-10-30 21:44:14 +08:00
whx
211d4b9da4 [BugFix] Fix mlapo accuracy problem related to weight processing. (#3857)
This PR fixes an mlapo accuracy problem related to weight processing.
Furthermore, it modifies the mlapo-related e2e test to use a quantized
DeepSeek model so that the test is effective.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-30 00:35:50 +08:00
whx
a58ff9e92f [Cherry-pick] Port MoE multi-stream fix to v0.11.0-dev (#3753)
This PR moves the communication operation of shared experts out of the extra
stream, because I found that it could cause rtMemcpy-related errors
when running shared-experts multistream with aclgraph.

Furthermore, I use a global variable as the extra stream object to avoid
allocating a stream for each layer in full-graph mode.
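A minimal sketch of the single global extra stream idea (assuming `torch.npu.Stream` is available once torch_npu is imported; the real code lives in vllm-ascend's MoE path):

```python
import torch
import torch_npu  # noqa: F401  (registers the torch.npu namespace)

_EXTRA_STREAM = None  # shared by all layers instead of one stream per layer

def get_extra_stream():
    global _EXTRA_STREAM
    if _EXTRA_STREAM is None:
        _EXTRA_STREAM = torch.npu.Stream()
    return _EXTRA_STREAM
```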

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 15:51:43 +08:00
whx
0644113c35 [BugFix] cherry-pick PR 3736 to v0.11.0-dev (#3737)
This PR comments out the newly added VLM e2e test for the Ascend scheduler
scenario because I found that it gets stuck when running multi-batch.
It needs to be added back after this issue is resolved.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 10:35:14 +08:00
whx
5a2c5be229 [BugFix][Cherry-pick] Cherry-pick PR 3675 to v0.11.0-dev (#3732)
This PR cherry-picks the bugfix related to running multi-modal models
with AscendScheduler to v0.11.0-dev.

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-10-25 09:41:51 +08:00
offline893
e916265b2b [CI]Add EPLB CI. (#3568)
### What this PR does / why we need it?
1. Add EPLB CI to check changes to the EPLB feature.
2. Add parameter checking for EPLB params.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested with Qwen on A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-21 22:58:02 +08:00
Anion
5f8b1699ae [Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers (#3311)
### What this PR does / why we need it?
**Problem Description:**

The existing implementation for the w4a8-dynamic linear method only
supports the old quantization format from msmodelslim. When attempting
to load models quantized with the new version, vLLM encounters errors
due to mismatched tensor shapes and unprocessed quantization parameters.

Relevant issues:
- https://github.com/vllm-project/vllm-ascend/issues/3192
- https://github.com/vllm-project/vllm-ascend/issues/3152

**Proposed Changes:**
1. Add support for w4a8 dynamic(new format) in
AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod
2. Add unit tests and e2e tests for w4a8 dynamic new and old format
models
<details>
<summary><b>details</b></summary>

1.  **Support for new w4a8-dynamic format:**
* Detects quantization format by reading the "version" field in
quant_description to ensure backward compatibility.
* Handles the new pre-packed weight format (`2x int4` in an `int8`),
which has a halved dimension. It tells the vLLM loader how to unpack it
using `_packed_dim` and `_packed_factor`.
* Supports the new `scale_bias` parameter, setting its shape based on
the layer type, as required by msmodelslim. For api consistency and
future use, the `layer_type` parameter was also added to other
quantization methods.
* Updates the weight processing logic: new format weights are handled
with `.view(torch.int32)` since they're pre-packed, while old ones are
processed with `npu_convert_weight_to_int4pack`.

2.  **New unit and E2E tests:**
* Added unit tests that verify the logic for both the old and new
formats.
* Split the distributed E2E test to confirm that both old and new format
models work correctly.

</details>
Theoretically, these changes provide support for all common new-version
w4a8 (dynamic) models from msmodelslim.
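To illustrate the "2x int4 in an int8" layout mentioned above, here is a conceptual packing sketch (the nibble order is an assumption; msmodelslim defines the actual layout, and old-format weights instead go through `npu_convert_weight_to_int4pack`):

```python
import torch

def pack_int4_pairs(w_int4: torch.Tensor) -> torch.Tensor:
    # w_int4: int4 values stored one per int8 element (range [-8, 7]); last dim even.
    lo = w_int4[..., 0::2].to(torch.uint8) & 0x0F
    hi = (w_int4[..., 1::2].to(torch.uint8) & 0x0F) << 4
    # Two int4 values per byte: the packed dim is halved (_packed_factor = 2).
    return (hi | lo).view(torch.int8)
```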

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
I implemented relevant unit tests and e2e tests and tested the changes with
the following commands:
```bash
# unit tests
python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v

# e2e tests
pytest tests/e2e/singlecard/test_quantization.py -v -s

pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s

```

I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic
format:
```
vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 
```

All tests mentioned passed locally.

**NOTE: I use a quantized model from my own repo in
test_offline_inference_distributed.py**. Here is the description:
[Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary)
(including quantization steps). This should be replaced by a model in the
vllm-ascend CI ModelScope repo.

Thanks for reading!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Anionex <1005128408@qq.com>
2025-10-21 20:18:39 +08:00
Chen Chen
6b290acfe1 remove redundant params in mla_preprocess kernel (#3530)
### What this PR does / why we need it?

This pull request removes the redundant parameters `gamma1` and `beta1`
(also named `gamma0`/`beta0` in some places) from the `mla_preprocess`
kernel and its calling hierarchy. The changes are consistent across C++
kernel code, bindings, and Python call sites. The parameters were unused
in the lower-level functions, so their removal is a good cleanup.

### Does this PR introduce _any_ user-facing change?

The Python interface of the kernel is affected: the `gamma0` and `beta0`
params are no longer needed.

### How was this patch tested?

The unit-test of the kernel is adapted accordingly.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: mojave2 <chenchen145@huawei.com>
2025-10-21 19:20:13 +08:00
jiangyunfan1
80b8df881f [TEST] Add Qwen3-32b-w8a8 acc/perf A2/A3 test (#3541)
### What this PR does / why we need it?
This PR adds 8 Qwen3-32b-w8a8 accuracy/performance cases on A2 and A3; we
need to test them daily.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running the test


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-10-21 17:34:48 +08:00
Yizhou
ec1d2b5c04 [Test] Temporarily skip flaky ACL graph test (#3577)
### What this PR does / why we need it?
Disables `FULL_DECODE_ONLY` end-to-end test that fails intermittently.

This prevents CI blockages while the root cause of the flakiness is
investigated.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-21 17:16:15 +08:00
lilinsiman
70bef33f13 add new accuracy test case for aclgraph (#3390)
### What this PR does / why we need it?
Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-20 20:04:04 +08:00
jiangyunfan1
9e59fc1510 [TEST] Add initial aisbench support and Qwen3 32B acc/perf test (#3474)
### What this PR does / why we need it?
This PR adds the first aisbench case for the nightly test; it lays a
foundation for subsequent performance and accuracy tests in the nightly suite.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-20 09:33:17 +08:00
anon189Ty
248ee7fa11 [Feat] Make full graph mode compatible with MTP (#3276)
### What this PR does / why we need it?
Make full graph mode able to run with MTP.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-10-17 20:19:56 +08:00
lilinsiman
1b424fb7f1 ACLgraph enable: Test cases revisions for all features (#3388)
### What this PR does / why we need it?
This PR revises the test cases of various features in the repository,
enabling aclgraph in those test cases.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-17 17:15:19 +08:00
Li Wang
4c4a8458a5 [CI] Refactor multi-node CI (#3487)
### What this PR does / why we need it?
Refactor the multi-machine CI use cases. The purpose of this PR is to
make it easier to add multi-machine CI use cases, allowing developers to
add multi-machine cluster model testing use cases (including PD
separation) by simply adding a new YAML configuration file.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-17 09:04:31 +08:00
menogrey
9ff6b0b862 [CI]: Fix doctest ci for main release (#3451)
### What this PR does / why we need it?
Fix doctest CI for the main release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: menogrey <1299267905@qq.com>
2025-10-16 14:38:11 +08:00
weichen
cec1fab509 Revert "[MoE] [Refactor] Remove manual memory cleanup (#3365)" (#3483)
This reverts commit 4f937f561d.

### What this PR does / why we need it?
This reverts commit 4f937f561d.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-15 22:25:46 +08:00
realliujiaxu
f69a83b7ba [Feat] Flash comm allgather EP (#3334)
Support flash comm v1 (Sequence Parallelism) for Allgather EP.
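As a rough sketch of the sequence-parallel idea (each rank keeps a slice of the tokens, computes locally, then all-gathers to restore the full sequence); this is a generic illustration, not the vllm-ascend flash comm implementation:

```python
import torch
import torch.distributed as dist

def sp_all_gather(hidden_local: torch.Tensor, tp_group) -> torch.Tensor:
    # hidden_local: this rank's (seq_len // tp_size, hidden) slice of the sequence.
    world_size = dist.get_world_size(tp_group)
    chunks = [torch.empty_like(hidden_local) for _ in range(world_size)]
    dist.all_gather(chunks, hidden_local, group=tp_group)
    return torch.cat(chunks, dim=0)  # full (seq_len, hidden) for ops that need it
```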

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: zhaozx-cn <zhaozx2116@163.com>
2025-10-15 19:36:32 +08:00
linfeng-yuan
099255e933 [bugfix] fix pipeline parallel for mla & sfa attention backend (#3459)
### What this PR does / why we need it?
Fix the pipeline-parallel break for the MLA & SFA attention backends caused
by a magic number in the metadata builder. The reported error:
`AttributeError: 'PPMissingLayer' object has no attribute 'self_attn'`

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR was tested with "mp" backend (PP2TP8 on an A3 node) as well as
"ray" backend (PP2TP8 on two A2 nodes).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-15 17:13:27 +08:00
weichen
4f937f561d [MoE] [Refactor] Remove manual memory cleanup (#3365)
### What this PR does / why we need it?
1. Replace manual memory cleanup with parameter passing.
2. FusedMoEPrepareAndFinalizeWithMC2 now inherits from All2All to avoid
duplicated code.
code.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-15 12:36:24 +08:00
CaranLic
15b2e5c995 Remove unused row_idx in token_dispatcher (#3442)
### What this PR does / why we need it?
The `row_idx` parameter is no longer used since
PR [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689), so
remove it across multiple files to drop unnecessary calculations and
parameter passing.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
accuracy test passed for Qwen3 235B and DeepSeek V3 671B after this PR.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-10-15 09:08:31 +08:00
zouyida2052
3642b64afc bugfix for mtp with multistream_moe (#3419)
### What this PR does / why we need it?
When inferring the DeepSeek MTP layer with multistream_moe, we should pass a
boolean to enable this feature; this also fixes bugs that occur in the MTP layer.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-15 08:59:58 +08:00
xuyexiong
02c26dcfc7 [Feat] Supports Aclgraph for bge-m3 (#3171)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
To start an online server with batch size 10 and a per-batch sequence length
of 8192, we set --max-num-batched-tokens=8192*10 (81920) to ensure the encoder
is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For batch size 10 with a per-batch sequence length of 8192, QPS improves from
85 to 104, a 22% improvement, and a lot of host-bound overhead is removed.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
2025-10-14 23:07:45 +08:00
yuzhup
78777237a9 [2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203)
### What this PR does / why we need it?

- Refactor and integrate a unified `WeightPrefetchMethod`
- Integrate `gate_up_proj.weight` in quantized Attention modules
- Prefetching these weights ahead of matmul-like operators improves
performance by reducing L2 cache transfer latency

### Does this PR introduce _any_ user-facing change?

Add a new config in `--additional-config` for configuration:
```json
{
    "weight_prefetch_config": {
        "enabled": true,
        "prefetch_ratio": {
            "moe": {
                "gate_up": 0.8
            }
        }
    }
}
```
This feature is enabled by default and can be disabled through this
configuration.
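A minimal sketch of the prefetch-then-matmul idea (assuming `torch_npu.npu_prefetch(tensor, dependency, max_size)` as the prefetch primitive; the actual `WeightPrefetchMethod` in vllm-ascend may differ):

```python
import torch
import torch_npu

def linear_with_prefetch(x: torch.Tensor, weight: torch.Tensor,
                         prefetch_ratio: float = 0.8) -> torch.Tensor:
    # Warm the first `prefetch_ratio` of the weight bytes before the matmul.
    max_bytes = int(weight.element_size() * weight.numel() * prefetch_ratio)
    torch_npu.npu_prefetch(weight, x, max_bytes)  # assumption, see note above
    return torch.matmul(x, weight.t())
```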

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: yuzhup <15705211260@163.com>
2025-10-14 20:16:33 +08:00
XiaoxinWang
9eb62935b8 fix pagedattention to support fullgraph. (#3436)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-14 16:10:09 +08:00
lilinsiman
22a1d91cf5 [CI] Add single request test case for aclgraph (#3392)
### What this PR does / why we need it?
This PR adds an online single-request DP2 test case for aclgraph.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-14 11:13:44 +08:00
Yikun Jiang
cd69385dab Add models test and add several new models yaml (#3394)
### What this PR does / why we need it?
This PR adds accuracy CI for several new models:
- `ascend test / accuracy` is for PR-triggered checks of popular models'
accuracy
- `ascend test / models` is for the accuracy report, full models test, and
nightly model test
- Add Qwen2-Audio-7B-Instruct, Qwen2-VL-7B-Instruct, Qwen3-8B,
Qwen3-VL-30B-A3B-Instruct

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes: https://github.com/vllm-project/vllm-ascend/pull/2330
Closes: https://github.com/vllm-project/vllm-ascend/pull/3362


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2025-10-12 17:27:50 +08:00
jiangyunfan1
d05d29ff0e Enable nightly test and add qwen3 32b test case (#3370)
### What this PR does / why we need it?
This PR adds a nightly test case for qwen3_32b bf16
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the case

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-10-12 15:46:28 +08:00
Chen Chen
bcc313e8f2 add mla_preprocess kernel (#3226)
### What this PR does / why we need it?

- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.

### Does this PR introduce any user-facing change?

- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.

### How was this patch tested?

- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.

- vLLM version: v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
2025-10-12 07:39:45 +08:00
Li Wang
1b1207e3c3 [Bugfix] Add quantization param for multi-node CI (#3383)
### What this PR does / why we need it?
Add quantization param for `deepseek-w8a8` multi-node test
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-11 19:25:16 +08:00
huangxialu
e8c871ed0a [Test] enable external launcher and add e2e test for sleep mode in level2 (#3344)
### What this PR does / why we need it?
1. Enable tests/e2e/multicard/test_external_launcher.py
2. Add an e2e test for sleep mode in level 2

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: huangxialu <huangxialu1@huawei.com>
Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>
2025-10-11 17:29:38 +08:00
Li Wang
9eb103607f [1/N][CI] Add multi node test (#3359)
### What this PR does / why we need it?
This PR aims to add multi-node tests; as a first step, it adds a
`deepseek-v3` DP+TP+EP test.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-11 14:50:46 +08:00
panchao-hub
1756efa5fd [Feat][Graph] Support FULL_DECODE_ONLY mode for MLA models (#3125)
### What this PR does / why we need it?
Adds support for capturing the Multi-Layer Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.

Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.

Known issues:
- Currently, MTP is not supported in FULL_DECODE_ONLY mode -- we're
working on a fix
- We are preparing to remove update_mla_attn_params with
auto_dispatch_capture

### Does this PR introduce _any_ user-facing change?
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
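For example, a minimal offline-inference sketch passing this option (the model path is a placeholder; assumes `LLM(compilation_config=...)` accepts a dict, as in vLLM v0.11.0):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/deepseek-model",  # placeholder
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```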
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-10 16:31:20 +08:00
wangxiyuan
ba19dd3183 Revert PTA upgrade PR (#3352)
We noticed that torch_npu 0919 doesn't work. This PR reverts the related
changes which rely on the 0919 version.
Reverted PRs: #3295, #3205, #3102

Related: #3353

- vLLM version: v0.11.0
2025-10-10 14:09:53 +08:00
zhangxinyuehfad
601a37aeff [Fixbug] Fix accuracy template (#3088)
### What this PR does / why we need it?
Fix empty lines between lm_eval command lines in the accuracy template

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-10 09:03:21 +08:00
XiaoxinWang
579b7e5f21 add pagedattention to support FULL_DECODE_ONLY. (#3102)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-10 08:50:33 +08:00
wangxiyuan
1c5b302f0d [Misc] Clean up useless patch (#3320)
### What this PR does / why we need it?
1. Clean up v0.10.2 support in UT and e2e tests.
2. Remove the v0.11.0 periodic job; we're at v0.11.0 now.
3. Remove useless patches for DeepSeek V3.2; they have been done in vLLM
already.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 14:07:26 +08:00
wangxiyuan
c73dd8fecb [CI] Fix CI by addressing max_split_size_mb config (#3258)
### What this PR does / why we need it?
Fix CI by addressing max_split_size_mb config

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
Full CI passed, especially the eagle one


- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-29 14:05:12 +08:00
wangxiyuan
15b8aff582 [CI] Add max_split_size_mb for e2e test to avoid oom (#3252)
### What this PR does / why we need it?
We added a patch for the model weight loader to avoid using vLLM weight
loader v2, since v2 leads to unknown issues for torchair. However, this
patch introduces some unknown memory usage problems. As a quick fix, let's
expand `max_split_size_mb` to a larger value to avoid the weight-load OOM
issue.

A further solution is to remove the patch and adopt weight loader v2
from vLLM.

Closes: https://github.com/vllm-project/vllm-ascend/issues/3251
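A minimal sketch of the allocator tweak described above (the variable name follows torch_npu's `PYTORCH_NPU_ALLOC_CONF`; the 512 MB value is an illustrative assumption, not the value used in CI):

```python
import os

# Must be set before the NPU allocator is initialized (i.e. before importing torch_npu).
os.environ.setdefault("PYTORCH_NPU_ALLOC_CONF", "max_split_size_mb:512")
```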

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-29 09:13:08 +08:00
Wang Kunpeng
859e861d92 [main][quantization] Support deepseek w4a8 per-channel quantization (#3011)
### What this PR does / why we need it?
1. Support DeepSeek w4a8 per-channel quantization
2. Eager mode supports converting weights to the NZ format
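A conceptual sketch of the eager-mode NZ conversion (assuming `torch_npu.npu_format_cast` with format id 29 for FRACTAL_NZ; the actual quantization code path may differ):

```python
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # assumption: torch_npu's id for the NZ layout

def to_nz(weight_npu: torch.Tensor) -> torch.Tensor:
    # Cast an NPU tensor's internal storage format to FRACTAL_NZ for faster matmuls.
    return torch_npu.npu_format_cast(weight_npu, ACL_FORMAT_FRACTAL_NZ)
```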
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim

##### Installation steps

git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### Generate w4a8 per-channel weights

cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-09-27 21:01:16 +08:00