xc-llm-ascend

Author	SHA1	Message	Date
wangxiyuan	4144376e88	[CI] Fix UT (#5106 ) Fix broken ut introduced by #5053 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-17 09:52:20 +08:00
pichangping	06f33540c4	[UT]add the UT of pcp and dcp in the attention_cp file (#5054 ) ### What this PR does / why we need it? add the UT of pcp and dcp in the attention_cp file ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: pichangping <1337510399@qq.com>	2025-12-17 09:11:33 +08:00
Icey	cadfa5ddc1	[Fusion] [Graph] Add qknorm rope fusion operator (#4711 ) ### What this PR does / why we need it? This PR add `qkv_rmsnorm_rope` operator and introduces a graph fusion pass for `qknorm_rope` operations. The implementation includes a new configuration flag, a pattern matching pass using `torch._inductor.pattern_matcher`, and a custom Triton kernel for the fused operation. Co-authored-by: Angazenn [supperccell@163.com](mailto:supperccell@163.com) ### Does this PR introduce _any_ user-facing change? Yes, add new additional_config - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-17 08:53:44 +08:00
ZixuanWang	b1a853b0f6	Upgrade vllm commit hash to 1216 (#5053 ) ### What this PR does / why we need it? Upstream vLLM PR #30212 https://github.com/vllm-project/vllm/pull/30212 refactored the attention backend selection interface, This PR adapts vllm-ascend's get_attn_backend_cls to align with the new upstream standard, ensuring compatibility and reducing maintenance overhead. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? co-author:[leo-pony][nengjunma@outlook.com](mailto:nengjunma@outlook.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zxwang <1476209578@qq.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-12-17 08:48:36 +08:00
anon189Ty	5b1da4e914	[Feat] Support async_scheduler and disable_padded_drafter_batch in eagle (#4893 ) ### What this PR does / why we need it? We refactored the eagle_proposer.py to adapt the framework of eagle.py in vllm-v0.12.0, to support the logit of padded drafter batch and async-scheduler. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: drslark <slarksblood@qq.com>	2025-12-16 22:06:40 +08:00
zhenwenqi2024	4ed2951400	[Feature] refactor npu_modelrunner for profile_run (#4993 ) ### What this PR does / why we need it? (1) refactor npu_model_runner for profile_run (2) move _select_moe_comm_method to ascend_forward_context (3) delete _init_model_kwargs in npu_model_runner ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>	2025-12-16 17:44:04 +08:00
realliujiaxu	9e24bdd44c	[Feat] Refactor rejection sampler (#4975 ) ### What this PR does / why we need it? Currently, we are using `AscendRejectionSampler`, which extends `RejectionSampler`, in spec decoding. `AscendRejectionSampler` overrides `forward` of `RejectionSampler` only to replace the `rejection_sample` func. As a result, a lot of `RejectionSampler` code cannot be reused, for example: - https://github.com/vllm-project/vllm/pull/19482 - https://github.com/vllm-project/vllm/pull/26060 - https://github.com/vllm-project/vllm/pull/29223 #### Proposed Change: - Delete `AscendRejectionSampler` and use `RejectionSampler` directly in model runner. - Patch `RejectionSampler.expand_batch_to_tokens` and `RejectionSampler.rejection_sample`; a better way may be to make them custom ops. - Modify `NPUModelRunner` following https://github.com/vllm-project/vllm/pull/26060 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - [x] test logits processor for spec decoding - [x] test logprobs for spec decoding - [x] test logprobs for spec decoding + async scheduling (test with https://github.com/vllm-project/vllm-ascend/pull/4893/) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-12-16 11:32:26 +08:00
SILONG ZENG	237fad635c	[Fix]Revert temporary skip on mtp1/mtp2 correctness tests (aclgraph fix) (#5039 ) ### What this PR does / why we need it? This Pull Request removes the @pytest.mark.skip decorators from test_mtp1_correctness_piecewise_graph and test_mtp2_correctness_piecewise_graph. These tests were temporarily skipped because of an issue with the MTP ACL Graph (as per the original TODO comment). Since the relevant bug/issue has been resolved, these tests are now re-enabled to ensure full correctness coverage for MTP functionality. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-16 10:40:00 +08:00
Jade Zheng	c064d11fd7	[Cleanup] Remove unused attn_metadata parameter from Proposer classes (#4862 ) The `attn_metadata` is not used by any draft proposer, so we can remove it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 21:21:38 +08:00
Li Wang	8d2998d0e4	[Misc] Upgrade vllm hash to 12_14 (#5000 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? 1. fix https://github.com/vllm-project/vllm/pull/27938 2. fix https://github.com/vllm-project/vllm/pull/27145 pooling models now support chunked prefill and prefix caching, 3. fix https://github.com/vllm-project/vllm/pull/30181 define the CPU fields in the field config where they really belong. 4. fix https://github.com/vllm-project/vllm/pull/28168 define the CPU fields in the field config where they really belong. 5. fix https://github.com/vllm-project/vllm/pull/30201 some module renames 6. fix https://github.com/vllm-project/vllm/pull/29067 fusedmoe module refactor 7. fix https://github.com/vllm-project/vllm/pull/29066 fusedmoe module refactor 8. fix https://github.com/vllm-project/vllm/pull/29624 ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-15 19:54:23 +08:00
zengzengran	6029bea480	[UT]add pcp dcp ut (#4949 ) ### What this PR does / why we need it? Adding UT for DCP/PCP -vLLM version: v0.12.0 -vLLM main: `ad32e3e19c` Signed-off-by: zengran <zengran2@huawei.com>	2025-12-15 18:41:38 +08:00
Icey	5fae65f3a8	[Graph][Fusion] Add AddRMSNorm(with bias) and Quant Fusion Pattern (#5011 ) ### What this PR does / why we need it? AddRMSNorm(with bias) and Quant Fusion Pattern ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-15 18:37:56 +08:00
SILONG ZENG	70606e0bb9	[Test]update accuracy test of models (#4911 ) ### What this PR does / why we need it? Delete accuracy tests for models that are no longer retained: - Meta-Llama-3.1-8B-Instruct - llava-1.5-7b-hf - InternVL2-8B.yaml - InternVL2_5-8B.yaml - InternVL3-8B.yaml Add accuracy tests for the new models: - Llama-3.2-3B-Instruct - llava-onevision-qwen2-0.5b-ov-hf - Qwen3-VL-30B-A3B-Instruct - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-12-15 15:04:20 +08:00
drslark	8fb0ef5ffa	[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP in batched inference (#4932 ) ### What this PR does / why we need it? Fixes an accuracy bug of Qwen3-next-MTP in batched inference. It is described in https://github.com/vllm-project/vllm-ascend/issues/4930. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com>	2025-12-15 13:22:30 +08:00
wujinyuan1	545e856971	[Refactor]3/N Refactor mla_v1.py & extract mla_cp (#4933 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The CP-related functions differ significantly from those of normal MLA-Attention, yet the two are tightly coupled. Steps: Isolate PCP and DCP (1) create a new Python file: mla_cp.py (2) add classes AscendMlaCPImpl and AscendMlaCPMetadataBuilder, which inherit from AscendMLAImpl and AscendMLAMetadataBuilder (3) move PCP- and DCP-related methods from mla_v1.py to mla_cp.py - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-15 12:59:18 +08:00
LookAround0301	bb7b74c14f	add ut for model runner (#4991 ) ### What this PR does / why we need it? add ut for model runner - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2025-12-14 11:16:20 +08:00
AlvisGong	ba28d54f35	[Perf]enable prefill flashcommon3 (#4065 ) ### What this PR does / why we need it? moe multistream overlap to improve the performance. ### How was this patch tested? --additional-config '{"multistream_overlap_gate": true}' - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: AlvisGong <gwly0401@163.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2025-12-14 09:34:13 +08:00
Li Wang	0f92d34a70	[CI] Pull latest vllm-ascend src before tests (#4988 ) ### What this PR does / why we need it? Currently, our image build suffers from errors during cross-compilation, which sometimes cause the image build to fail (see https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186). This results in the nightly test code not being the latest version. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-13 19:04:14 +08:00
wangxiyuan	fd7c929145	[perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks (#4983 ) pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix the merge conflict ### What this PR does / why we need it? Currently, the all_reduce operation in _sync_metadata_across_dp is performed with gloo backend which is extremely time-consuming when DPEngineCores are in different nodes. This operation cannot be ignored by async scheduling in multi-node-scenarios with speculative decoding (e.g., EAGLE, mtp). This pr eliminates the all_reduce operation for D Nodes and change the input parameter of MoEDispatch & MoeCombine operators to make MC2EP support different num_tokens across all ranks. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios while enabling async scheduling. This pr can remove cross-node all_reduce with gloo backend and further reduce latency with correct accuracy. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>	2025-12-13 18:59:54 +08:00
wangxiyuan	5211e991ad	Revert "[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 )" (#4981 ) This reverts commit `332b547728`. This break deepseek3.2 in PD case. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c`	2025-12-13 18:58:55 +08:00
zhenwenqi2024	4721e4f53f	[bugfix] asyncscheduler bug fix (#4968 ) ### What this PR does / why we need it? vllm-ascend now uses AsyncGPUModelRunnerOutput; the earlier AsyncNPUModelRunnerOutput is outdated, so it needs to be fixed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2025-12-13 17:04:54 +08:00
MengLong Chen	fa367e3b1a	[CI] Add mtp_proposer ut (#4397 ) ### What this PR does / why we need it? Add mtp_proposer ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2025-12-12 20:41:31 +08:00
zhenwenqi2024	f708d919f8	[Feature] model_runner refactor (#4764 ) ### What this PR does / why we need it? Refactor npu_modelrunner so that it stays close to gpu_modelrunner. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>	2025-12-12 17:27:09 +08:00
Li Wang	5b12c068f9	[Nightly] Remove gen_ranktable logic (#4941 ) ### What this PR does / why we need it? Since the `llmdatadist` has sunset, the logic gen_ranktable should also be removed - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-12 17:20:18 +08:00
Clorist33	4984e8a284	[Bugfix] bugfix for moe_mlp (#4822 ) ### What this PR does / why we need it? This PR fixes a bug in the moe_mlp module by correcting the arguments passed to the torch_npu.npu_dequant_swiglu_quant function.It properly converts group_list from a cumulative sum to counts for the group_index parameter. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>	2025-12-12 14:51:20 +08:00
lidenghui1110	d65fb194d9	[Feat] Add custom Embedding tensor model parallel (#2616 ) Similar to #2309 , this PR introduces Embedding tensor model parallel to achieve decreasing of memory consumption. It support both eager mode and graph mode. And this PR refactor module tensor parallel configurations supported in #2309, #2167, #2120, merge all config into `finegrained_tp_config` in `additional_config`, including: `lmhead_tensor_parallel_size` `oproj_tensor_parallel_size` `embedding_tensor_parallel_size` `mlp_tensor_parallel_size` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Co-authored-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-12 14:41:20 +08:00
zhangxinyuehfad	bfafe30953	[CI] refactor e2e test (#4799 ) ### What this PR does / why we need it? This PR updates the CI configuration and adjusts a set of end-to-end (e2e) tests under tests/e2e/multicard, in order to refactor the test suite and ensure compatibility with current codebase and CI workflows. 1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B and rename the test case 2. tests/e2e/multicard/test_quantization.py: rename the test case 3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and rename test cases 4. tests/e2e/multicard/test_qwen3_next.py: rename test cases and change the W8A8 pruning model to the W8A8 model and remove the eager parameter 5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and remove the eager parameter 6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case and change Qwen3-30B to Qwen3-0.6B 7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases about torchair - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-12 08:42:08 +08:00
weijinqian0	a6ef3ac4e4	[Performance] Pre-issued exponential distribution operator. (#4908 ) Pre-issued exponential distribution operator. Result: Single inference saves 200-300 microseconds. Before: <img width="2257" height="1058" alt="2" src="https://github.com/user-attachments/assets/c1da19e2-a439-42cb-9d7c-c0218e61fd4c" /> After: <img width="2211" height="342" alt="image" src="https://github.com/user-attachments/assets/03c84292-c802-4755-949c-4266a9a72fc0" /> - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-11 23:02:51 +08:00
linfeng-yuan	0fbe0831ec	[bugfix][refactor] fix recompute_scheduler break with vllm 0.12.0 & support async scheduling & refactor recompute_scheduler.py (#4895 ) ### What this PR does / why we need it? Currently, the initialization and fundamental functions of RecomputeScheduler are broken with `vLLM v0.12.0`. This PR fixes the conflicts of `RecomputeScheduler` and refactors its implementation by inheriting from the original `Scheduler` of vLLM. Meanwhile, this PR also supports async scheduling with the recompute scheduler by implementing `AsyncRecomputeScheduler`, which simply inherits from `AsyncScheduler` of vLLM and `RecomputeScheduler` of vLLM-Ascend via Python MRO. ### Does this PR introduce _any_ user-facing change? No. The switch naming is the same as v0.11.0 : `recompute_scheduler_enable` ### How was this patch tested? E2E serving with 2P1D dsv3.1 passed. The performance was the same as the original vllm scheduler with `async_scheduling`, and preempted requests in D Nodes are successfully transferred to Proxy and further to P Node. This significantly improves the performance and robustness of PD disaggregation deployments. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-11 22:24:49 +08:00
SILONG ZENG	e56dba9b0d	[CI]cleanup e2e test (#4800 ) ### What this PR does / why we need it? This PR refactors the E2E multicard test suite to improve test case identification and maintainability. Specifically, it renames various test functions to be more descriptive (explicitly indicating model families like Qwen/DeepSeek and parallelism strategies like DP/TP/PP/EP) and cleans up outdated or redundant test configurations in the offline distributed inference tests. Key Changes: 1. Test Function Renaming (Standardization): Renamed multiple test functions across `tests/e2e/multicard/` to include clear suffixes/prefixes regarding the model and parallel strategy. This helps differentiate test cases in CI logs and prevents naming collisions. `test_aclgraph_capture_replay.py`: - `test_aclgraph_capture_replay_dp2` -> `test_aclgraph_capture_replay_metrics_dp2` `test_data_parallel.py`: - `test_data_parallel_inference` -> `test_qwen_inference_dp2` `test_data_parallel_tp2.py`: - `test_data_parallel_inference` -> `test_qwen_inference_dp2_tp2` `test_expert_parallel.py`: - `test_e2e_ep_correctness` -> `test_deepseek_correctness_ep` `test_external_launcher.py`: - `test_external_launcher` -> `test_qwen_external_launcher` - `test_moe_external_launcher` -> `test_qwen_moe_external_launcher_ep` - `test_external_launcher_and_sleepmode` -> `test_qwen_external_launcher_with_sleepmode` - `test_external_launcher_and_sleepmode_level2` -> `test_qwen_external_launcher_with_sleepmode_level2` - `test_mm_allreduce` -> `test_qwen_external_launcher_with_matmul_allreduce` `test_full_graph_mode.py`: - `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_DECODE_ONLY` -> `test_qwen_moe_with_full_decode_only` - `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL` -> `test_qwen_moe_with_full` `test_fused_moe_allgather_ep.py`: - `test_generate_with_allgather` -> `test_deepseek_moe_fused_allgather_ep` - `test_generate_with_alltoall` -> `test_deepseek_moe_fused_alltoall_ep` `test_offline_weight_load.py`: - `test_offline_weight_load_and_sleepmode` -> `test_qwen_offline_weight_load_and_sleepmode` `test_pipeline_parallel.py`: - `test_models` -> `test_models_pp2` 2. Distributed Inference Cleanup (`test_offline_inference_distributed.py`): model list changes: ``` QWEN_DENSE_MODELS = [ - "vllm-ascend/Qwen3-8B-W8A8", "vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8" + "vllm-ascend/Qwen3-8B-W8A8", ] ``` ``` - QWEN_W4A8_OLD_VERSION_MODELS = [ - "vllm-ascend/Qwen3-8B-W4A8", - ] - QWEN_W4A8_NEW_VERSION_MODELS = [ - "vllm-ascend/DeepSeek-V3-W4A8-Pruing", - "vllm-ascend/DeepSeek-V3.1-W4A8-puring", - ] + DEEPSEEK_W4A8_MODELS = [ + "vllm-ascend/DeepSeek-V3.1-W4A8-puring", + ] ``` Test Function Changes: - removed `test_models_distributed_QwQ` - removed `test_models_distributed_Qwen3_W8A8` - removed `test_models_distributed_Qwen3_W4A8DYNAMIC_old_version` - `test_models_distributed_Qwen3_W4A8DYNAMIC_new_version` -> `test_models_distributed_Qwen3_W4A8DYNAMIC` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-12-11 20:35:32 +08:00
wangxiyuan	06a66939cd	Remove mindie_turbo (#4896 ) mindie_turbo has been out of date for a long time. This PR removes the related register method. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-11 18:46:12 +08:00
wangxiyuan	b89763f1ed	[CI] speed up ut (#4901 ) avoid model download to speed up ut test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-11 18:45:43 +08:00
Icey	18221c0e1d	[Fusion] normalize fusion naming and enable e2e test (#4693 ) ### What this PR does / why we need it? This PR standardizes the fusion naming, changing `enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e testing. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-11 17:53:43 +08:00
lidenghui1110	332b547728	[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) ### What this PR does / why we need it? The current mooncake connector has the following problems when PP and MTP are enabled: 1. MTP layer kv caches are not transferred, which may lower the accept ratio: this PR adds MTP layer indices for the last PP stage after calculating end_layer in transfer_kv_cache 2. With MTP enabled, the default PP layer division may cause imbalance between stages, so the `VLLM_PP_LAYER_PARTITION` environment variable must be used to balance them by hand, but during mooncake connector kv transfer the decode node does not know the partition of the prefill node: this PR adds the config `pp_layer_partition` in `kv_connector_extra_config` so that the decode node can obtain the partition information of the prefill node. ### Does this PR introduce _any_ user-facing change? When prefill uses the `VLLM_PP_LAYER_PARTITION` environment variable, add `pp_layer_partition` in `kv_connector_extra_config` like below: ``` export VLLM_PP_LAYER_PARTITION=33,28 "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 1, "tp_size": 8, "pp_size": 2, "pp_layer_partition": "33,28" }, "decode": { "dp_size": 16, "tp_size": 1, "pp_size": 1 } } ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2025-12-11 17:23:21 +08:00
QilaiZhang	78bf211539	[OPS] support triton causal_conv1d_fn ops (#4119 ) ### What this PR does / why we need it? Support triton causal_conv1d_fn ops. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QilaiZhang <245706640@qq.com>	2025-12-11 15:52:39 +08:00
zzhxxx	eac72f5f23	[Feat] Flashcomm2 use o_shared linear (#4188 ) ### What this PR does / why we need it? It is mentioned in the [flashcomm2 technical report](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf) that FC2 will introduce full redundant storage of the o_proj matrix, which will put pressure on the memory. Therefore, the technical report proposed a compromise solution using otp2, but it will introduce additional reduce-scatter communication. We propose a shared linear feature (#2931 ) that supports distributing weights layer by layer to each card, avoiding the need for TP splitting, and can solve the memory issue. This PR depends on #3232 and #2931 ### Flashcomm2 flowchart <img width="1142" height="878" alt="PixPin_2025-11-14_13-37-39" src="https://github.com/user-attachments/assets/d45ea8db-d8ef-4d45-8e18-abd4d82ce3e0" /> ### Does this PR introduce _any_ user-facing change? Use environment variables ```bash export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 export VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED=1 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <2783294813@qq.com> Co-authored-by: zzh02232027 <zzh02232027@antgroup.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-11 12:43:04 +08:00
wangxiyuan	bb76f7962c	cleanup useless torchair logic (#4856 ) This PR clean up useless torchair logic in model runner. The moge doc is only for torchair, it can be removed as well. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-11 11:21:13 +08:00
chenjunyi	c12eb22cbe	[feat] mlapo add bf16 no_quant support (#4852 ) ### What this PR does / why we need it? This PR adds mlapo operation support for bf16 no_quant mode. ### Does this PR introduce _any_ user-facing change? This PR makes quant related parameters optional. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenjunyi <isjunyi.chen@gmail.com>	2025-12-11 11:06:56 +08:00
zhangyiming	c95c271538	[E2E] Optimize nightly testcase. (#4886 ) ### What this PR does / why we need it? Optimize nightly testcase. Changes: - tests/e2e/nightly/multi_node/config/models/Qwen3-235B-A3B.yaml: Add accuracy and performance benchmark - tests/e2e/models/configs/Qwen3-8B-Base.yaml: Delete - tests/e2e/models/configs/internlm-7b.yaml: Change to internlm3-8b-instruct - tests/e2e/nightly/models/test_deepseek_r1_w8a8_eplb.py: Change to DeepSeek-R1-0528-W8A8 model - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-11 10:15:39 +08:00
zhangyiming	66b0781840	[E2E] Refactor the e2e testcases. (#4789 ) ### What this PR does / why we need it? Refactor the e2e testcases. - tests/e2e/multicard/test_weight_loader.py: Remove the unused code. - tests/e2e/singlecard/multi-modal/test_internvl.py: Move to accuracy test. - tests/e2e/singlecard/test_aclgraph.py: Rename the file. - tests/e2e/singlecard/test_embedding_aclgraph.py : Combine with tests/e2e/singlecard/test_bge_model.py - tests/e2e/singlecard/test_completion_with_prompt_embeds.py: Delete eager mode and modify model to Qwen3-0.6B - tests/e2e/singlecard/test_quantization.py: Modify model to Qwen3-0.6B-W8A8 - tests/e2e/singlecard/test_vlm.py: Modify model to Qwen3-VL-8B - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: menogrey <1299267905@qq.com>	2025-12-11 10:15:00 +08:00
zhangyiming	11bebb518c	[E2E] Remove unused PD-disaggregation scripts in E2E test. (#4837 ) ### What this PR does / why we need it? Remove unused PD-disaggregation scripts in E2E test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-11 09:23:38 +08:00
wangxiyuan	f917d5edcf	Remove useless env (#4858 ) cleanup useless env. These envs are not used anymore `VLLM_ASCEND_TRACE_RECOMPILES`, `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE`, `VLLM_ASCEND_MLA_PA`, `PHYSICAL_DEVICES` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-11 06:51:07 +08:00
wangxiyuan	37db0844f5	Remove COMPILE_CUSTOM_KERNELS env (#4864 ) With more and more custom ops merged, disable `COMPILE_CUSTOM_KERNELS ` for vllm ascend seems useless now. Let's enable csrc compile by default. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-10 23:48:03 +08:00
drslark	0fb1dc43a1	[BugFix][main] Adapted Qwen3-Next-MTP to chunked prefill (#4770 ) ### What this PR does / why we need it? The pad `-1` modification is from https://github.com/vllm-project/vllm/pull/25743. It still has bugs for batched chunked prefill. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 22:54:24 +08:00
ChenCangtao	dd622aa6a6	[Feature] Support npugraph_ex backend (#4700 ) ### What this PR does / why we need it? We introduced the npugraph_ex backend through the vllm's adaptor dispatch mechanism to accelerate aclgraph. This solution is based on torch.compile and uses torchair to optimize the fx.graph. The performance gains are mainly obtained from the static kernel. We conducted tests on Qwen3-30B and achieved over 5% performance optimization. ### Does this PR introduce _any_ user-facing change? Yes, we add a new switch named "enable_npugraph_ex" in additional_config; the default is False. We also add an example to show how to register a custom replacement pass ### More information about this PR This feature depends on the release of CANN and torch_npu in Q4. We tested it on a package that has not been publicly released yet and verified that the functionality works. This feature is still experimental at the moment; setting the config to true will directly raise an error. Merging into the main branch initially involves some preliminary commits to facilitate subsequent development and testing of the feature, as well as to avoid submitting an excessively large PR at once. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com> Co-authored-by: panchao-hub <315134829@qq.com> Co-authored-by: wbigat <wbigat@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 20:48:05 +08:00
Yizhou	5b179c53f1	[FEAT] Support DeepSeek-V3.2 with `FULL_DECODE_ONLY` mode (#4706 ) ### What this PR does / why we need it? The first commit support `FULL_DECODE_ONLY`: - Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for slicing slots and positions, ensuring fixed tensor shapes. - Implement padding logic for `query_start_loc` in `NPUModelRunner` to support uniform decode in full graph mode, aligning with GPU runner behavior. - Adjust MLA cosine cache allocation to occur independently of graph mode and switch to using device-resident sequence lengths for attention metadata. - Remove redundant slicing of hidden states and outputs in `AscendSFAImpl` and optimize `sin`/`cos` cache updates. The second commit take MTP into account: - Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for slicing slots and positions, ensuring fixed tensor shapes. - Implement padding logic for `query_start_loc` in `NPUModelRunner` to support uniform decode in full graph mode, aligning with GPU runner behavior. - Adjust MLA cosine cache allocation to occur independently of graph mode and switch to using device-resident sequence lengths for attention metadata. - Remove redundant slicing of hidden states and outputs in `AscendSFAImpl` and optimize `sin`/`cos` cache updates. And the rest of them are just bugfix. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases needed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-10 20:11:09 +08:00
lidenghui1110	a82b0fa70e	mooncake connector support pipeline parallel & fix pp with flashcomm1 (#4054 ) ### What this PR does / why we need it? To support pipeline parallel with PD disaggregation, this PR support PP in mooncake connector and fix other bugs when enable pp with other optimization params, including following changes: - mooncake connector support pp in prefill, we do not support decode pp currently - fix bugs when enable both pp and flashcomm1 - optimize ascend-scheduler to support full batch in multiple pipeline stages, original implementation would cause all pipeline stages batch_size total summed to max_num_seq, which makes pipeline is not full, this optimization can make all stages running with full batch_size = max_num_seq, the same changes will contribute to vllm scheduler too. ### Does this PR introduce _any_ user-facing change? add `pp_size` in mooncake connector kv_connector_extra_config ``` "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 1, "tp_size": 4, "pp_size": 4 }, "decode": { "dp_size": 16, "tp_size": 1 } } ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: zss <zss@qq.com> Co-authored-by: zss <3265779424@qq.com>	2025-12-10 16:01:43 +08:00
Ruri	ce5872705e	[Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516 ) ### What this PR does / why we need it? Adds W4A16 quantization method for the Kimi-K2-Thinking model and updates relevant modules to support the new quantization method. - Implements complete W4A16 quantization method including weight packing/unpacking, per-group quantization parameter generation, post-processing logic and MoE method application. - Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts `with_quant` conditional logic to support W4A16 matrix multiplication. - Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and processing logic for `weight_packed` field. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com> Signed-off-by: Ruri <zhouxiang100@huawei.com>	2025-12-10 15:58:52 +08:00
SILONG ZENG	7132ae8532	[CI]Cleanup accuracy test (#4861 ) ### What this PR does / why we need it? Delete accuracy testing of some models: - Qwen2-VL-7B-Instruct - Qwen2.5-VL-7B-Instruct - gemma-2-9b-it - DeepSeek-V2-Lite - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: MrZ20 <2609716663@qq.com>	2025-12-10 14:13:56 +08:00
lianyibo	e32014ac1d	[Model] Support pooling models (#3122 ) ### What this PR does / why we need it? Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend, this pr covered the three model types of embed (cls_token, mean_token, lasttoken). After this [commit](`17373dcd93`), vllm has provided support for adapting pooling models on the v1 engine. This PR includes corresponding adaptations on the vllm-ascend side. Fixes #1960 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lianyibo <lianyibo1@kunlunit.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-12-10 11:37:57 +08:00
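
Note on commit 4984e8a284 (#4822): its message says `group_list` is converted from a cumulative sum to per-group counts before being passed as `group_index`. The sketch below illustrates only that arithmetic in plain Python. The function name `cumsum_to_counts` and the sample values are hypothetical, and it does not call `torch_npu` or reproduce the actual patch.

```python
from itertools import accumulate


def cumsum_to_counts(cumulative):
    """Convert running totals (cumulative sums) into per-group counts.

    Hypothetical helper for illustration only; not taken from vllm-ascend.
    """
    counts = []
    previous = 0
    for total in cumulative:
        if total < previous:
            raise ValueError("cumulative sums must be non-decreasing")
        # Each group's count is the difference to the preceding running total.
        counts.append(total - previous)
        previous = total
    return counts


# Made-up example: three groups holding 3, 0 and 5 tokens.
example_counts = [3, 0, 5]
example_cumsum = list(accumulate(example_counts))
assert cumsum_to_counts(example_cumsum) == example_counts
```

An empty group shows up as a repeated running total in the cumulative form and as a zero in the counts form, which is why the two layouts cannot be interchanged when an operator expects one or the other.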


1120 Commits