xc-llm-ascend

Author	SHA1	Message	Date
offline893	76844eec78	Dynamic Expert Load Balance with Zero-like-overhead (#2956 ) ### Motivation Currently, dynamic expert load balancing (EPLB) stops the world. Asynchronous expert load balancing would be better, because it avoids the following problems: Host-bound latency: EPLB involves many CPU operations, such as the EPLB algorithm, creating p2p ops, and log2phy expert conversion, which together take a long CPU time of ~1s. Communication latency: The transfer is costly without NVLink. Because the weights of one expert may be transferred to multiple new positions, one expert can require N send/recv operations, which results in long latency. We measured that batch_isend_irecv costs more than 100ms to transmit the weights of 16 experts on an Ascend A2 server. SwiftBalancer no longer stops the world: in our tests on NPU it costs 1~2ms per layer while improving decode latency by 5ms-8ms with ep_size = 64. The following updates have been made: 1. Expert distribution recording with lower cost. 2. Async CPU computing for the EPLB algorithm and other Python operators. 3. A new EPLB algorithm with less expert rebalancing and almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the vLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor shape and reduces the cost of HBM recording and the add operator. 2. Do an all-gather of the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume. 3. Wake up the EPLB worker process with the expert distribution when num_iterations is reached. Run the EPLB algorithm in the EPLB worker. 4. Generate p2p send/recv ops and run other operators such as log2phy, which cost long CPU time. 5. Launch ibatch_send_recv in async_stream before forward. 6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `567939953b` --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>	2025-09-17 10:36:43 +08:00
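The recording and wake-up steps of the workflow in the commit above can be sketched roughly as follows. This is a minimal pure-Python illustration and not the code from the PR: `ExpertLoadRecorder`, `record`, and `simulated_all_gather` are hypothetical names, plain lists stand in for NPU tensors, and the all-gather is simulated by concatenating per-rank counts.

```python
from typing import List


class ExpertLoadRecorder:
    """Accumulates per-expert token counts on one rank (hypothetical helper)."""

    def __init__(self, num_local_experts: int, num_iterations: int) -> None:
        self.counts = [0] * num_local_experts
        self.num_iterations = num_iterations
        self.step = 0

    def record(self, expert_token_num: List[int]) -> bool:
        """Add the counts of one forward pass.

        Returns True when num_iterations passes have been recorded, i.e. when
        the EPLB worker should be woken up.
        """
        for expert_id, num_tokens in enumerate(expert_token_num):
            self.counts[expert_id] += num_tokens
        self.step += 1
        return self.step % self.num_iterations == 0


def simulated_all_gather(per_rank_counts: List[List[int]]) -> List[int]:
    """Stand-in for all-gather: concatenate the per-rank counts in rank order."""
    return [n for rank_counts in per_rank_counts for n in rank_counts]


# Two ranks with two local experts each, rebalancing every 2 forward passes.
rank0 = ExpertLoadRecorder(num_local_experts=2, num_iterations=2)
rank1 = ExpertLoadRecorder(num_local_experts=2, num_iterations=2)

first_due = rank0.record([5, 1])
rank1.record([0, 2])
second_due = rank0.record([3, 1])
rank1.record([1, 3])

global_load = simulated_all_gather([rank0.counts, rank1.counts])
```

The recorded vector has one entry per local expert, independent of the number of tokens in the batch, which is the reason the commit gives for preferring expert_token_num over topk_ids.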
xuyexiong	ae758dda05	[Bugfix] Fix mtp torchair in pd Disaggregation scenario (#2951 ) ### What this PR does / why we need it? 1. In memory of #2509, fix mtp torchair in the pd disaggregation scenario. 2. Fix an mla bug in the SpecDecoding scenario, since num_decodes != num_decode_tokens. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `5206ab20ba` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-17 09:07:58 +08:00
rjg-lyh	6b7117dbb7	[main] addrmsnorm + quant fusion optim in Dense Models (#2772 ) ### What this PR does / why we need it? This PR fused addrmsnorm op and w8a8 quant op to get better perf. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.2 - vLLM main: `0faf3cc3e8` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-09-16 22:31:38 +08:00
yiz-liu	88ca8a051c	[Feat][Graph] Support DeepSeek with ACL Graph (#2707 ) ### What this PR does / why we need it? In memory of #677 , a long overdue milestone. Now DeepSeek V3/R1 should be OK with ACL Graph. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Working on it. - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 17:50:17 +08:00
linfeng-yuan	1c5900327b	[refactor] refactor deepseek-related files (#2849 ) ### What this PR does / why we need it? This PR deletes ~2K lines of code about deepseek modeling. It falls back CustomDeepseekV2 modules to original vllm implementations and adapts some modifications in vllm about deepseek and moe. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with torchair graph mode and eager mode. - vLLM version: v0.10.2 - vLLM main: `759ef49b15` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 14:13:07 +08:00
weichen	18ca7861f6	[Main] [Refactor] Enable MoECommMethod in Eager Mode (#2791 ) ### What this PR does / why we need it? 1. Replace prepare/finalize operation in fused_moe.py by moe_comm_method.prepare()/finalize() 2. Replace unified_fused_experts by moe_comm_method.fused_experts() in fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py 3. Add calling _select_moe_comm_method in spec-decode proposers. 4. Currently, w4a8_dynamic does not support gatherep, use all2allv instead. 5. Remove redundant code. ### Does this PR introduce _any_ user-facing change? AllgatherEP switch is disabled in aclgraph/eager mode, just follow the rules in modelrunner_v1._select_moe_comm_method() ### How was this patch tested? e2e & ut - vLLM version: v0.10.2 - vLLM main: `7f6f2c1182` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-16 11:06:00 +08:00
wangxiyuan	c556038ef0	[New model] Qwen3-next support (#2917 ) ### What this PR does / why we need it? Add Qwen3-next support. ### Does this PR introduce _any_ user-facing change? Yes, users can use Qwen3 next. Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 the tutorial will be ready in [here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) ### How was this patch tested? Doc CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/2884 Co-Authored-By: Angazenn <supperccell@163.com> Co-Authored-By: zzzzwwjj <1183291235@qq.com> Co-Authored-By: MengqingCao <cmq0113@163.com> Co-Authored-By: linfeng-yuan <1102311262@qq.com> Co-Authored-By: hust17yixuan <303660421@qq.com> Co-Authored-By: SunnyLee219 <3294305115@qq.com> Co-Authored-By: maoxx241 <maoxx241@umn.edu> - vLLM version: v0.10.2 - vLLM main: `b834b4cbf1` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Angazenn <supperccell@163.com> Signed-off-by: Your Name <you@example.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: hust17yixuan <303660421@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Angazenn <supperccell@163.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: hust17yixuan <303660421@qq.com>	2025-09-16 01:17:42 +08:00
wangxiyuan	382c29f3e1	[BugFix] Fix world size bug in model_runner (#2915 ) - Fix world size bug in model_runner to make sure ep>16 runs with MC2 - enable e2e test for vl Co-Authored-By: whx-sjtu <2952154980@qq.com> Co-Authored-By: Icey <1790571317@qq.com> - vLLM version: v0.10.2 - vLLM main: `3e903b6cb4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-14 12:20:25 +08:00
fan2956	c5a502fd2e	main add ascend scheduler support multimodal (#2844 ) ### What this PR does / why we need it? On main, AscendScheduler does not support multimodal models because it lacks scheduled_encoder_inputs, which is needed for multimodal inference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: main@93e28e6862669e3b5cf47cea9f782a65ec47e155 - vLLM version: v0.10.2rc2 - vLLM main: `15b8fef453` --------- Signed-off-by: fan2956 <zhoufan53@huawei.com> Co-authored-by: zhoufan2956 <zhoufan2956@163.com>	2025-09-14 09:38:51 +08:00
zxr2333	0a27705917	fix mooncake connector adxl hostname usage (#2824 ) ### What this PR does / why we need it? This PR is used to adapt the hostname format for Mooncake when using adxl. When Mooncake uses adxl, it is necessary to set ```USE_ASCEND_DIRECT``` to True in the file ```/Mooncake/mooncake-common/common.cmake``` during compilation. The mooncake_connector obtains this config by calling ```vllm_config.kv_transfer_config.get_from_extra_config```, determines whether Mooncake is using adxl, and selects the corresponding hostname format. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: main - vLLM main: `d21a36f5f9` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-09-13 14:38:48 +08:00
Jiawei Li	e57cca971c	Fix the bugs about operator registration by PyTorch Dispatcher (#2786 ) Background: There are two principles about operator registration in PyTorch - The same namespace can be only registered once by `TORCH_LIBRARY` - The operator signatures can be only registered once by `def` Considering that all custom operators defined in the current repo are only used by Ascend, instead of defining a common operator schema by vLLM, all accelerators then follow this operator schema and complete the implementation based on their respective hardware, which is conducive to functional abstraction. Therefore, we can rename the operator registration namespace to an Ascend-specific namespace(_C_ascend). Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742 - vLLM version: main - vLLM main: `f592b3174b` Signed-off-by: FFFrog <ljw1101.vip@gmail.com>	2025-09-13 11:58:52 +08:00
rjg-lyh	585a494baa	[Core] Disable the chunked prefill feature in Non-MLA LLMs (#2894 ) ### What this PR does / why we need it? This PR forcibly disables the chunked prefill feature in Non-MLA models, as the performance of operators supporting this functionality is currently suboptimal. The feature remains available only if the user has explicitly enabled chunked prefill in the ascend_scheduler_config. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Related: https://github.com/vllm-project/vllm-ascend/pull/2659 - vLLM version: main - vLLM main: `d21a36f5f9` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-09-12 23:17:09 +08:00
Yikun Jiang	756b8a1946	Revert "[Feat] Unquantized linear nz support (#2619 )" (#2896 ) ### What this PR does / why we need it? This reverts commit `7b2ecc1e9a`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: main - vLLM main: `64d90c3e4f` Closes: https://github.com/vllm-project/vllm-ascend/issues/2890 Closes: https://github.com/vllm-project/vllm-ascend/issues/2887 Closes: https://github.com/vllm-project/vllm-ascend/issues/2885 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-12 20:51:12 +08:00
rjg-lyh	fc2bcbe21c	[Ops] Fix bug in register_custom_ops without forward_context (#2883 ) ### What this PR does / why we need it? This PR fixed the bug in register_custom_ops without forward_context. We set try-except to consider this situation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: `7920de0a2a` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-09-12 16:58:08 +08:00
realliujiaxu	778cb72556	fix bug when rotary_dim is not 128 (#2847 ) ### What this PR does / why we need it? `torch_npu.npu_apply_rotary_pos_emb` only supports head_size and rotary_dim equal to 128, so an error occurs when running GLM. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: main - vLLM main: `404c85ca72` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-09-12 09:49:36 +08:00
22dimensions	f5a97e8fa5	[Quantization] register AscendQuantRMSNorm for quantization (#2856 ) ### What this PR does / why we need it? modelslim generates self.bias for RMS norm during quantization. Since RMSNorm in vllm has no such parameter, it is necessary to create an AscendQuantRMSNorm. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? tested by deepseek-v3.1-w8a8 <img width="2496" height="592" alt="image" src="https://github.com/user-attachments/assets/004c6e76-3d7a-4a1f-b59f-a14304012663" /> - vLLM version: main - vLLM main: `d6249d0699` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-09-11 23:14:02 +08:00
wyu0-0	eab3635850	[Bugfix] Retrieve num_redundant_experts from eplb_config in torchair qwen3_moe.py (#2857 ) ### What this PR does / why we need it? This PR addresses a configuration retrieval issue related to EPLB (Expert Parallel Load Balancing) settings in qwen3_moe.py. The key change is adjusting the source of num_redundant_experts to correctly fetch from the eplb_config sub-structure within parallel_config, rather than directly from parallel_config. This aligns with the updated configuration hierarchy for EPLB-related parameters. This change references `vllm_ascend/models/qwen3_moe.py` https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/models/qwen3_moe.py#L255-L257 ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? run bash as follows and test pass ``` source /sfs_turbo/humpy/B080/cann_b080/ascend-toolkit/set_env.sh source /sfs_turbo/humpy/B080/cann_b080/nnal/atb/set_env.sh #export HCCL_BUFFSIZE=300 # export HCCL_SOCKET_IFNAME="eth0" # export TP_SOCKET_IFNAME="eth0" # export GLOO_SOCKET_IFNAME="eth0" # export HCCL_IF_IP=33.215.118.231 export VLLM_USE_V1=1 export VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ=1 export TASK_QUEUE_ENABLE=1 # export VLLM_VERSION=0.9.1 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_INTRA_PCIE_ENABLE=1 export HCCL_INTRA_ROCE_ENABLE=0 rm -rf ./.torchair_cache/ rm -rf ./dynamo_* rm -rf /root/ascend/log/debug/plog/* python -m vllm.entrypoints.openai.api_server \ --model=/sfs_turbo/tzq/model/Qwen/Qwen3-235B-A22B/ \ --served-model-name auto \ --port 8006 \ -tp 1 \ -dp 16 \ --enable_expert_parallel \ --max-num-seqs 48 \ --max-model-len 32768 \ --gpu-memory-utilization 0.95 \ --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes_init":false,"graph_batch_sizes":[1, 8, 16, 24, 48]}, "ascend_scheduler_config":{"enabled":false}, "refresh":true}' \ --kv-transfer-config \ '{ "kv_connector": 
"SharedStorageConnector", "kv_buffer_device": "npu", "kv_role": "kv_consumer", "kv_parallel_size": 2, "kv_port": "20002", "engine_id": "decode-'${NODE_RANK}'", "kv_rank": 1, "kv_connector_extra_config": { "prefill": { "dp_size": 1, "tp_size": 16 }, "decode": { "dp_size": 16, "tp_size": 1 } } }' \ 2>&1 disown ``` - vLLM version: main - vLLM main: `0ae43dbf8c` Signed-off-by: wyu0-0 <woshilynn@163.com>	2025-09-11 22:15:19 +08:00
Angazenn	aeffe27b30	[Perf]set moe w2_weight default to be nz (#2842 ) ### What this PR does / why we need it? This PR sets the default format of GMM w2_weight in w8a8_dynamic to be NZ to improve performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: main - vLLM main: `e40827280b` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-09-11 21:40:54 +08:00
wuweiqiang24	9615dea3a7	Refactor tensor_parallel and comm_utils (#2814 ) ### What this PR does / why we need it? 1. Move ops/comm_utils to ops/moe/comm_utils 2. Move distributed/tensor_parallel/gather_from_sequence_parallel_region to ops/moe/comm_utils 3. Delete distributed/tensor_parallel ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: main - vLLM main: `a1213fae5f` --------- Signed-off-by: wuweiqiang24 <1005334931@qq.com> Signed-off-by: wuweiqiang24 <wuweiqiang11@huawei.com>	2025-09-11 21:26:36 +08:00
rjg-lyh	0005479b9c	[main] mlp weight prefetch in Qwen Dense Models (#2816 ) ### What this PR does / why we need it? This PR prefetches the weights of mlp layers in Qwen Dense Models, mainly to optimize performance in the decode phase. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: `a1213fae5f` Signed-off-by: rjg-lyh <1318825571@qq.com> Co-authored-by: Shuming19 <313093131@qq.com>	2025-09-11 21:20:09 +08:00
无脸男	c3c2221503	[Feat]support dynamic quantization in allgather (#2841 ) ### What this PR does / why we need it? [Feat]support dynamic quantization in allgather ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: main - vLLM main: `5931b7e5d9` Signed-off-by: withHades <244036962@qq.com> Signed-off-by: WithHades <244036962@qq.com>	2025-09-11 18:47:20 +08:00
6lazijiamo	bd3dedea61	support qwen25 vl w8a8 quantization (#2778 ) ### What this PR does / why we need it? support qwen25 vl w8a8 quantization ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `62f66be1f7` --------- Signed-off-by: lijiaojiao <lijiaojiao990304@163.com> Co-authored-by: lijiaojiao <lijiaojiao990304@163.com>	2025-09-11 16:40:51 +08:00
jiangpeng	2b9269b581	[Perf][V1] Fully overlap model execution (#2783 ) This PR is based on top of [#23569](https://github.com/vllm-project/vllm/pull/23569) and [#24219](https://github.com/vllm-project/vllm/pull/24219). ### What this PR does / why we need it? This PR allows the model runner to function asynchronously when using async scheduling. This allows full overlap of the cpu operations (including prepare_inputs) and the model forward pass. This diff is functional and does not support speculative decoding, PP, or guided decoding. Expected speedup is 5-10% over the current async scheduling. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? server ``` python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B\ --trust-remote-code --enforce-eager \ --distributed-executor-backend=mp \ -tp=4 \ --port 8006 \ --max-model-len 32000 \ --block-size 128 \ --gpu-memory-utilization 0.99 ``` client ``` python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \ --dataset-name random --random-input-len 2048 --random-output-len 2048 \ --ignore-eos\ --num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \ --metric-percentiles 90 --base-url http://localhost:8006 --save-result \ --result-dir $PROFILER_DIR ``` benchmark test based on Qwen3-32B TPOT result: \|\|forward async\| scheduler async \|sync\| \|-\|-\|-\|-\| \|avg\|41.73\|41.86\|44.20\| \|improve0\|0.3%\|0\|0\| \|improve1\|5.58%\|0\|0\| benchmark test based on Qwen2___5-VL-7B-Instruct TPOT result: \|\|forward async\|sync\| \|-\|-\|-\| \|avg\|23.22\|29.16\| \|improve\|20.3%\|0\| - vLLM version: main - vLLM main: `e93f4cc9e3` Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com> Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com> Co-authored-by: Ronald1995 <ronaldautomobile@163.com>	2025-09-11 16:35:36 +08:00
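The "improve" rows in the TPOT tables of the commit above appear to be relative reductions against a baseline. A quick check, assuming improvement = (baseline - new) / baseline (the function name is illustrative and the formula is an assumption about how the author computed the figures):

```python
def relative_improvement(baseline_ms: float, new_ms: float) -> float:
    """Relative TPOT reduction in percent, (baseline - new) / baseline * 100."""
    return (baseline_ms - new_ms) / baseline_ms * 100.0


# Qwen3-32B: forward async (41.73) against sync (44.20) and scheduler async (41.86).
qwen3_vs_sync = relative_improvement(44.20, 41.73)
qwen3_vs_scheduler_async = relative_improvement(41.86, 41.73)

# Qwen2.5-VL-7B-Instruct: forward async (23.22) against sync (29.16).
qwen25_vl_vs_sync = relative_improvement(29.16, 23.22)

print(f"{qwen3_vs_sync:.2f}% {qwen3_vs_scheduler_async:.2f}% {qwen25_vl_vs_sync:.2f}%")
```

Under this assumption the results land within a few hundredths of a percentage point of the reported 5.58%, 0.3%, and 20.3%; the reported values look truncated rather than rounded.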
zhaozx-cn	923cdaeba3	fix ascend fused moe spelling error (#2863 ) ### What this PR does / why we need it? fix ascend fused moe spelling error ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? `0ae43dbf8c` - vLLM version: main - vLLM main: `fcc0a3130a` Signed-off-by: zhaozixin <zhaozixin1@huawei.com> Co-authored-by: zhaozixin <zhaozixin1@huawei.com>	2025-09-11 14:35:46 +08:00
zhaozx-cn	b9a0a75c78	fix qwen torchair attention PrefillCacheHit (#2787 ) ### What this PR does / why we need it? Fix qwen torchair attention PrefillCacheHit ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? vLLM version: v0.10.1.1 vLLM main: `e599e2c65e` - vLLM version: main - vLLM main: `0b9a612fa3` Signed-off-by: zhaozixin <zhaozixin1@huawei.com> Co-authored-by: zhaozixin <zhaozixin1@huawei.com>	2025-09-11 14:26:59 +08:00
anon189Ty	7b2ecc1e9a	[Feat] Unquantized linear nz support (#2619 ) ### What this PR does / why we need it? Currently, when executing to the Linear layer of the model in vLLM-Ascend, the weights input format is ND in unquantized case and skipped ascend case, which is slower than FRACTAL_NZ. This PR supplements the execution logic for Linear layer. When VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and CANN version is 8.3, the weights of the Linear layer will be converted to FRACTAL_NZ, in both unquantized case and skipped ascend case. - vLLM version: main - vLLM main: `267c80d31f` Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-09-11 11:40:00 +08:00
liziyu	5691104249	LLMdatadist connector adapt the distributed KV aggregation (#2718 ) ### What this PR does / why we need it? LLMdatadist connector adapt the distributed KV aggregation for the main branch. Change the P node from returning "finish sending" only when TP0 responds to returning "finish sending" as soon as each NPU receives it. The D node will send a finish receive signal to the corresponding tp rank of the P node. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? gsm8k test 2*A3 1P 1D P: dp2 tp8 D:dp 4 tp4 P: dp2 tp8 D:dp 2 tp8 - vLLM version: main - vLLM main: `cc99baf14d` Signed-off-by: liziyu <liziyu16@huawei.com>	2025-09-11 11:37:41 +08:00
Mengqing Cao	c2fdd4b8bc	[CI/UT] Fix UTs on register customop and warm up model (#2862 ) ### What this PR does / why we need it? Fix UTs on register customop and warm up model ### How was this patch tested? CI passed with existing test. Co-authored-by: Icey <1790571317@qq.com> - vLLM version: main - vLLM main: `cc99baf14d` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-11 11:30:16 +08:00
lilinsiman	b7df04de9b	debug_aclgraph_sizes_capture (#2827 ) ### What this PR does / why we need it? 1. Fixed a problem in the Qwen3 MoE model case where enabling DP would use an extra stream, causing an ACLgraph sizes capture error. 2. Experiments showed that in many cases some operators occupy more streams than expected, so the stream buffer in ACLgraph was not large enough. After discussion, an extra 120 streams were added as a buffer. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: main - vLLM main: `0ae43dbf8c` Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-09-10 22:50:48 +08:00
huangxialu	88d7af62be	[main] adjust the position of warm_up_atb (#2823 ) ### What this PR does / why we need it? Adjust the position of warm_up_atb. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? CI passed with existing test. - vLLM version: main - vLLM main: `b23fb78623` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-09-10 14:06:38 +08:00
Li Wang	22b425765a	[Bugfix] Fix broken CI (#2825 ) ### What this PR does / why we need it? 1. Initial support disable tp for integrating with [vllm-commit](https://github.com/vllm-project/vllm/pull/23024) 2. [vllm@commit](https://github.com/vllm-project/vllm/pull/23673) now use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add the integration - vLLM version: main - vLLM main: `e40827280b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-10 13:29:29 +08:00
Icey	aa4d2a91ed	Refactor AscendMultiHeadLatentAttention (#2826 ) ### What this PR does / why we need it? Register AscendMultiHeadLatentAttention as CustomOP, following vllm changes ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: `b23fb78623` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-10 11:26:11 +08:00
CaranLic	168ad600b5	[main] add pd transfer for ascend scheduler (#2753 ) ### What this PR does / why we need it? For offline scenarios, adjust the scheduling process to prioritize the prefill phase of all requests, then process the decode phase of all requests. ### How was this patch tested? ``` max_num_seqs=24, additional_config={ "ascend_scheduler_config":{ "enabled": True, "enable_pd_transfer": True, "decode_max_num_seqs": 24, "enable_chunked_prefill": False } }, ``` \| input \| output \| num prompts \| max_num_seqs \| dp \| tp \| scheduler \| tps \| \| ------ \| ------ \| ---------- \| ---------------- \| ---- \| ---- \| ---------------- \| --------------- \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| v1 \| 234.06 \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| pd transfer \| 239.59(+2.4%) \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| v1 \| 222.85 \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| pd transfer \| 225.81(+1.3%) \| - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-09-10 08:46:39 +08:00
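The "prefill everything first, then decode" ordering described in the commit above can be sketched as a batch plan. This is only an illustration under simplifying assumptions: `pd_transfer_plan` and `chunk` are hypothetical helpers, KV cache limits are ignored, and a single decode round stands in for the whole decode phase. Only `max_num_seqs` and `decode_max_num_seqs` are names taken from the config shown in the commit.

```python
from typing import List, Tuple


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def pd_transfer_plan(
    request_ids: List[str], max_num_seqs: int, decode_max_num_seqs: int
) -> List[Tuple[str, List[str]]]:
    """Return batches in execution order: every prefill batch, then decode batches."""
    plan = [("prefill", batch) for batch in chunk(request_ids, max_num_seqs)]
    plan += [("decode", batch) for batch in chunk(request_ids, decode_max_num_seqs)]
    return plan


plan = pd_transfer_plan(["a", "b", "c"], max_num_seqs=2, decode_max_num_seqs=2)
phases = [phase for phase, _ in plan]
```

No decode batch is issued until the last prefill batch has run, which is the behavior the commit describes for offline scenarios.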
Mengqing Cao	edf1f600ad	[CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840 ) ### What this PR does / why we need it? Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 ### Does this PR introduce _any_ user-facing change? branch main of vllm-ascend will not be compatible with vllm v0.10.1 and v0.10.1.1 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-10 08:43:10 +08:00
sherie	93e28e6862	add weight transpose check. (#2756 ) ### What this PR does / why we need it? In reinforcement learning scenarios, weight updates are required, but the current inference applies a transpose operation to the weights, altering their shape. This causes a shape mismatch with the training weights, triggering an error during weight updates. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-09-09 20:33:43 +08:00
yiz-liu	e13c4ddb42	[Fix] Fix SharedFusedMoE (#2817 ) ### What this PR does / why we need it? Really strange that `register_oot` doesn't work with `SharedFusedMoE`, so we have to add this patch, for now. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? This PR won't have any effect in DeepSeek since we currently still stick with the old `CustomDeepseekV2`. - vLLM version: v0.10.1.1 - vLLM main: `0cdd213641` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-09 18:19:56 +08:00
rjg-lyh	7a205dbaa8	[main] Optimize rope in Qwen Models (#2571 ) ### What this PR does / why we need it? Optimize rope by caching sin and cos at the first layer in Qwen Models. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `562663a044` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: ZYang6263 <zy626375@gmail.com> Signed-off-by: rjg-lyh <1318825571@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: ZYang6263 <51255902183@stu.ecnu.edu.cn> Co-authored-by: ZYang6263 <zy626375@gmail.com>	2025-09-09 14:28:14 +08:00
rjg-lyh	1bbb20ea13	[main] flashcomm_v1 optim in Qwen Dense Models (#2802 ) ### What this PR does / why we need it? Flashcomm_v1 optim in Qwen Dense Models. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `5e537f45b4` Co-authored-by: 1024daniel <xxltju324@gmail.com>	2025-09-08 22:52:24 +08:00
zzzzwwjj	4df8df5b94	[bugfix] fix deepseek rope sincoscache re-generation (#2744 ) ### What this PR does / why we need it? The current implementation will result in duplicate generation of `sin_cos_cache` in rope when `kv_seqlen` > 4k, because the initialization length of the `sin_cos_cache` is only 4k. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? After this PR merged, sin_cos_cache will not increase in forward func, so `test_native_rope_deepseek_forward_cache_handling` is not necessary. - vLLM version: v0.10.1.1 - vLLM main: `60f0843ef8` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-09-08 22:03:34 +08:00
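The effect described in the commit above can be illustrated with a small cache that is rebuilt whenever a request exceeds its length. This is a hedged sketch, not the repository's implementation: `SinCosCache` and its rebuild policy are assumptions, and tiny lengths (4 and 8) stand in for 4k and the real maximum length.

```python
import math
from typing import List, Tuple


class SinCosCache:
    """Toy rotary sin/cos table that counts how often it is (re)built."""

    def __init__(self, rotary_dim: int, cache_len: int, base: float = 10000.0) -> None:
        self.inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]
        self.builds = 0
        self._build(cache_len)

    def _build(self, cache_len: int) -> None:
        self.cos = [[math.cos(pos * f) for f in self.inv_freq] for pos in range(cache_len)]
        self.sin = [[math.sin(pos * f) for f in self.inv_freq] for pos in range(cache_len)]
        self.cache_len = cache_len
        self.builds += 1

    def get(self, kv_seqlen: int) -> Tuple[List[List[float]], List[List[float]]]:
        # A request longer than the cached table forces a full regeneration.
        if kv_seqlen > self.cache_len:
            self._build(kv_seqlen)
        return self.cos[:kv_seqlen], self.sin[:kv_seqlen]


# Cache initialized too short: every longer decode step regenerates it.
short_cache = SinCosCache(rotary_dim=8, cache_len=4)
for kv_seqlen in (5, 6, 7):
    short_cache.get(kv_seqlen)

# Cache sized to the maximum length up front: built exactly once.
full_cache = SinCosCache(rotary_dim=8, cache_len=8)
for kv_seqlen in (5, 6, 7):
    full_cache.get(kv_seqlen)
```

In this toy setup `short_cache.builds` grows with every step past the initial length, while `full_cache.builds` stays at 1.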
wangxiyuan	7d6d9449a8	[Misc] Move lora patch file into lora module (#2797 ) Clean up a useless file in the patch module. Updating the lora support list in vLLM Ascend is sufficient, so there is no need to patch vLLM - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-08 21:42:12 +08:00
wangxiyuan	85d989a3b9	[Misc] Remove pangu model file (#2798 ) vllm-ascend won't contain model file anymore. Now pangu model file has been moved to torchair module. The origin one can be removed. Note: After this PR, pangu only works with torchair mode then. - vLLM version: v0.10.1.1 - vLLM main: `8c892b1831` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-08 21:30:37 +08:00
weichen	a041d4f328	[main] [refactor] refactor common_fused_moe.py (#2706 ) ### What this PR does / why we need it? 1. Move prepare/finalize operation from moe_comm_method to /ops/moe/fused_moe_prepare_and_finalize 2. Adapt to token_dispatcher in moe_comm_method 3. Move moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize to /ops/moe ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: weichen <calvin_zhu0210@outlook.com> Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-08 20:09:50 +08:00
machenglong2025	1a82b16355	Remove unused code in fused_moe.py (#2805 ) ### What this PR does / why we need it? line 408 already declared mc2_mask , remove duplicated unused code ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `60f0843ef8` Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>	2025-09-08 20:05:19 +08:00
22dimensions	d51694a77b	[2/N][Refactor][Quantization] clean quantization patch (#2785 ) ### What this PR does / why we need it? quantization patch is unused code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? tested by CI - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-09-08 17:31:53 +08:00
realliujiaxu	d3c3538ddc	[Bugfix]fix bug when graph_size is not divisible by tp_size (#2719 ) ### What this PR does / why we need it? fix https://github.com/vllm-project/vllm-ascend/issues/2702 - A2: skip graph_size update that makes it to tp_size because dispatch/combine op support different batch size across EP ranks - A3: add `max_num_reqs = max(new_graph_batch_sizes)` to fix graph_size and max_num_reqs mismatch ### Does this PR introduce _any_ user-facing change? Nope ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-09-08 14:52:33 +08:00
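The A3 fix in the commit above can be pictured as follows. The rounding step is an assumption made for illustration (the commit does not show how graph sizes are adjusted to `tp_size`), and `round_up` and `align_graph_batch_sizes` are hypothetical helpers. Only the final line, `max_num_reqs = max(new_graph_batch_sizes)`, comes from the commit message.

```python
from typing import List


def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return (value + multiple - 1) // multiple * multiple


def align_graph_batch_sizes(graph_batch_sizes: List[int], tp_size: int) -> List[int]:
    """Round every graph batch size up to a multiple of tp_size and deduplicate."""
    return sorted({round_up(size, tp_size) for size in graph_batch_sizes})


original_max_num_reqs = 20
new_graph_batch_sizes = align_graph_batch_sizes([1, 8, 20], tp_size=8)

# After alignment the largest graph size can exceed the original request limit,
# so the limit is updated to match, as the commit describes.
max_num_reqs = max(new_graph_batch_sizes)
```

Here the largest graph size becomes 24 while the original limit was 20, which is the kind of mismatch the commit says it resolves.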
TaoYu Chen	dd087effcc	Refactor prepare_inputs in model_runner_v1.py (#2750 ) ### What this PR does / why we need it? Refactor prepare_inputs in model_runner_v1.py to make it easier to read. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? PASS CI - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` --------- Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>	2025-09-08 10:45:23 +08:00
yiz-liu	c735bb0941	[Fix] Ensure metadata sync across DP ranks in eager mode (#2766 ) ### What this PR does / why we need it? Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Add a E2E online test case? - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-08 09:55:16 +08:00
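Why the flags in the commit above must be synchronized can be shown with a simulated reduction over data parallel ranks. This is a sketch under stated assumptions: `sync_dp_flags` is a hypothetical function, a Python list stands in for the collective, and the reduction rules (any rank prefilling forces `with_prefill`, every rank must agree before `enable_dbo` stays on) are assumptions rather than something the commit states.

```python
from typing import List, Tuple


def sync_dp_flags(per_rank_flags: List[Tuple[bool, bool]]) -> Tuple[bool, bool]:
    """Simulated all-reduce of (with_prefill, enable_dbo) across DP ranks."""
    with_prefill = any(flags[0] for flags in per_rank_flags)
    enable_dbo = all(flags[1] for flags in per_rank_flags)
    return with_prefill, enable_dbo


# Rank 1 is prefilling and rank 0 cannot use DBO; without a sync the ranks
# would act on different local values.
local_flags = [(False, False), (True, True)]
synced_with_prefill, synced_enable_dbo = sync_dp_flags(local_flags)
```

After the sync every rank sees the same pair of values, regardless of whether `enforce_eager` is set.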
sherie	2693196ef8	add gatherep select. (#2740 ) ### What this PR does / why we need it? add gatherep select. - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-09-08 09:15:50 +08:00
Marco Barletta	6666e5265d	Added support for KV connector v1 (#2039 ) ### What this PR does / why we need it? - This PR adds support for the KV connector interface in the V1 architecture, in the same way as vllm. vllm-ascend currently lacks this support, which is also required for layerwise management of KV caches. - The connector interface allows using external tools and integrating them with vllm ### Notes: We are aware of Issue #684; however, that issue does not modify the attention classes as needed to perform the layerwise management of KV caches required by connectors like LMCache. The implementation in this PR ports the necessary code from vanilla vllm. The KV connector API is the same as in vanilla vllm, supporting the standard KV connector API. EDIT: this PR was re-implementing part of the changes to model_runner_v1.py that were merged one hour before this PR was made. I resolved the conflicts by removing all modifications to the model_runner_v1 file, since those changes are now largely merged in main. This PR now only contains the modifications to the attention_v1 file. ### Does this PR introduce _any_ user-facing change? The PR does not modify current APIs, but it extends the behavior of the current worker runner and attention classes to save and load KV caches. In the absence of connectors, the behavior should stay unchanged. ### How was this patch tested? - No unit test implemented yet for the worker. - Tested together with LMCache using https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py with the following models: 1 Deepseek-R1-Distill-Qwen-1.5B 2 Qwen3-30B-A3B 3 Deepseek-v2-lite 4 Llama-3.1-8B LMCache used in both layerwise and non-layerwise mode. - Performed LMEval on LMCache integrated with vllm-ascend. 
Results without LMCache on Qwen3-8B:

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.8400|±|0.0101|
| | |strict-match|5|exact_match|↑|0.8355|±|0.0102|

Results with LMCache Layerwise:

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.8385|±|0.0101|
| | |strict-match|5|exact_match|↑|0.8332|±|0.0103|

- vLLM version: v0.10.1.1 - vLLM main: `50fede6634` --------- Signed-off-by: marcobarlo <barlettamarco8@gmail.com> Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>	2025-09-08 09:04:22 +08:00
yeyifan	b2f77d3aa8	[fix] prefill unsupport sliding window attention (#2758 ) ### What this PR does / why we need it? Fix a prefill attention bug where sliding window attention was not supported. npu_fused_infer_attention_score only supports head_dim equal to 128, not other values. ### Does this PR introduce _any_ user-facing change? Removes npu_fused_infer_attention_score from the prefill phase. ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` --------- Signed-off-by: nsdie <yeyifan@huawei.com>	2025-09-07 10:34:38 +08:00

500 Commits