xc-llm-ascend

Author	SHA1	Message	Date
zouyida2052	adadd50613	bugfix for mtp fullgraph (#3845 ) ### What this PR does / why we need it? bugfix for mtp fullgraph ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-29 23:50:13 +08:00
baxingpiaochong	d6ef3df3b3	[Bugfix]fix_mulit_connector_bug (#3332 ) ### What this PR does / why we need it? When using multi connector, the multi connector does not define get_finished_count, which will cause the kv cache to be released ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-10-29 23:23:06 +08:00
liziyu	07873d9396	fix mooncake layerwise connector (#3849 ) ### What this PR does / why we need it? fix a typo in mooncake layerwise connector. There is only `requests`, instead of `request` in `connector_metadata`. This pr fixes this typo - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-29 23:10:51 +08:00
Wang Yixuan	870a3f21cb	[BugFix] deepseek torchair adapt for torch_npu version (#3862 ) ### What this PR does / why we need it? Adapt to the torch_npu version to avoid a precision problem in torchair deepseek. Depending on the torch_npu version, op registration can take different branches: the rms_norm op has two branches selected by the version check. This PR unifies rms_norm in torchair by patching quant_rms_norm to rms_norm, which fixes the accuracy issue in the torchair scenario. - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-29 22:39:34 +08:00
realliujiaxu	74191864b7	[Perf] Delete redundant operations in model_runner and forward_context (#3677 ) ### What this PR does / why we need it? Remove redundant operations from `model_runner` and `forward_context`. This optimization can significantly reduce the idle time (bubble) before decoding when running models with small parameter counts (e.g., Qwen/Qwen2.5-0.5B). Testing on 800I A2, bubble is reduced from 3.8ms to 2.8ms : Before <img width="1655" height="696" alt="image" src="https://github.com/user-attachments/assets/d7608e52-2438-46dd-8fc9-391fd6274495" /> After <img width="1607" height="774" alt="image" src="https://github.com/user-attachments/assets/56daf081-2dba-4d2e-99d4-e055187d9806" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-10-29 15:59:55 +08:00
weichen	0d1859af08	[Bugfix] [MoE] fix error in deepseek when using allgather (#3824 ) ### What this PR does / why we need it? After refactoring vllm_ascend/models and FusedMoE, we are unable to pass `gate` from deepseekv2.py to `AscendFusedMoE.forward`, which will result in error when running deepseek v3/r1 with allgather. Hence, this pr removes `gate` related computations from FusedMoE module in eager/aclgraph mode. ### Does this PR introduce _any_ user-facing change? `rm_router_logits` is deprecated in eager/aclgraph. ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-29 14:51:39 +08:00
Mengqing Cao	900086fdc6	[HybridKV][Bugfix] Fix Hybrid kvcache sharing bug in same attention type (#3760 ) ### What this PR does / why we need it? Part of https://github.com/vllm-project/vllm-ascend/pull/3106 Fix Hybrid kvcache sharing bug in same attention type Change the `shared_by` logic so that the same attention spec could share the same buffer instead of allocating more hbm. After this pr, kvcache memory saved 50% in qwen3-next compared with before (`self_attn:linear_attn=1:3` in an `attn_group`), and `gpu_memory_utilization` could increase to `0.8` on Qwen3-Next when running on A2 64G/card with tp4 <img width="2833" height="1540" alt="image" src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Test pass with the latest e2e test case on qwen3-next - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-29 14:18:52 +08:00
XiaoxinWang	1e31b07fa7	fix qwen3next full graph break. (#3812 ) ### What this PR does / why we need it? Fix the qwen3next full graph break. Linear attention does not have an aclgraph_support attribute, so this PR switches to cudagraph_support for compatibility with vLLM. <img width="603" height="120" alt="image" src="https://github.com/user-attachments/assets/d2de53bb-4147-495a-9129-51d9083749be" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-10-29 10:30:23 +08:00
liziyu	c76db627ab	[P/D] force with_prefill true after allreduce in kv producer (#3768 ) ### What this PR does / why we need it? force with_prefill true after allreduce in kv producer - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-29 10:15:38 +08:00
pichangping	f57bdb09fc	[long_seq_optim] BSND to TND and FA_UPDATE replacement (#3778 ) ### What this PR does / why we need it? This PR optimizes long-sequence performance in two ways. First, it changes the input data format for attention calculation: instead of the original BSND format, it adopts the TND format directly and removes the logic for converting between TND and BSND. The TND input can be reused as-is, which shortens the data flow path; converting to BSND was an unnecessary processing step. Second, it replaces the output update built from concatenated small operators with the npu_attention_update fusion operator to improve performance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: pichangping <1337510399@qq.com>	2025-10-29 09:33:35 +08:00
ZYang6263	d08401d1e7	[Main][Bugfix]Avoid using the fusion operator in the MOE model (#3834 ) ### What this PR does / why we need it? The current MatmulReduceScatter operator experiences performance degradation in small-shape scenarios, so it determines whether to use this operator by judging the size of the shape. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-28 23:30:27 +08:00
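The shape-based gating described in d08401d1e7 can be sketched as below. This is only an illustration: the function name, the use of the token count as the "shape", and the cutoff value are all assumptions, since the commit message does not state the real criterion.

```python
# Hypothetical cutoff; the commit message does not state the real value.
SMALL_SHAPE_TOKEN_THRESHOLD = 256


def pick_matmul_path(num_tokens, threshold=SMALL_SHAPE_TOKEN_THRESHOLD):
    """Return which path to take for a matmul followed by reduce-scatter.

    Small inputs take the unfused path, because the commit reports that the
    fused operator degrades performance in small-shape scenarios.
    """
    if num_tokens < threshold:
        return "unfused"
    return "fused"
```

Keeping the decision in one small function makes the cutoff easy to retune after profiling on the target hardware.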
Icey	a7450db1bd	Upgrade to 0.11.1 newest vllm commit (#3762 ) ### What this PR does / why we need it? `c9461e05a4` Fix ```spec decode rejection sampler```, caused by https://github.com/vllm-project/vllm/pull/26060 Fix some ```import```, caused by https://github.com/vllm-project/vllm/pull/27374 Fix ```scheduler_config.send_delta_data```, caused by https://github.com/vllm-project/vllm-ascend/pull/3719 Fix ```init_with_cudagraph_sizes```, caused by https://github.com/vllm-project/vllm/pull/26016 Fix ```vl model```of replacing PatchEmbed's conv3d to linear layer, caused by https://github.com/vllm-project/vllm/pull/27418 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-10-28 14:55:03 +08:00
Levi	d64bdd06ae	【Bugfix】bugfix for weight load of kimi-k2 (#3798 ) Signed-off-by: Levi-JQ <yujinqi2@huawei.com> ### What this PR does / why we need it? Fix kimi-k2 start bug, weight load ERROR：https://github.com/vllm-project/vllm-ascend/issues/3785 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-10-27 21:18:35 +08:00
shiyuan680	00aa0bf33e	support prefill cache mode use fia op (#3696 ) ### What this PR does / why we need it? support prefill cache mode use fia op for full graph ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` origin ============ Serving Benchmark Result ============ Successful requests: 30 Maximum request concurrency: 256 Request rate configured (RPS): 0.70 Benchmark duration (s): 131.63 Total input tokens: 61363 Total generated tokens: 61440 Request throughput (req/s): 0.23 Output token throughput (tok/s): 466.77 Peak output token throughput (tok/s): 750.00 Peak concurrent requests: 30.00 Total Token throughput (tok/s): 932.95 ---------------Time to First Token---------------- Mean TTFT (ms): 125.17 Median TTFT (ms): 121.51 P50 TTFT (ms): 121.51 P90 TTFT (ms): 140.91 P99 TTFT (ms): 182.36 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 43.85 Median TPOT (ms): 43.84 P50 TPOT (ms): 43.84 P90 TPOT (ms): 44.28 P99 TPOT (ms): 44.32 ---------------Inter-token Latency---------------- Mean ITL (ms): 43.85 Median ITL (ms): 42.63 P50 ITL (ms): 42.63 P90 ITL (ms): 48.74 P99 ITL (ms): 59.62 ================================================== after ============ Serving Benchmark Result ============ Successful requests: 30 Maximum request concurrency: 256 Request rate configured (RPS): 0.70 Benchmark duration (s): 130.10 Total input tokens: 61363 Total generated tokens: 61440 Request throughput (req/s): 0.23 Output token throughput (tok/s): 472.26 Peak output token throughput (tok/s): 750.00 Peak concurrent requests: 30.00 Total Token throughput (tok/s): 943.94 ---------------Time to First Token---------------- Mean TTFT (ms): 123.69 Median TTFT (ms): 122.51 P50 TTFT (ms): 122.51 P90 TTFT (ms): 143.69 P99 TTFT (ms): 165.00 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 43.07 Median TPOT (ms): 43.13 P50 TPOT (ms): 43.13 P90 TPOT (ms): 43.50 P99 TPOT (ms): 43.57 ---------------Inter-token Latency---------------- Mean ITL (ms): 43.07 Median ITL (ms): 41.81 P50 ITL (ms): 41.81 P90 ITL (ms): 48.11 P99 ITL (ms): 62.13 ================================================== Signed-off-by: shiyuan680 <917935075@qq.com>	2025-10-27 19:41:07 +08:00
weiguihua2	4312a92a4f	[feat]dcp pcp support aclgraph (#3731 ) ### What this PR does / why we need it? dcp pcp support full aclgraph, including mla attention_v1 - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-10-27 09:58:23 +08:00
Yizhou	8ab8111fde	[Fix] Prevent memory leak in MLA decode graph (#3743 ) ### What this PR does / why we need it? The cache for MLA decode graph parameters was holding strong references to tensors, preventing them from being garbage collected and leading to increased memory usage. This change wraps the cached tensors in weak references, allowing them to be deallocated when no longer in use and reducing overall memory pressure. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 20:37:33 +08:00
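The weak-reference idea from 8ab8111fde can be sketched with the standard `weakref` module. This is a minimal illustration, not the project's code: `FakeTensor` and `GraphParamCache` are hypothetical stand-ins for the real tensors and the real parameter cache.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a tensor; any object that supports weakrefs."""

    def __init__(self, name):
        self.name = name


class GraphParamCache:
    """Hypothetical cache that remembers parameters without keeping them alive."""

    def __init__(self):
        self._refs = {}

    def put(self, key, tensor):
        # Store a weak reference so the cache alone cannot keep the tensor alive.
        self._refs[key] = weakref.ref(tensor)

    def get(self, key):
        ref = self._refs.get(key)
        # Calling a weak reference returns the object, or None once it is gone.
        return ref() if ref is not None else None


cache = GraphParamCache()
t = FakeTensor("decode_param")
cache.put("layer0", t)
alive_before = cache.get("layer0") is t

del t         # drop the last strong reference
gc.collect()  # CPython frees it on del; collect() covers other runtimes
alive_after = cache.get("layer0") is not None
```

Callers of such a cache have to handle a `None` result, since an entry can disappear as soon as nothing else holds the object.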
Icey	bb5f16d926	[BugFix] Fix Qwen3-next break (#3428 ) ### What this PR does / why we need it? Fix Qwen3NextGatedDeltaNet, caused by https://github.com/vllm-project/vllm/pull/26437 ### How was this patch tested? ``` def main(): prompts = [ "窗前明月光，", "The president of the United States is Mr.", "The capital of France is", "The future of AI is", "感时花溅泪，", "家书抵万金啥意思？", "plz tell me a story: ", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, enforce_eager=True, trust_remote_code=True, max_model_len=256, gpu_memory_utilization=0.7, block_size=64 ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Icey <1790571317@qq.com>	2025-10-25 18:03:36 +08:00
zzzzwwjj	e5676fc36e	[main] remove dbo code (#3712 ) ### What this PR does / why we need it? Remove codes of dbo. Currently, vLLM has supported dbo with pr: https://github.com/vllm-project/vllm/pull/23693. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-10-25 15:53:01 +08:00
Icey	d9cdc65854	Upgrade to new vllm commit (#3719 ) ### What this PR does / why we need it? Upgrade to new vllm commit: `c9461e05a4` - Fix many imports, caused by https://github.com/vllm-project/vllm/pull/26908 - Fix import ```sha256```, caused by https://github.com/vllm-project/vllm/pull/27169 - Remove ```SchedulerConfig.send_delta_data```, caused by https://github.com/vllm-project/vllm/pull/27142 - Fix ```FusedMoE``` because of dual stream execution, caused by https://github.com/vllm-project/vllm/pull/26440 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-10-25 15:36:32 +08:00
fems14	226f832c0b	[bugfixfix] correct _register function place for mooncacke (#3747 ) correct _register function place for mooncacke - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: fems14 <1804143737@qq.com>	2025-10-25 14:20:09 +08:00
Yizhou	1f25d60870	[Fix] Cap max tokens to prevent potential OOM (#3720 ) ### What this PR does / why we need it? Caps the calculated maximum number of tokens at 512. This prevents allocating an excessively large buffer when a cudagraph capture size is not specified, mitigating the risk of out-of-memory errors. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 11:23:21 +08:00
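The cap described in 1f25d60870 amounts to a `min` against 512 when no capture size is given. The sketch below assumes a hypothetical helper name and assumes that, when capture sizes are specified, the largest one is used; only the 512 cap comes from the commit message.

```python
MAX_TOKENS_CAP = 512  # the cap named in the commit message


def resolve_max_tokens(capture_sizes, computed_max_tokens):
    """Hypothetical helper: choose how many tokens to size a buffer for."""
    if capture_sizes:
        # Assumption: an explicit capture size list decides the buffer size.
        return max(capture_sizes)
    # No capture size specified: never allocate for more than the cap.
    return min(computed_max_tokens, MAX_TOKENS_CAP)
```

For example, `resolve_max_tokens(None, 8192)` yields 512 instead of sizing the buffer for 8192 tokens.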
weichen	63c363d3de	[Refactor] [MoE] Rename moe-related classes & files (#3646 ) ### What this PR does / why we need it? 1. Rename common_fused_moe.py to fused_moe.py. 2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize to prepare_finalize.py / PrepareAndFinalize. 3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe. 4. Move vllm_ascend/ops/fused_moe.py to vllm_ascend/ops/fused_moe/fused_moe.py ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-25 11:22:03 +08:00
lio	9e150e5009	[Refactor] optimize _prepare_inputs method in eagle_proposer (#3296 ) ### What this PR does / why we need it? We optimized the _prepare_input method in eagle_proposer and no longer use the _prepare_eagle_input_sequential method, improving the performance of eagle-3. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 13963 --dtype bfloat16 --model meta-llama/Llama-3.1-8B-Instruct --served-model-name Llama-3.1-8B-Instruct --tensor-parallel-size 1 --gpu-memory-utilization 0.85 --max-model-len 32768 --trust-remote-code --seed 42 --no-enable-prefix-caching --speculative_config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":2,"draft_tensor_parallel_size":1}' ``` Co-authored-by: QilaiZhang (245706640@qq.com ) - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: lio <1983142975@qq.com>	2025-10-25 09:49:42 +08:00
QilaiZhang	d30bb95b90	[Bugfix] Fix zero attention output in qwen3-next (#3572 ) ### What this PR does / why we need it? Since Attention and LinearAttention share the same ```slot_mapping```, and the ```slot_mapping``` for LinearAttention is all zeros, the ```slot_mapping``` for Attention gets overwritten, resulting in the computed output being all zeros. This PR removes the uniformly managed ```self.slot_mapping``` and directly passes the ```slot_mapping``` from ```input_batch.blocktable``` to ```attn_metadata```, along with modifying the relevant references. Due to hardware, the data type of ```block_table.slot_mapping``` needs to be set to int32. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: QilaiZhang <245706640@qq.com>	2025-10-25 09:47:03 +08:00
whx	e33751ef8b	[BugFix][Core] Fix a bug running multi-modal with ascend_scheduler (#3675 ) This PR fixes a bug in running multi-modal models with AscendScheduler. The bug was introduced by PR #2372, which used the same parameter names as vLLM but with different default values. For now, the fix changes the default values of these two parameters to align with vLLM. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-10-25 09:41:33 +08:00
hucong	292cf339c3	[BugFix][P/D] Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache (#3641 ) ### What this PR does / why we need it? Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: underfituu <hzhucong@163.com>	2025-10-25 09:14:20 +08:00
shaopeng-666	39b994a987	[Feat] Add mrope fusion op (#3708 ) ### What this PR does / why we need it? Add mrope fusion op for qwen2.5-vl. This mrope operator does not support Qwen3-VL currently, so it only takes effect for qwen2.5-vl. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com> Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>	2025-10-25 09:12:18 +08:00
Yizhou	3158742a97	[Refactor] Refactor Ascend attention implementation forward (#3714 ) ### What this PR does / why we need it? This PR refactors the Ascend attention implementation to align with vLLM's core interfaces, simplifying the code and improving maintainability. ### Key Changes: * Align with vLLM's Attention Interface: The `forward` method signature in `AscendAttentionBackendImpl` now matches the base `AttentionImpl` in vLLM, removing the custom `trace_flag`. * Enable Opaque Attention Operator: By adding `opaque_attention_op` to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its standard `vllm.unified_attention_with_output` operator. This avoids the need for a custom call path. * Remove Obsolete Code: * The custom op `vllm.unified_ascend_attention_with_output` has been deleted as it is now redundant. * The `trace_flag` and its associated logic were removed, reducing code complexity. * An outdated quantization branch within the attention implementation was cleaned up. * Improve Readability: Renamed output variables (`output` vs. `intermediate_output`) and added comments to clarify the in-place nature of the attention output. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? No extra tests needed. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 08:58:35 +08:00
ZYang6263	0b1da24742	[Main][Perf] Add fused matmul/reduce-scatter kernel for performance optimization. (#3693 ) ### What this PR does / why we need it? This PR boosts performance by introducing a fused kernel for the matrix matmul and reduce scatter operations. It supports both unquantized (e.g., BFloat16) and W8A8 quantized models. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-24 18:19:58 +08:00
fems14	82a4970fe9	look up multi_tp key (#3699 ) ### What this PR does / why we need it? In multi-Tensor Parallel (TP) scenarios, the KV pool only queries the first GPU card. When keys on other cards are released, the query result still returns as successful, introducing accuracy issues. This PR modifies the KV pool's query logic to check all cards, resolving this problem. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 17:23:36 +08:00
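The lookup change in 82a4970fe9 can be illustrated with plain sets standing in for the per-card key stores. The function names and the data layout are hypothetical; the sketch only shows why checking the first card alone can report a false hit.

```python
def lookup_first_rank_only(pools, key):
    """Old behavior as described: only the first card is consulted."""
    return key in pools[0]


def lookup_all_ranks(pools, key):
    """New behavior as described: the key must be present on every card."""
    return all(key in pool for pool in pools)


# Hypothetical state: the key was released on the second card only.
pools = [{"blk-1", "blk-2"}, {"blk-1"}]

stale_hit = lookup_first_rank_only(pools, "blk-2")
correct_miss = lookup_all_ranks(pools, "blk-2")
```

Here the first-card check still reports a hit for `blk-2`, while the all-cards check correctly reports a miss.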
fems14	c83efcb9e4	kvpool sync load (#3698 ) ### What this PR does / why we need it? In certain scenarios, the performance of synchronously loading data from the pool is better than that of asynchronously loading data. Therefore, a control logic (or switch) for asynchronous loading from the pool has been added. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 17:22:53 +08:00
何必问	59bb16b75c	[Bugfix] The server fails to locate the request, leading to the server hanging. (#3703 ) ### What this PR does / why we need it? Fix a bug: in the mooncake pooling scenario, when the client closes the request, the server fails to locate the request, leading to the server hanging. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Bring up the PD-separated pooling service, send requests using aisbench, press CTRL+C twice, and check whether the vllm_ascend service exits. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linhebiwen <linhebiwen@gmail.com>	2025-10-24 17:18:03 +08:00
offline893	9b0baa1182	[BugFix] Check all expert maps when using muilty instance. (#3576 ) ### What this PR does / why we need it? Check all expert maps when using multiple instances. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Qwen 235B on double A3. Case 1: master has an expert map, slave has no expert map. Case 2: master has an expert map, slave has an erroneous expert map. Case 3: master has an expert map, slave has a correct expert map. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-24 17:10:14 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is step 1 of refactoring the code to adapt to vllm main, and this PR is aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. a batch of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import, caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` import, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` import, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured output break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
zzzzwwjj	6be321b95e	remove useless code (#3685 ) ### What this PR does / why we need it? `vanilla_chunked_prefill_mla` and `vanilla_decode_mla` are unused, so this PR removes them. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-10-24 16:29:08 +08:00
whx	1b270a64bd	[MoE][Multistream] Avoid performing communication in extra stream. (#3582 ) This PR moves the communication operation of shared experts out of extra stream because I found that this might cause rtMemcpy related errors when running shared experts multistream with aclgraph. Furthermore, I utilize a global variable as extra stream object to avoid allocating streams for each layer in full-graph mode. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-24 10:44:38 +08:00
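The "global variable as extra stream object" mentioned in 1b270a64bd follows a lazily created module-level singleton pattern. The sketch below is an assumption-laden illustration: `get_shared_extra_stream`, `Layer`, and `make_fake_stream` are hypothetical names, and a plain `object()` stands in for a device stream.

```python
_shared_extra_stream = None  # module-level slot shared by every layer


def get_shared_extra_stream(make_stream):
    """Create the extra stream on first use, then keep returning the same one."""
    global _shared_extra_stream
    if _shared_extra_stream is None:
        _shared_extra_stream = make_stream()
    return _shared_extra_stream


created = []


def make_fake_stream():
    # Hypothetical factory; a real one would allocate a device stream.
    stream = object()
    created.append(stream)
    return stream


class Layer:
    """Hypothetical layer that needs a side stream for shared experts."""

    def __init__(self, make_stream):
        self.stream = get_shared_extra_stream(make_stream)


layers = [Layer(make_fake_stream) for _ in range(4)]
```

All four layers end up holding the same object, so only one stream is ever allocated regardless of model depth.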
LookAround0301	b54d44e664	support cp&dcp (#3260 ) ### What this PR does / why we need it? This PR adds the Prefill Context Parallelism (PCP) feature, which corresponds to DCP. For specific implementation details, please refer to the RFC https://github.com/vllm-project/vllm/issues/25749. TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage. ### Does this PR introduce _any_ user-facing change? The current implementation primarily includes the following changes: Modified ModelRunner.py for CP partitioning logic for tokens; Modified attention_v1.py and mla_v1.py to adapt the GQA/MLA backend to PCP. Modified block_tables.py to extend the KV cache storage based on DCP&PCP; Added necessary command-line arguments to control parallelism for PCP; ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: chenjie <chenjie137@huawei.com> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com> Signed-off-by: Feng Liu <liufeng248@huawei.com> Signed-off-by: gaojc <1055866782@qq.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Signed-off-by: z50049692 <zhangmingwei11@huawei.com> Co-authored-by: chenjie <chenjie137@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: zhangsicheng5 <zhangsicheng5@huawei.com> Co-authored-by: Feng Liu <liufeng248@huawei.com> Co-authored-by: gaojc <1055866782@qq.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: z50049692 <zhangmingwei11@huawei.com> Co-authored-by: w00896881 <wangzixuan40@huawei.com>	2025-10-24 10:32:01 +08:00
fems14	2bcadcb9d5	【main】patch sched_yield (#3648 ) ### What this PR does / why we need it? On Arm systems, os.sched_yield() does not take effect, causing the GIL (Global Interpreter Lock) to remain unrelinquished and resulting in CPU bound issues. This PR applies a patch to sched_yield in vLLM, making the process execute time.sleep(0) instead to release the GIL. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 00:06:45 +08:00
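The patch described in 2bcadcb9d5 swaps a yield function for one that calls `time.sleep(0)`. The sketch below shows that kind of replacement on a stand-in namespace; `target` is hypothetical and does not name the actual vLLM module or attribute that the commit patches.

```python
import time
import types

# Hypothetical stand-in for the module that exposes a sched_yield function.
target = types.SimpleNamespace(sched_yield=lambda: None)


def sleep_based_yield():
    """Replacement yield: per the commit, sleeping for 0 s releases the GIL."""
    time.sleep(0)


# Monkey-patch: later callers of target.sched_yield get the new behavior.
target.sched_yield = sleep_based_yield
result = target.sched_yield()
```

Because the replacement keeps the same zero-argument signature, existing call sites need no changes.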
Wang Yixuan	a7b40b09eb	[BugFix]fix deepseek torchair recompile (#3678 ) ### What this PR does / why we need it? PR #3624 fixed the precision of deepseek torchair but did not account for a torch compile limitation, which results in recompilation. This PR fixes that problem. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-23 22:53:01 +08:00
Slightwind	3366d47694	[main][bugfix] Add 'layer_type' param to get_pergroup_param() for compatibility (#3682 ) Resolves a `TypeError: got an unexpected keyword argument 'layer_type'`. A recent change (PR #3311) started passing the `layer_type` argument when calling `get_pergroup_param()`. This specific implementation does not use this parameter, causing the error. This patch adds `layer_type=None` to the method signature to maintain API compatibility and ignore the unused argument. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-10-23 21:26:33 +08:00
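The compatibility fix in 3366d47694 can be reproduced in miniature. The two classes and the positional parameters below are hypothetical; only the method name `get_pergroup_param` and the added `layer_type=None` keyword come from the commit message.

```python
class OldMethod:
    """Hypothetical implementation before the fix: no layer_type parameter."""

    def get_pergroup_param(self, input_size, output_size):
        return {}


class FixedMethod:
    """Hypothetical implementation after the fix."""

    def get_pergroup_param(self, input_size, output_size, layer_type=None):
        # layer_type is accepted for API compatibility and deliberately ignored.
        return {}


# A caller that passes layer_type breaks the old signature...
try:
    OldMethod().get_pergroup_param(128, 256, layer_type="row")
    old_error = None
except TypeError as exc:
    old_error = str(exc)

# ...but works with the fixed one.
fixed_result = FixedMethod().get_pergroup_param(128, 256, layer_type="row")
```

Defaulting the new keyword to `None` keeps older callers that omit it working as well.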
liziyu	aeddf4261a	[Bugfix] fix delay free prefill req & D node support prefix cache (#3607 ) ### What this PR does / why we need it? Fix mooncake connector. In scenarios where TP is not equal, when the prefill TP size is less than the number of key-value heads, _get_remote_tp_ranks_for_req will return a list of np.arrays. Performing an operation like int in list of np.arrays will cause an error. Converting the list of np.arrays into a single np.array resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen235B P tp16, D tp1 P tp8, D tp1 P tp4, D tp1 P tp8, D tp2 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-10-23 20:39:14 +08:00
Shanshan Shen	e3c1ac89e5	[Structured Output] Replace `apply_grammar_bitmask()` method with that in vllm to avoid maintenance (#2524 ) ### What this PR does / why we need it? Replace `apply_grammar_bitmask()` method with that in vllm to avoid maintenance. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: shen-shanshan <467638484@qq.com>	2025-10-23 17:26:27 +08:00
Rui Kang	427b17e2da	[Misc] Add a model loader that utilizes HCCL for weight loading (#2888 ) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? 
Netloader is activated through the existing --load-format=netloader and --model-loader-extra-config options. The --model-loader-extra-config value must be passed as a JSON string (as it is now). Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>	2025-10-23 15:56:07 +08:00
NeverRaR	807686dec9	perf : optimize memory for deepseek mtp (#2713 ) ### What this PR does / why we need it? delete the temp tensor to optimize memory for deepseek mtp for torchair case - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: boying <897013703@qq.com>	2025-10-23 15:52:17 +08:00
Wang Yixuan	2584f97217	[BugFix] fix deepseek torchair precision (#3624 ) ### What this PR does / why we need it? The precision of deepseek torchair was broken by #3465, which is due to the original patch of rmsnorm in torchair. This PR fixes the precision of deepseek torchair. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-23 15:41:50 +08:00
rjg-lyh	292e213dd2	[main][refactor] refactor SequenceRowParallelOp forward (#3616 ) ### What this PR does / why we need it? This PR refactors SequenceRowParallelOp forward. In order to further expand the operator inclusion scope in dynamic judgment scenarios, this PR customizes the entire matmul computation and communication as a custom operator masking. With this refactor, it will support directly writing code such as common operation fusion into the `SequenceRowParallelOp` class's member function `matmul_and_reduce`, without the need to register more redundant custom masking operators. ### How was this patch tested? CI passed with existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-10-23 14:41:15 +08:00
Ruri	dd7a25063c	[Feat] Prefetching Attention QKV Linear Weight With `AddRmsNormQuant` Custom Op (#3517 ) ### What this PR does / why we need it? - `qkv_proj.weight` prefetching has been implemented with the `Quant` op, so when `AddRmsNormQuant` is enabled (#3465), `qkv_proj.weight` prefetching won't work - Implement `qkv_proj.weight` prefetching with `AddRmsNormQuant` ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Tested on `Qwen3-235B-A22B-W8A8` <img width="1868" height="109" alt="image" src="https://github.com/user-attachments/assets/0bc28082-0287-4d5c-b8f6-f907c3134d36" /> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>	2025-10-23 10:07:37 +08:00
whx	72695c97d0	[BugFix][main] Fix quantization related mtp bug with patch (#3620 ) vLLM 0.11.0 does not include PR (https://github.com/vllm-project/vllm/pull/25805), so the prefix of mtp's SharedHead is missing. This PR fixes this bug with a patch to vllm's deepseek_mtp. main also needs this bugfix to support vLLM v0.11.0 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-23 09:54:31 +08:00
Yizhou	4381d296e5	[Fix] Fix attention metadata handling for profiling and MLA (#3636 ) ### What this PR does / why we need it? Move the creation of dummy attention metadata to occur after the ACL graph runtime mode is determined. This ensures the metadata is initialized with the correct configuration during a profile run. Additionally, remove the `attn_metadata` existence check before updating MLA attention parameters. This change prevents the update from being skipped when metadata is not yet available, ensuring parameters are set correctly. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-23 09:35:18 +08:00
Yizhou	b13d22bf5a	[Fix] Fixes attribute error in MLA implementation (#3618 ) ### What this PR does / why we need it? Corrects the attribute access for retrieving the device from `q_a_proj` to `q_proj`. This prevents an `AttributeError` as `q_a_proj` does not exist on the class instance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Need MLAPO tests. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-23 09:12:50 +08:00

941 Commits