xc-llm-ascend

Author	SHA1	Message	Date
pppeng	9a0b786f2b	[bugfix][0.18.0] Fix race in non-blocking num_accepted_tokens (#8764 ) ### What this PR does / why we need it? The same fix from https://github.com/vllm-project/vllm/pull/36013. In _update_states_after_model_execute, num_accepted_tokens is copied from GPU to pinned CPU memory with non_blocking=True. The CPU-side numpy view is later read in _build_attention_metadata during the next execute_model call. With async scheduling, _bookkeeping_sync deliberately avoids any CUDA synchronization (the whole point of async scheduling), so there is no guarantee the DMA has landed before the CPU read. Signed-off-by: ppppeng <zepengliu912@qq.com>	2026-04-27 23:28:52 +08:00
wangbj127	e6ba5a88f7	[v0.18.0][BugFix] Fix Qwen3.5 MoE FC1 error under high concurrency when dp>1 (#8395 ) ### What this PR does / why we need it? GDN Attention uses FIA's query_start_loc (padded), which may cause conv1d update errors under high concurrency when dp > 1, and this PR is to make GDN use its own query_start_loc (unpadded). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.18.0 Signed-off-by: Wangbingjie <wangbj1207@126.com>	2026-04-20 10:26:19 +08:00
wangbj127	6bdc72949b	Revert "[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded" (#8413 ) Reverts vllm-project/vllm-ascend#8133 - Reversion of Logic: This pull request reverts the changes introduced in a previous commit that attempted to handle dimension mismatches during SP padding. Signed-off-by: Wangbingjie <wangbj1207@126.com>	2026-04-18 20:43:42 +08:00
1kzk	c995a959e6	[BugFix] fix hang in async scheduling while open ENPU (#8354 ) ### What this PR does / why we need it? 1. there is no synchronization between steps. However, in async scheduling with aclgraph, it is possible that the CPU's record event for the current iteration completes before the previous iteration's graph execution has finished. If cpu is fast enough, device will hang on event_wait in interation i+1 (assume that event_record is executed immediately on update stream of device). 2. Under ENPU, eagle proposers also need to follow event.record first, and then event.Wait. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? --------- Signed-off-by: 1zzk <785396250@qq.com>	2026-04-18 00:07:15 +08:00
wangbj127	f2956ce944	[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded (#8133 ) Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/7858 ### What this PR does / why we need it? This PR fixes a `RuntimeError` (dimension mismatch) that occurs when Sequence Parallelism (SP) is enabled and the padding added for SP causes `num_tokens_padded` to differ from `num_tokens_unpadded`. In such cases, `_pad_query_start_loc_for_fia` adds a dummy request, increasing `num_reqs_padded`. This mismatch between the actual number of requests and the padded number of requests leads to errors in downstream token count computations (e.g., `compute_num_computed_tokens`). The fix modifies the restrictive condition `num_tokens_padded == num_tokens_unpadded` when reverting the dummy request padding if SP is enabled, as SP padding is handled by stripping it after communication and should not be treated as an additional request in the attention metadata. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? vLLM version: v0.18.0 vLLM-Ascend version: releases/v0.18.0 Signed-off-by: Wangbj127 <wangbj1207@126.com>	2026-04-17 22:50:22 +08:00
Zetong Li	b72ade9acd	[0.18.0][BugFix] Update capture sizes after rounding operations (#8380 ) ### What this PR does / why we need it? This PR is partially cherry-picked from #8172. This PR aims to fix mismatched capture sizes after rounding operations when using sp or speculative. The reason is that original `self.cudagraph_capture_sizes` is no longer updated and remains as the initial sizes. Now we use `self.cudagraph_dispatcher.get_capture_descs` to the get up-to-date sizes. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci Signed-off-by: Zetong Li <slippersss@126.com>	2026-04-17 22:46:16 +08:00
1kzk	52f0f9b5e4	[0.18.0][BugFix]: order acl graph updates before model forward for ENPU (#8317 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> For the ENPU scenario, it is required that device events follow the principle of "record first, wait later", otherwise the inference process may become stuck. However, in the current model_forward function, event.wait precedes event.record. Therefore, for the ENPU scenario, graph parameter updates should be performed before model execution. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> N/A ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: 1zzk <785396250@qq.com> Signed-off-by: 1kzk <785396250@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-04-16 16:26:59 +08:00
Zetong Li	b6aa5bbdbf	[0.18.0][BugFix] Add PrefillNoCache state in mla _forward_decode for short prompt (#8264 ) ### What this PR does / why we need it? This PR is cherry-pick from #8263. This PR aims to fix short prompt problem. The root cause can be found in #8029. Since the previous pr may miss mixed long and short prompt batch, after discussion, we decide to add PrefillNoCache state in mla _forward_decode now instead. Signed-off-by: Zetong Li <slippersss@126.com>	2026-04-15 09:23:52 +08:00
Yaphets24	13c7392416	[BugFix] fix dsv3.1 service failed to start (#8207 ) ### What this PR does / why we need it? This PR fixes a service startup failure for DeepSeek-V3.1 models by removing a strict type assertion for `MLAAttentionSpec` in `NPUModelRunner.get_kv_cache_spec`. The assertion was failing due to class identity mismatches caused by the runtime patching of `MLAAttentionSpec` with `AscendMLAAttentionSpec`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified that the service starts correctly for DSV3.1 models. Signed-off-by: mayumeng <m30059191@china.huawei.com> Co-authored-by: mayumeng <m30059191@china.huawei.com>	2026-04-14 17:52:55 +08:00
Zetong Li	054fde7b72	[0.18.0][BugFix] Fix attention state of short prompt for correct forwarding (#8088 ) ### What this PR does / why we need it? This PR is cherry-pick from #8029. This PR aims to fix attention state of short prompt for correct forwarding. Since a batch of short prompts (prefill tokens less than or equal to num_spec_tokens + 1) will be treated as decode requests (by split_decodes_and_prefills), its original PrefillNoCache attention state contradicts. Thus these short prompts will be passed into a mismatched branch and incur errors. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci Signed-off-by: Zetong Li <slippersss@126.com>	2026-04-09 21:21:24 +08:00
linfeng-yuan	7c9aa498d6	[releases/v0.18.0][BugFix] Restore global_bs=0 and mc2_mask for uniform-token dispatching and support inter-node roce hierarchical MC2 communication (#8040 ) ### What this PR does / why we need it? Cherry-picked from #8039 Restore the setting of MC2 `global_bs` and `mc2_mask` handling when `all_reduce` across DP group cannot be skipped. Ascend MC2 ops require `global_bs=0` + `mc2_mask` while enabling inter-node roce hierarchical communication. PR #4983 always passed non-zero `global_bs` without `mc2_mask`, which is incompatible with hierarchy comm raised in PR #7583 Changes: - Add `should_skip_allreduce_across_dp_group()` to `utils.py` with hierarchy constraint - Set `global_bs=0` when allreduce is not skipped; pass `mc2_mask` accordingly - Add `mc2_mask` field to `MoEMC2CombineMetadata` for dispatch→combine propagation ### Does this PR introduce _any_ user-facing change? No. But this PR fixes cross-super-node communication function on A3 with `enable_mc2_hierarchy_comm=True` in `additional_config` and `export HCCL_INTRA_ROCE_ENABLE=1`. ### How was this patch tested? E2E serving succeeded and CI pssed. - vLLM version: v0.18.0 - vLLM main: `14acf429ac` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-04-09 16:51:17 +08:00
pz1116	1225c613fb	[BugFix][0.18.0][KV Pool] Fix KV Pool not putting kv cache for vllm v0.18.0 (#7874 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? vLLM v0.18 defers KV connector finalization during target-modelforward when speculative decoding is enable, leading to KV Pool not doing Put Operation. This change is forgotten when we bumpped up the version for vllm-ascend. Fix by adding finalize_kv_connector for spec decode. <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: DreamerLeader <2270923832@qq.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-04-02 10:57:09 +08:00
Yang Yuxi	e776d5c0f1	[Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726 ) ### What this PR does / why we need it? This PR backports the changes from #7673 ([Bugfix] support FlashComm1 & DCP for Qwen) to the releases/v0.18.0 branch. -------- Signed-off-by: Yang Yuxi <907276627@qq.com>	2026-03-29 15:59:19 +08:00
Wang Kunpeng	5df2ddd8db	[v0.18.0][Bugfix]Fix Error "AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute 'enabling_fa_quant'" (#7748 ) ### What this PR does / why we need it? cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736 Error information When the quantized weights in CompressedTensors format of the kimi-k2 model are used, the following error is reported: `AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute 'enabling_fa_quant'` Error Cause Currently, FA3 quantization supports only the weights of modelslim quantization. The added methods are not defined in AscendCompressedTensorsConfig. Solution Before invoking related methods, check whether the FA3 feature is enabled. Additionally, the unused `get_scaled_act_names` method and its corresponding unit test have been removed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests were updated by removing a deprecated test case, and the refactored logic was reviewed for correctness. Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-03-28 17:03:56 +08:00
linfeng-yuan	ab619e1c53	[0.18.0][profiler] profile AICore and MTE time with torch profiler (#7730 ) ### What this PR does / why we need it? This pull request updates the NPU profiler configuration in the Ascend worker by changing the `aic_metrics` parameter from `AiCoreNone` to `PipeUtilization`. This change enables the collection of pipe utilization metrics during profiling sessions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-27 16:37:54 +08:00
zhangxinyuehfad	d781902ce9	[v0.18.0][CI] Fix releases/v0.18.0 ci test only support vllm v0.18.0 (#7686 ) ### What this PR does / why we need it? Fix releases/v0.18.0 ci test only support vllm v0.18.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-26 18:36:04 +08:00
Wangbei25	dd55736ee4	fix uncompatible between fc1 and non-sp-padding (#7643 ) cherry pick https://github.com/vllm-project/vllm-ascend/pull/7614 ### What this PR does / why we need it? fix uncompatible between fc1 and non-sp-padding After PR [non-sp-padding](https://github.com/vllm-project/vllm-ascend/pull/7297), kimi2.5 open flashcomm1 will raise an error : The expanded size of the tensor do not match the existing size at non-singleton dimension 0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM-Ascend main: `9976e685b7` Signed-off-by: Wangbei25 <wangbei41@huawie.com> Co-authored-by: Wangbei25 <wangbei41@huawie.com>	2026-03-25 23:23:37 +08:00
Ronald	d96440924a	adapt to main2main for model runner v2 (#7578 ) ### What this PR does / why we need it? This PR aims to adapt to newest commit of vllm main branch for model runner v2. please refer to https://github.com/vllm-project/vllm-ascend/issues/5208 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-25 09:08:44 +08:00
lhp-deep	0e3186f07c	[model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575 ) ### What this PR does / why we need it? This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to improve performance. The key changes include: - A new Triton kernel implementation (`_compute_slot_mappings_kernel`) with NPU-specific optimizations, such as using `tl.gather` to handle non-contiguous memory access and replacing modulo operations. - A new method `compute_slot_mappings` in `AscendBlockTables` to use this new kernel. - An end-to-end test to verify the correctness of the new kernel against the reference GPU implementation. The optimization is needed to avoid performance degradation from scalar computation on Ascend devices. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-03-24 17:29:14 +08:00
realliujiaxu	5d12446573	[Feat][SP] Suport SP for VL MoE models (#7044 ) ### What this PR does / why we need it? 2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712, extend SP to VL MoE models. ### Does this PR introduce _any_ user-facing change? remove `sp_threshold` in additional config and reuse `sp_min_token_num` from vLLM. ### How was this patch tested? - Model: Qwen3-VL-30B-A3B, - TP4 DP2 - 100 reqs - max concurrency 1 \| Seq length \| Mean TTFT (ms) main \| Mean TTFT (ms) this PR \| \|------------\|---------------------\|------------------------\| \| 4k \| 429.40 \| 323.3 \| \| 16k \| 1297.01 \| 911.74 \| - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-24 17:16:00 +08:00
zxr2333	67aad1fce8	[BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506 ) ### What this PR does / why we need it? 1. When the FullGraph mode is used, the branches in the Triton operator are compiled and fixed during the graph capture process, causing the branch condition in the `fused_recurrent_gated_delta_rule` operator, which checks whether `ssm_state_indices >= 0` before writing to the SSM cache, to become invalid. Now, the write operation is performed regardless of the value. This results in the operator performing address offset calculations and writing to the SSM cache based on the -1 offset after -1 is used for padding in vLLM GDN backend. Since the conv cache and SSM cache in vLLM Ascend implementation are actually a single continuous tensor divided into two parts, this leads to data overwriting and the generation of NaN values. This PR addresses two cases where padding -1 is required in the GDN metadata builder. The same logic is used to replace the padding with 0 to avoid the problem of memory overwriting, because block 0 is a reserved block. 2. Fix layerwise connector bug for mamba cache sending on heterogeneous TP. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-24 15:15:55 +08:00
weijinqian0	bdd90c0088	[model_runner_v2]optimize the performance of the post_update. (#7496 ) ### What this PR does / why we need it? - This PR aims to enhance the operator performance in the `post_update` phase of `model_runner_v2` on NPUs. By optimizing the relevant operations, it is expected to improve the overall efficiency and speed of the model running on NPU hardware, which is crucial for scenarios where high-performance inference is required. - when bs = 256, time cost reduce from 26us to 11 us; ### Does this PR introduce _any_ user-facing change? No, there are no changes to the API, interface, or other high-level behaviors that would directly affect the user's code or interaction with the system beyond the performance improvement. ### How was this patch tested? CI passed with new added/existing tests. In addition to the regular CI tests, specific benchmark tests were conducted on NPU hardware to measure the performance improvement of the `post_update` operators. --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2026-03-23 20:29:55 +08:00
Levi	9976e685b7	[Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297 ) ### What this PR does / why we need it? Fix multi dp padding logic for eager mode, bacause its will cause rank0 load imbalance in kimi-k2.5-w4a8 with the all the padding tokens router to rank0. And the fix can also apply to other model in multi dp. - before hbm usage： <img width="2229" height="733" alt="image" src="https://github.com/user-attachments/assets/50479b6d-cfd0-4206-8e80-974024652997" /> preformance： ```shell Concurrency NumPrompts QPS TTFT_Avg TTFT_P50 TPOT_Avg TPOT_P50 TPOT_P90 ============ ============ ============ ============ ============ ============ ============ ============ 1 15 0.0179 1667.7803 1673.3437 35.2973 35.2775 35.3784 32 480 0.4725 2764.8027 1905.2137 40.8030 40.6978 41.0179 64 960 0.7820 4123.7096 3485.6153 48.0461 48.1598 48.2971 100 1500 1.0852 6216.7988 5714.0082 52.9323 53.0613 54.6304 108 1620 1.1040 6277.4892 5798.7425 56.3862 56.9224 57.2901 116 1740 1.1680 6563.3293 6039.5659 56.9894 57.4027 57.5786 128 1920 1.2555 7822.5551 7604.1662 57.7660 58.1768 58.2717 192 2880 1.4314 9212.1953 9131.3461 58.9905 59.1683 59.2791 256 3840 1.4480 9028.0812 8913.7937 59.0092 59.2385 59.3516 ``` - after hbm usage： <img width="2246" height="1005" alt="image" src="https://github.com/user-attachments/assets/d0936481-5a58-4bc5-a6f1-b92735d47885" /> preformance： ```shell Concurrency NumPrompts QPS TTFT_Avg TTFT_P50 TPOT_Avg TPOT_P50 TPOT_P90 ============ ============ ============ ============ ============ ============ ============ ============ 1 15 0.0181 601.4171 600.9774 35.6270 35.6254 35.6480 32 480 0.4455 720.8782 724.2889 45.4250 45.4755 45.6318 64 960 0.8445 729.6209 728.2149 47.0464 47.0896 47.1985 100 1500 1.2601 723.4834 724.6673 48.3108 48.3844 48.5355 108 1620 1.3409 727.1509 720.6772 48.8962 48.9409 49.0489 116 1740 1.4080 679.9799 677.6119 49.1253 49.1983 49.3087 128 1920 1.4155 680.6284 674.9436 49.2193 49.2450 49.3763 192 2880 1.4422 684.6577 676.7833 49.2059 49.2264 49.3229 256 3840 1.4558 685.2462 678.1709 49.2191 49.2351 49.3419 ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: fny-coder <985619145@qq.com>	2026-03-23 17:05:02 +08:00
Shanshan Shen	5c0d02f689	[Bugfix] Fix multi-instance serving OOM on single card (#7427 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/7308. Subtracting `init_non_torch_memory` (maybe used by the first instance) from the total `non_torch_memory` when calculating `available_kv_cache_memory`. Directly use `non_torch_memory_increase` (contained in `non_kv_cache_memory`) to calculate `available_kv_cache_memory`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Launch tow vllm-ascend instances sequentially on single card. ```bash # Launch first instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8100 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager # Launch second instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8101 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager ``` Before this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340388298034668 GiB init_non_torch_memory: 0.3616676330566406 GiB non_torch_memory_before_empty_cache: 0.3896217346191406 GiB non_torch_memory_increase: 0.0279541015625 GiB non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2336344718933105 GiB init_non_torch_memory: 18.37220001220703 GiB non_torch_memory_before_empty_cache: 18.399906158447266 GiB non_torch_memory_increase: 0.02754974365234375 GiB non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: -1.32 GiB ``` After this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340540885925293 GiB init_non_torch_memory: 0.36182403564453125 GiB non_torch_memory_before_empty_cache: 0.38979339599609375 GiB non_torch_memory_increase: 0.0279693603515625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.233344554901123 GiB init_non_torch_memory: 18.74309539794922 GiB non_torch_memory_before_empty_cache: 18.770355224609375 GiB non_torch_memory_increase: 0.02725982666015625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: 17.05 GiB ``` - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2026-03-23 14:22:59 +08:00
XiaoxinWang	cbf46fad3c	fixed graph mode bug. (#7460 ) ### What this PR does / why we need it? In fulldecodeonly mode, num_req_padded was set to an incorrect value, causing accuracy degradation in Qwen3-Next. Therefore, we added a check for compilation_config.cudagraph_mode to the conditional logic, ensuring that padding is applied only in FULL mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8a680463fa` Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2026-03-22 10:09:37 +08:00
Zetong Li	84a74f0cb1	[Bugfix] Fix padding logic in eagle proposer for kimi25 (#7348 ) ### What this PR does / why we need it? This PR aims to fix padding logic in eagle proposer for kimi25. Main changes involve: 1. modify the way to obtain draft model attention builder and backend 2. add block table padding & related tensor slicing in common metadata when `draft_step>1` for solving fia verifying error 3. replace block table in `update_graph_params` for solving fia verifying error - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-21 16:57:22 +08:00
meihanc	bff4fbfca5	upgrade to 0.18.0 (#7502 ) ### What this PR does / why we need it? 1. upgrade to 0.18.0 2. ensure kernel_block_sizes is int for Eagle drafter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-03-21 16:05:38 +08:00
HongtaoYang	80a4265717	[Feat] Support separate attention backend for target and draft model. (#7342 ) ### What this PR does / why we need it? This PR enables separate attention backend configuration for target and draft models in speculative decoding, decoupling the previously bound attention backend settings between the two models. It solves the compatibility issue where some draft models do not support the attention backend used by the target model, and allows users to select the optimal attention backend for each model individually to maximize inference performance. The change is fully backward compatible. --------- Signed-off-by: SidaoY <1024863041@qq.com>	2026-03-21 10:48:01 +08:00
LI SHENGYONG	4e6dbe0956	[EPLB][Bugfix] Set parallel_config.enable_eplb to true to load redundant experts (#7470 ) ### What this PR does / why we need it? pr: https://github.com/vllm-project/vllm/pull/37136 break eplb because it filters out redundant experts. pr: https://github.com/vllm-project/vllm/pull/37322 fix it due to use parallel_config.enable_eplb to determine whether to skip the weight loading filter. But in vllm-ascend, parallel_config.enable_eplb is always false. When we use eplb, we temporarily set it to true. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? ![Snipaste_2026-03-19_16-13-01](https://github.com/user-attachments/assets/b3a4911e-36b3-4c31-951c-7c091f416d00) \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-20 15:22:55 +08:00
Nengjun Ma	8b79d4de52	Main2main upgrade to vllm 0317 afternoon (#7409 ) ### What this PR does / why we need it? 1.fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122) 2.fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027) 3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace memory related torch.cuda APIs"](https://github.com/vllm-project/vllm/pull/37031) 4.fix [Support multiple KV groups in OffloadingSpec ](https://github.com/vllm-project/vllm/pull/36610) removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor. 5.fix [Consolidate SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? E2E - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Claude Code <noreply@anthropic.com>	2026-03-18 23:24:27 +08:00
zhangyiming	1c954ff264	[main2main] upgrade vllm to 0308 (#7213 ) ### What this PR does / why we need it? Update main2main to vllm 0308. breaks: * https://github.com/vllm-project/vllm/pull/30681 * https://github.com/vllm-project/vllm/pull/35552 remove self.cudagraph_batch_sizes * https://github.com/vllm-project/vllm/pull/35158 clear_metadata -> defer_finalize * https://github.com/vllm-project/vllm/pull/36006 remove CacheConfig.cpu_offload_gb * https://github.com/vllm-project/vllm/pull/35472 * https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder * https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens * https://github.com/vllm-project/vllm/pull/28053 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-03-18 09:24:43 +08:00
zxr2333	5645ca8392	[BugFix]A2 MOE method&& layerwise MTP bugfix && Mamba gdn_metadata bugfix (#7364 ) ### What this PR does / why we need it? Some bug fixes, mainly including: 1. For A2, the number of experts each single card cannot be greater than 16 when using MC2. The PR fixed the error in the A2 moe communication method selection, which would cause the selection of an incorrect communication method when the number of model experts exceeds 256. For example, when using an A2 16-cards model to load the PD-disaggregation D node with Qwen3.5 series models, the incorrect MC2 method would be chosen. 2. Fixed the issue where the layerwise connector sends the kv-cache of the MTP layer multiple times when `num_spec_tokens` > 1. Now, the kv-cache is sent only when the MTP layer is forward for the first time. 3. Fix the accuracy issue of qwen3.5 when using MTP for PD disaggregation. The cause is that `num_decode_draft_tokens` does not consider that `spec_tokens` are not existed during the first inference when PD disaggregation (`spec_tokens` are generated during the first inference). However, `spec_tokens_padding` is added by `recomputed_scheduler`. As a result, `gdn_metadata` incorrectly considers that the prefill with a length of 2 is performed. --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-17 23:03:45 +08:00
pichangping	3f39ac9c8d	[Feature]Supports DSv3.1 PD separation and C8 quantization (#7222 ) Co-authored-by: kunpengW-code <1289706727@qq.com> Co-authored-by: linsheng1 <1950916997@qq.com> ### What this PR does / why we need it? Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which aims to reduce the GPU memory usage of the KV cache and improve the inference throughput. Constraints: 1. Only the PD separation mode can be used and MooncakeLayerwiseConnector can be used to run the model. 2. Currently, only the activation value supports dynamic quantization, and the KV cache supports static quantization. C8 quantization with MTP is not supported. You can use ModelSlim for quantization. The quantization procedure is as follows: pip install transformers==4.48.2 git clone https://gitcode.com/Ascend/msmodelslim.git cd msmodelslim bash install.sh cd example/DeepSeek/ python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pichangping <1337510399@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-03-16 22:49:05 +08:00
drslark	a6f6e919e6	[main][bugfix] Fixed the problem that eagle3 will crash in FULL_DECODE_ONLY (#7290 ) ### What this PR does / why we need it? Two problems have been solved in this pr. These problems occur in the `FULL_DECODE_ONLY` mode that `num_tokens` should be padded to some value in `cudagraph_capture_sizes`. 1. We found the length of `seq_lens_list` in drafter's `attn_metadata` is 1 shorter than expected. It will raise a kernel exception to make vllm crash. e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But `seq_lens_list` = [5742, 4700, 7996], it is not padded. 3. Though the length of `seq_lens_list` in target's `attn_metadata` is the same as expected in `FULL_DECODE_ONLY`, some data are corrupted at the end of the list. e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But `seq_lens_list` = [5742, 4700, 7996, 5738], it has corrupted at the end of the list. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-16 20:41:36 +08:00
wangx700	22d0e1d3d7	[model_runner_v2]optimize the performance of the _topk_log_softmax_kernel (#7221 ) ### What this PR does / why we need it? Optimize the performance of the triton operator _topk_log_softmax_kernel in model_runner_v2 to 1.04xH100，which is 7% of its original value.(issue https://github.com/vllm-project/vllm-ascend/issues/5208) - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangx700 <wangxin700@huawei.com>	2026-03-16 16:49:10 +08:00
rjg-lyh	4d443b9228	[bugfix] restore pr-7029 and fix patch error (#7294 ) ### What this PR does / why we need it? This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using the `lightning_indexer_quant` ops in the pd-mix stage. The original PR was reverted by #7288 because the patch did not work with the recompute scheduler. This PR also fixes the patching issue so that it works correctly with the recompute scheduler. ### Does this PR introduce _any_ user-facing change? Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to `"true"` in `additional_config`. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-16 15:39:42 +08:00
Mengqing Cao	0c299f79b9	Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029 )" (#7288 ) ### What this PR does / why we need it? This reverts commit `7ed9e9de69`, which introduces an issue that the patch doesn't work with recompute scheduler enabled. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-15 20:19:09 +08:00
Qiu	7daccf4b64	Perf(PP): support PP with async send/recv. (#7143 ) ### What this PR does / why we need it? Follow up the PR https://github.com/vllm-project/vllm/pull/33368, this PR provides async send/recv support for PP in vllm-ascend. --- - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-15 09:45:09 +08:00
Angazenn	ce5544bfc1	[Hybrid] support prefix cache for Qwen3.5/Next with `--mamba-cache-mode align` (#7103 ) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-15 09:44:09 +08:00
Mengqing Cao	e7aa2c285c	[SpecDecode] Fix Draft model proposer (#7230 ) ### What this PR does / why we need it? This pr fix the Unified draft parallel feature. 1. In Draft model proposer, there are exceed 1 attention layers in target model, thus removing the assertion on layer number. 2. we should get block size through `draft_attn_groups` instead of `attn_metadata_builder` after 0.17.0. 3. `attn_update_stack_num_spec_norm` shouldn't be done when unified draft parallel is enabled ### How was this patch tested? Test pass with `tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_parallel_drafting_acceptance`, which is already included in CI - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-14 18:26:37 +08:00
Mengqing Cao	986cd45397	[Version] Drop 0.16.0 support (#7153 ) ### What this PR does / why we need it? Drop 0.16.0 support in main - Fix eagle proposer break introduced by https://github.com/vllm-project/vllm/pull/34552. Mainly change to use the draft attention group to initialize the attention metadata builder. - Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes` error, which is a bug in vLLM v0.17.0, and fixed by a later pr https://github.com/vllm-project/vllm/pull/30515 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-13 16:14:15 +08:00
rjg-lyh	7ed9e9de69	[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029 ) ### What this PR does / why we need it? This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops in pd-mix stage mainly. Because the code for the current PD-disaggregated scenario is still under refactoring and cleanup, this PR prioritizes ensuring the C8 functionality in the pd-mix scenario. The next steps are planned in two parts: ① Once the optimized scatter operator is updated, we will replace the original operator to improve the performance of storing k_scale. ② Once the code logic for the PD-disaggregated scenario becomes stable, we will carry out more comprehensive validation and make appropriate adaptations. ③ Because enabling C8 currently introduces several new operators whose performance still needs improvement, performance may regress in some scenarios. Therefore, only after all the operators are fully ready can we ensure that this feature does not cause any performance degradation. At that point, we will enable this feature by default and remove the switch in `additional_config`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-13 14:47:42 +08:00
kx	df1ee8070d	[feat][spec decode]Unified draft parallel (#6766 ) ### What this PR does / why we need it? Implement a unified parallelized speculative decoding in VLLM Ascend，which can simultaneously support parallel speculative inference schemes such as Pard, P-Eagle, etc. refer to https://github.com/vllm-project/vllm-ascend/pull/6565 and https://github.com/vllm-project/vllm-ascend/pull/4078 ### How was this patch tested? run with parallel drafting script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' base script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 benchmark script: MAX_CONCURRENCY=1 NUM_PROMPTS=80 vllm bench serve --port 8811 \ --temperature 0 \ --model /model/Llama-3.1-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --dataset-path philschmid/mt-bench \ --num-prompts ${NUM_PROMPTS} \ --max-concurrency ${MAX_CONCURRENCY} \ --seed 1234 test results : base(without spec decode): TTFT 79.46ms TPOT 26.99ms output_tokens_throughput 36.75 tok/s this pr(with parallel drafting): TTFT 72.24ms TPOT 13.45ms output_tokens_throughput 72.98 tok/s per-position acceptance(from position 0 to 7): 79.48%、56.93%、40%、27.90%、19.79%、14.25%、10.57%、7.61%. ---------------------------------------------------------------------- run on qwen3 model script ： export target=/model/Qwen3-1.7B export draft=/model/PARD-Qwen3-0.6B export CUDA_VISIBLE_DEVICES=1 export ASCEND_RT_VISIBLE_DEVICES=1 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' cc @NickJudyHvv - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: 01267596 <xiongkai123@cmbchina.com> Signed-off-by: kx <1670186653@qq.com> Signed-off-by: HF-001 <1670186653@qq.com> Co-authored-by: 01267596 <xiongkai123@cmbchina.com>	2026-03-13 14:07:35 +08:00
Qiu	c377e73933	Perf(PP): support PP with async scheduling. (#7136 ) ### What this PR does / why we need it? Follow up the PR https://github.com/vllm-project/vllm/pull/32618, this PR provides async scheduling support for PP in vllm-ascend. --- - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-13 10:27:23 +08:00
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
zxr2333	fe4cad24e9	[BugFix]fix qwen3.5 reshape_kvcache bug (#7209 ) ### What this PR does / why we need it? This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the Mamba cache for models like Qwen3.5. The previous implementation did not correctly handle cases where the KV cache tensors have different data types. This change ensures that slicing is performed based on byte offsets before reshaping the tensors, which correctly handles heterogeneous dtypes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-12 23:51:40 +08:00
drslark	de93790d08	[main][bugfix] Fixed the problem of drafter crashed in FULL mode (#7158 ) ### What this PR does / why we need it? The merged graph of draft in `FULL` mode is broken now. This pr solves it. Also, `actual_seq_lengths_q` in `model_runner` is found redundant, so, it is removed. It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and https://github.com/vllm-project/vllm-ascend/pull/7148. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Test code is shown as below: ```python prompts = [ "1.Who are you?", "2. Who are you?", ] sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200) llm = LLM( model="/home/some-model/Meta-Llama-3.1-8B-Instruct", tensor_parallel_size=1, max_num_seqs=32, # enforce_eager=True, disable_log_stats=False, distributed_executor_backend="mp", gpu_memory_utilization=0.7, async_scheduling=True, speculative_config={ "enforce_eager": True, "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B", "disable_padded_drafter_batch": False, "method": "eagle3", "num_speculative_tokens": 3, }, compilation_config={ "cudagraph_mode": "FULL", "cudagraph_num_of_warmups": 1, }, max_model_len=4096, enable_prefix_caching=False, ) outputs = llm.generate(prompts, sampling_params) ``` The result before: ```text File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia graph_params.events[num_tokens].append(event) ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ KeyError: 132 ``` The result after: ```text -------------------------------------------------- total_num_output_tokens: 400 num_drafts: 242 num_draft_tokens: 726 num_accepted_tokens: 156 mean acceptance length: 1.64 -------------------------------------------------- acceptance at token 0: 0.42 acceptance at token 1: 0.16 acceptance at token 2: 0.07 ``` We also test `FULL_DECODE_ONLY` mode. The result is: ```text -------------------------------------------------- total_num_output_tokens: 400 num_drafts: 244 num_draft_tokens: 732 num_accepted_tokens: 155 mean acceptance length: 1.64 -------------------------------------------------- acceptance at token 0: 0.42 acceptance at token 1: 0.16 acceptance at token 2: 0.06 ``` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-12 18:38:50 +08:00
无脸男	09d26754cd	[Bugfix] Fix the issue where no exception is thrown when graph capture fails. (#5644 ) ### What this PR does / why we need it? Fix the issue where no exception is thrown when graph capture fails. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: WithHades <244036962@qq.com>	2026-03-12 16:14:45 +08:00
Hexiang Wang	f244f3c4a9	[BugFix] Fix problem of extra processes on rank0 device (#7107 ) ### What this PR does / why we need it? Currently when tp>1, we have extra processes on tp rank0 device which consumes extra HBM memory. This is caused by `import torch_npu._inductor` before set_device which introduces extra initialization of device. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All ci passed. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-12 15:59:03 +08:00
XiaoxinWang	37d1bd8c50	fixed fia pad logic in graph mode. (#7144 ) ### What this PR does / why we need it? related to vllm PR #34043 this pr delete func ‘relax_for_mixed_batch_cudagraphs’, num_reqs no longer equals the actual number of requests, due to fia operator requires that query_start_loc[-1] equals the total number of computed tokens, so this func delete cause the ifa error. In full graph mode, set num_reqs_paded = num_reqs to fix the error ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2026-03-12 14:50:54 +08:00

1 2 3 4 5 ...

572 Commits