## Summary
This PR was auto-generated by the **Update estimated test times**
[workflow](https://github.com/vllm-project/vllm-ascend/actions/runs/23226502411).
It updates the `estimated_time` values in
`.github/workflows/scripts/config.yaml` based on actual elapsed times
collected from CI workflow runs.
### Methodology
- Each e2e test job uploads its elapsed time as a `timing-data-*`
artifact upon completion.
- The workflow aggregates all collected timing artifacts across jobs.
- For each test, the **median** elapsed time is computed to reduce
outlier impact.
- A **10% safety buffer** is applied and the result is rounded to the
nearest 10 seconds.
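For reference, the per-test estimate reduces to roughly the following (a minimal sketch; the actual aggregation script in the workflow may differ in details such as the rounding direction):

```python
import statistics

def estimate_time(elapsed_seconds: list[float]) -> int:
    """Median of observed runtimes, plus a 10% buffer, rounded to the nearest 10 s."""
    median = statistics.median(elapsed_seconds)
    return round(median * 1.10 / 10) * 10

# e.g. samples of 583 s, 601 s, 597 s -> median 597 -> 656.7 -> 660
assert estimate_time([583, 601, 597]) == 660
```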
### Review Checklist
- [ ] Verify that updated `estimated_time` values are within a
reasonable range.
- [ ] Confirm no test entries are missing or unexpectedly removed.
> If the new values look reasonable, feel free to merge. Otherwise,
leave a comment describing the anomaly.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
### What this PR does / why we need it?
This PR introduces a new fused Triton kernel, `split_qkv_tp_rmsnorm_rope`, for MiniMax-M2.5.
The implementation includes two Triton kernels:
1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input
and computes the local variance for RMSNorm.
2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering
TP all-reduce for variance) and Neox-style RoPE.
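For reference, the numerical behavior of the fused path corresponds roughly to the eager sketch below, assuming the RMSNorm reduction spans a dimension that is sharded across TP ranks (which is why the variance needs an all-reduce); all names and shapes here are illustrative, not the kernel's API:

```python
import torch
import torch.distributed as dist

def split_qkv_rmsnorm_rope_reference(qkv, q_size, k_size, v_size,
                                     q_weight, k_weight, cos, sin,
                                     global_norm_dim, eps=1e-6):
    # Kernel 1: split the fused QKV projection output and accumulate the
    # local sum of squares needed for the RMSNorm variance.
    q, k, v = qkv.split([q_size, k_size, v_size], dim=-1)
    local_stats = torch.stack([q.float().pow(2).sum(-1),
                               k.float().pow(2).sum(-1)])

    # Between the kernels: all-reduce so every TP rank sees the global variance.
    dist.all_reduce(local_stats)
    q_var = local_stats[0] / global_norm_dim
    k_var = local_stats[1] / global_norm_dim

    # Kernel 2: apply the global RMSNorm, then Neox-style RoPE.
    q = q.float() * torch.rsqrt(q_var + eps).unsqueeze(-1) * q_weight
    k = k.float() * torch.rsqrt(k_var + eps).unsqueeze(-1) * k_weight

    def neox_rope(x):
        # Neox style pairs dimension i with i + dim // 2 (half-split, not
        # interleaved); cos/sin are assumed to broadcast against each half.
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    return neox_rope(q).to(qkv.dtype), neox_rope(k).to(qkv.dtype), v
```

Splitting the computation at the all-reduce boundary is what motivates the two separate kernels.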
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```bash
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py
```
### Test Data
A3 TP16
Baseline
| data | TTFT(ms) | TPOT(ms) | TPS |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1 | 267.55 | 25.5 | 38.85 |
| 4k/1k@bs4 | 542.4 | 26.51 | 148.06 |
With this PR
| data | TTFT(ms) | TPOT(ms) | TPS |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1 | 234.64 | 20.96 | 47.24 |
| 4k/1k@bs4 | 508.36 | 22.16 | 176.69 |
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: xutianyi <xutianyi5@huawei.com>
Co-authored-by: xutianyi <xutianyi5@huawei.com>
### What this PR does / why we need it?
Upgrade the vLLM commit to 0318.
Main change: add a pre-step to the test cases that previously failed because earlier test cases did not release NPU memory in time. The pre-step cleans up NPU memory and waits (by default up to 50 s) for the cleanup to complete.
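The waiting step reduces to a polling loop along these lines (a sketch assuming `torch.npu` mirrors the `torch.cuda` memory APIs via `torch_npu`; the helper name and threshold are illustrative):

```python
import gc
import time

import torch
import torch_npu  # noqa: F401  (registers the torch.npu namespace)

def wait_for_npu_memory_release(threshold_bytes: int = 1 << 30,
                                timeout_s: float = 50.0) -> bool:
    """Poll until allocated NPU memory drops below a threshold, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        gc.collect()
        torch.npu.empty_cache()
        if torch.npu.memory_allocated() < threshold_bytes:
            return True
        time.sleep(1.0)
    return False  # caller can proceed anyway and log a warning
```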
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
NPU resources are not released immediately when custom operator test
cases are executed, causing an error when other operator test cases are
executed.
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
### What this PR does / why we need it?
This PR fixes the layer name mapping logic in `AscendModelSlimConfig`
for quantization config loading.
1. **kimi_k2 model layer name mapping issue**: The `kimi_k2` model has a
unique layer naming convention that differs from the standard
`hf_to_vllm` mapping. One layer was defined in the mapper but was not
being correctly applied, causing quantization config lookup failures.
2. **Manual mapping registration timing issue**: The manual mapping
check in `apply_vllm_mapper` was executed before `vllm_config` was
initialized, causing `model_type` to be unavailable. This prevented some
models with manual mappings from being correctly registered.
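A minimal sketch of the re-ordering (class and method names here are illustrative, not the actual `AscendModelSlimConfig` API):

```python
QUANT_MODEL_PREFIX_MAPPINGS = {
    # populated per model family, e.g. "kimi_k2": {<hf prefix>: <vllm prefix>}
}

class QuantConfigSketch:
    def __init__(self):
        self.vllm_config = None   # not available yet at construction time
        self.prefix_mapping = {}

    def set_vllm_config(self, vllm_config):
        self.vllm_config = vllm_config
        # Register manual mappings only *after* vllm_config is attached,
        # because model_type is read from it; a check in __init__ always
        # saw model_type as unavailable and silently skipped registration.
        model_type = vllm_config.model_config.hf_config.model_type
        manual = QUANT_MODEL_PREFIX_MAPPINGS.get(model_type)
        if manual is not None:
            self.prefix_mapping.update(manual)
```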
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Tested with `kimi_k2` model to verify the special layer name mapping
works correctly. Also tested with other models that have manual mappings
defined in `QUANT_MODEL_PREFIX_MAPPINGS` to ensure the registration
timing fix works properly.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Add accuracy (acc) nightly CI test cases for the GLM-4.7 model.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
through CI
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
Fix issues in the GLM4.7 documentation and add some missing
explanations.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
document test
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise the Memcache parameters for better clarity, and add a note that heterogeneous protocol settings are currently not supported (e.g. enabling `device_rdma` and `device_sdma` at the same time; an example scenario would be data transfer by Memcache across different super pods).
2. Modify the condition for the Mooncakestore warmup: warmup is now needed only when `ASCEND_BUFFER_POOL` is enabled.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
### What this PR does / why we need it?
remove deprecated environment variables related to MLP prefetching
### Does this PR introduce _any_ user-facing change?
Yes. The deprecated env vars can no longer be used.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Adds a scheduled CI workflow (schedule_release_code_and_wheel.yml) to
automatically build and release vllm-ascend source packages and binary
wheels for multiple Ascend hardware targets.
Key features:
1. Source release: Builds tar.gz sdist and uploads to PyPI on version
tag push
2. Multi-hardware wheel builds: Supports three hardware targets in
parallel:
2.1 A2 (Ascend 910B): x86_64 + ARM64, Python 3.10 / 3.11
2.2 A3 (Ascend 910C): x86_64 + ARM64, Python 3.10 / 3.11
2.3 310P: x86_64 + ARM64, Python 3.10 / 3.11
3. Wheel repair: Uses auditwheel to produce manylinux-compatible wheels,
excluding Ascend NPU runtime libs (libascend*.so, libtorch*.so, etc.)
that must be provided by the runtime environment
4. Variant wheels: Generates hardware-variant wheels via variantlib for
hardware-specific distribution
5. OBS upload: Aggregates all variant wheels and a combined index JSON,
then uploads to Huawei OBS for hosting
### Does this PR introduce _any_ user-facing change?
Yes. Users will be able to install hardware-specific vllm-ascend wheels
from PyPI or the OBS variant index, eliminating the need to build from
source.
### How was this patch tested?
1. CI verification only — workflow syntax and job dependency logic
reviewed manually
2. Wheel build steps validated against existing Dockerfiles
(Dockerfile.buildwheel.a2/a3/310p)
3. auditwheel exclusion list verified against known Ascend runtime
shared libraries
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: YanZhicong <mryanzhicong@163.com>
Co-authored-by: YanZhicong <mryanzhicong@163.com>
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise Mooncake environment variables and kv connector extra configs.
2. Delete `use_ascend_direct` from the kv connector extra config, as it is deprecated.
3. Delete `kv_buffer_device` and `kv_rank` from the P2P Mooncake config.
4. Unify the default `max-model-len` and `max-num-batched-tokens` in the given examples.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
### What this PR does / why we need it?
This PR adds a new CI log summarizer, `ci_log_summary.py`, and wires it
into unit-test and e2e workflows so failed jobs publish a structured
failure summary to the GitHub step summary.
Examples:
- `python3 .github/workflows/scripts/ci_log_summary.py --log-file
/tmp/unit-test.log --mode ut --step-name "Unit test"`
- `python3 .github/workflows/scripts/ci_log_summary.py --run-id
23127187822 --format json`
A maintenance note is added to `ci_utils.py` to clarify that the `START`
/ `PASSED` / `FAILED (exit code X)` log lines are parsed by
`ci_log_summary.py`, so any future format changes must be coordinated
with the corresponding summarizer regexes.
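Conceptually, the contract between the two scripts is a handful of line formats; the summarizer side looks roughly like this (the regexes here are illustrative, not copied from `ci_log_summary.py`):

```python
import re

# Hypothetical patterns mirroring the START / PASSED / FAILED (exit code X)
# lines emitted by ci_utils.py; the real regexes live in ci_log_summary.py.
START_RE = re.compile(r"^START\s+(?P<name>\S+)")
PASSED_RE = re.compile(r"^PASSED\s+(?P<name>\S+)")
FAILED_RE = re.compile(r"^FAILED \(exit code (?P<code>\d+)\)\s+(?P<name>\S+)")

def collect_failures(log_lines):
    """Return (test name, exit code) pairs for every FAILED line."""
    failed = []
    for line in log_lines:
        m = FAILED_RE.match(line)
        if m:
            failed.append((m.group("name"), int(m.group("code"))))
    return failed
```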
🤖 Generated with Codex <noreply@openai.com>
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: meihanc <jcccx.cmh@gmail.com>
Co-authored-by: Codex <noreply@openai.com>
### What this PR does / why we need it?
1. Mamba Cache Support on 310P: Implemented logic to correctly
initialize and allocate KV cache for Mamba models on the 310P platform,
including handling of state tensors and page size alignment.
2. Increased Attention Head Size Support: Modified the attention backend
to support `attn_head_size` larger than 128 by dynamically selecting
appropriate kernel block sizes based on hardware limitations (e.g.,
`block_size * head_size <= 16384`); see the sketch after this list.
3. Refactored KV Cache Allocation: Consolidated and improved the KV
cache allocation mechanism, moving from separate size calculation and
allocation steps to a unified _allocate_kv_cache_tensors method that
handles both Attention and Mamba specific cache structures.
4. Dynamic Mamba Config Patching: Introduced conditional loading of
Mamba configuration patches, specifically using patch_mamba_config_310
for the 310P platform to ensure platform-specific optimizations and
validations.
5. Reserve reasonable memory to allocate KV cache to avoid OOM issue
with default gpu_memory_utilization.
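For item 2, the block-size selection reduces to shrinking the kernel block size until the hardware constraint holds (a minimal sketch; the 16384 limit comes from the description above, while the halving strategy is illustrative):

```python
MAX_BLOCK_ELEMS = 16384  # hardware limit cited above: block_size * head_size

def pick_kernel_block_size(requested_block_size: int, head_size: int) -> int:
    """Shrink the kernel block size until block_size * head_size fits."""
    block_size = requested_block_size
    while block_size > 1 and block_size * head_size > MAX_BLOCK_ELEMS:
        block_size //= 2
    return block_size

assert pick_kernel_block_size(128, 128) == 128  # 16384: exactly at the limit
assert pick_kernel_block_size(128, 192) == 64   # head_size > 128 forces smaller blocks
```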
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3.5 E2E test
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
1. Fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122).
2. Fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027).
3. Fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory related torch.cuda APIs](https://github.com/vllm-project/vllm/pull/37031).
4. Fix [Support multiple KV groups in OffloadingSpec](https://github.com/vllm-project/vllm/pull/36610), which removed `self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar to a tuple of per-group block sizes, adding `block_size_factor`.
5. Fix [Consolidate SupportsEagle](https://github.com/vllm-project/vllm/pull/36063), which renamed `get_eagle3_aux_hidden_state_layers()` to `get_eagle3_default_aux_hidden_state_layers()` and added a `supports_eagle3()` guard before calling it.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
### What this PR does / why we need it?
Adapt to the model type of Qwen3-VL-8B-Instruct-W8A8
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
### What this PR does / why we need it?
1. Add a nightly test for MiniMax-M2.5 deployed on A3.
2. Add a MiniMax-M2.5 deployment introduction to the vllm-ascend docs.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
Add Kimi-K2.5 weights download.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: LoganJane <loganJane73@hotmail.com>
### What this PR does / why we need it?
Documented an issue in the 2-node PD mixed deployment scenario where
inference may hang when concurrency exceeds 8 (GLM5).
Noted that the issue has been fixed in PRs:
- #7235
- #7290
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
This PR fixes the logger initialization in patches so that the log info
can be displayed as expected.
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Updated the DSV32 document.
1. Changed the PD-separation boot mode to layerwise.
2. Changed `max-num-batched-tokens` to a multiple of the TP size to avoid triggering a validation error.
3. Added a link to help users adjust the configuration.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: wyh145 <1987244901@qq.com>
### What this PR does / why we need it?
The rotary algorithm in the DeepSeek indexer should be Neox-style instead of GPT-J-style. PR #4641 fixed this accuracy bug in the original PyTorch version, but PR #5701 accidentally removed the fixed code line and reverted the implementation to the problematic version. This PR restores the fix.
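For context, the two styles differ only in how dimension pairs are chosen; these are the standard definitions (reference PyTorch, not code from this PR):

```python
import torch

def rope_neox(x, cos, sin):
    # Neox style: pair dimension i with i + dim // 2 (rotate the two halves).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_gptj(x, cos, sin):
    # GPT-J style: pair adjacent even/odd dimensions (2i with 2i + 1).
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return out.flatten(-2)
```

Applying one style where the weights expect the other scrambles the pairings, which is exactly the kind of silent accuracy bug described here.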
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
LayerwiseConnector now supports the virtual push functionality on the D node. By adding a `do_virtual` flag to the request metadata, the system can identify and process certain requests virtually, bypassing the actual KV cache transfer. This allows immediate completion of these requests from the consumer's perspective, potentially enabling optimizations or specific testing scenarios where physical data transfer is not required.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
When we check out a fork repo and want to push commits back to it, the `pat_token` is needed.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Issues labeled `resolved`: marked stale after 7 days of inactivity, then closed 14 days after going stale, carrying the `stale` and `resolved` labels.
2. Issues labeled `awaiting-feedback`: marked stale after 7 days of inactivity, then closed 14 days after going stale, carrying the `stale` and `awaiting-feedback` labels.
Change items:
- Add a scheduled stale-management workflow to process resolved and awaiting-feedback issues independently.
- Automatically mark inactive issues as stale, post tailored reminder messages, and close issues after a grace period.
- Remove source labels when issues become active again, and disable PR stale handling so the automation remains issue-scoped.
### Does this PR introduce _any_ user-facing change?
- No API or runtime behavior changes.
- This PR only updates GitHub issue automation (labeling and stale
management workflow).
### How was this patch tested?
- Test locally
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: drizzlezyk <drizzlezyk@163.com>
- Replace minimal label rules with a comprehensive keyword-based issue
labeler taxonomy.
- Add grouped labels for core features and advanced capabilities to
improve issue routing.
- Expand model-related matching for LLM, multimodal generation,
multimodal understanding, audio, and omni scenarios.
- Add/normalize regex patterns for common model families (DeepSeek,
Kimi, GLM, Qwen, 310p, etc.) to increase auto-label coverage and
consistency.
### What this PR does / why we need it?
- Expands `.github/issue-labeler.yml` from a minimal set of rules to a
richer keyword-based labeling configuration.
- Adds grouped label dimensions for:
- Core features (e.g., PD disaggregation, KV cache pool, ACLGraph, async
scheduler, CPU binding, quantization)
- Advanced features (e.g., long sequence, DPC/PCP, MTP/speculative
decode)
- Model categories (LLM, multimodal generation, multimodal
understanding, audio, omni, etc.)
- Specific model families (e.g., DeepSeek, Kimi, GLM, Qwen, 310p)
- Improves automatic issue triage accuracy and reduces manual label
maintenance effort.
- Makes issue categorization more consistent for maintainers and
contributors.
Why needed:
- Existing labeler rules were too limited and could not adequately cover
current feature/model issue distribution.
- Broader and more structured matching helps faster routing,
prioritization, and ownership assignment.
Fixes #N/A
### Does this PR introduce _any_ user-facing change?
- No runtime/API user-facing changes.
- This PR only updates GitHub issue automation rules.
### How was this patch tested?
- Performed static validation and review of `.github/issue-labeler.yml`
structure and regex entries.
- Verified that rule groups and label keys are correctly formatted for
GitHub issue labeler consumption.
- Confirmed that legacy minimal rules were replaced by expanded taxonomy
without syntax-breaking YAML changes.
- No unit/e2e tests were added because this is repository automation
configuration (GitHub labeling rules) rather than application runtime
logic.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: drizzlezyk <drizzlezyk@163.com>
### What this PR does / why we need it?
#### Problem
When decode node enables prefix cache and the local prefix cache fully
hits, the following assertion error occurs:
```
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in step_with_batch_queue
(EngineCore_DP3 pid=34912) engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 1520, in update_from_output
(EngineCore_DP3 pid=34912) self._update_from_kv_xfer_finished(kv_connector_output)
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 2120, in _update_from_kv_xfer_finished
(EngineCore_DP3 pid=34912) assert RequestStatus.is_finished(req.status)
(EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) AssertionError
```
The error is triggered in `scheduler.py` at `_update_from_kv_xfer_finished`:
```python
if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
self.finished_recving_kv_req_ids.add(req_id)
else:
assert RequestStatus.is_finished(req.status)
```
#### Root Cause
When decode node has prefix cache enabled and local prefix cache fully
hits:
1. `get_num_new_matched_tokens` returns `ext_tokens=0, load_kv_async=False` when the decode prefix cache fully hits.
2. The request status becomes RUNNING (not WAITING_FOR_REMOTE_KVS).
3. However, `update_state_after_alloc` still adds the request to `_reqs_need_recv` because `remote_block_ids` exists in `kv_transfer_params`.
4. The worker processes the request in `_handle_request`:
   - `_transfer_kv_cache` returns immediately (no actual transfer; `local_block_ids` is empty)
   - the `finally` block still calls `update_done_task_count(request_id)`
5. `finished_recving` contains this request.
6. When `_update_from_kv_xfer_finished` processes `finished_recving`, the request status is RUNNING.
7. The assertion fails.
#### Solution
In `_handle_request`, only notify the scheduler (`update_done_task_count`) when an actual KV transfer happened (`local_block_ids` is not empty). The signals that notify Prefill to release its KVCache (`_send_done_signal_to_free_remote_port` and `_send_done_recv_signal`) are still sent regardless.
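A minimal sketch of the guard (method names follow the description above; the actual connector code may differ):

```python
def _handle_request(self, request_id, local_block_ids, remote_meta):
    try:
        if local_block_ids:  # an actual KV transfer is needed
            self._transfer_kv_cache(request_id, local_block_ids, remote_meta)
    finally:
        # Always tell Prefill it can release its KVCache ...
        self._send_done_signal_to_free_remote_port(request_id)
        self._send_done_recv_signal(request_id)
        # ... but only notify the scheduler when a transfer really ran, so a
        # fully prefix-cache-hit RUNNING request never lands in finished_recving.
        if local_block_ids:
            self.update_done_task_count(request_id)
```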
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: LCAIZJ <leichao139636@163.com>
### What this PR does / why we need it?
Some bug fixes, mainly including:
1. For A2, the number of experts on each single card cannot be greater than 16 when using MC2. This PR fixes an error in the A2 MoE communication-method selection that caused an incorrect communication method to be chosen when the number of model experts exceeds 256. For example, when loading the PD-disaggregation D node with Qwen3.5-series models on a 16-card A2 setup, the incorrect MC2 method would be chosen.
2. Fixed the issue where the layerwise connector sends the kv-cache of the MTP layer multiple times when `num_spec_tokens` > 1. Now the kv-cache is sent only when the MTP layer is forwarded for the first time.
3. Fixed the accuracy issue of Qwen3.5 when using MTP for PD disaggregation. The cause is that `num_decode_draft_tokens` does not account for the fact that `spec_tokens` do not yet exist during the first inference under PD disaggregation (`spec_tokens` are generated during that first inference), while `spec_tokens_padding` is still added by the `recomputed_scheduler`. As a result, `gdn_metadata` incorrectly assumes that a prefill of length 2 is being performed.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
Upload docs for Qwen3.5-27B and Qwen3.5-397B-A17B on Ascend, based on vllm-ascend:v0.17.0rc1.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
### What this PR does / why we need it?
Qwen3.5 MoE supports enabling the `dispatch_ffn_combine` fusion operator.
Fixed problem: in the W8A8 quantization scenario, the Qwen3.5 model's config.json lacks the quantize field. The previous logic strictly relied on `quant_type == "w8a8_dynamic"` to enable `VLLM_ASCEND_ENABLE_FUSED_MC2`, so the `dispatch_ffn_combine` fusion operator failed to activate even when the environment variable was set.
Also enables the `dispatch_ffn_combine` fusion operator for BF16 scenarios.
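The relaxed enabling condition amounts to something like this (a sketch of the described logic; only the env var name comes from the description, the rest is illustrative):

```python
import os

def fused_mc2_enabled(quant_type: str | None, dtype: str) -> bool:
    # The environment switch must be on in any case.
    if os.environ.get("VLLM_ASCEND_ENABLE_FUSED_MC2", "0") != "1":
        return False
    # Previously only quant_type == "w8a8_dynamic" qualified, so a model whose
    # config.json lacks the quantize field could never enable the fusion.
    # Now BF16 (unquantized) also qualifies.
    return quant_type == "w8a8_dynamic" or dtype == "bfloat16"
```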
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: asunxiao <asunxiao@qq.com>
### What this PR does / why we need it?
Fix the incorrect decompression path of the FIA operator package, which created unnecessary folders.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
This PR optimizes bias handling in `AscendRMSNorm` without changing the intended functional behavior.
In the current implementation, a bias may be initialized for `AscendRMSNorm` based on configuration-level detection, even though some norm layers never actually load a bias weight. This can cause the inference path to enter the bias branch and execute an unnecessary `add_` operator.
To improve this, this PR introduces a loader-based flag to record whether the bias has actually been loaded. The bias addition is then executed only when the bias is truly present.
This optimization reduces redundant computation in inference and makes the bias-application logic better aligned with the actual model weights.
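A minimal sketch of the loader-based flag (illustrative; the real `AscendRMSNorm` interface differs):

```python
import torch
from torch import nn

class RMSNormWithOptionalBias(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps
        self.bias_loaded = False  # set by the weight loader, not by config

    def load_bias(self, loaded_weight: torch.Tensor) -> None:
        self.bias.data.copy_(loaded_weight)
        self.bias_loaded = True   # a real bias exists in the checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        var = x.float().pow(2).mean(dim=-1, keepdim=True)
        out = (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight
        if self.bias_loaded:      # skip the redundant add when no bias was loaded
            out = out + self.bias
        return out
```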
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR fixes the bug for eagle3 and cp enable introduced by the
parallel speculative inference PR.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
E2E tests and UT.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
As issue #7201 reported, some `TransposeKvCacheByBlock`-related ERRORs appear in the plog when vLLM launches. Although they don't affect vLLM's operation, these ERRORs are confusing during debugging, so this PR fixes the problem as suggested.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which reduces the KV cache's memory usage and improves inference throughput.
Constraints:
1. Only the PD separation mode can be used, and MooncakeLayerwiseConnector must be used to run the model.
2. Currently, only the activation values support dynamic quantization, and the KV cache supports static quantization. C8 quantization with MTP is not supported. You can use ModelSlim for quantization; the quantization procedure is as follows:
```bash
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> \
    --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json \
    --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json \
    --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
```
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Two problems are solved in this PR. Both occur in the `FULL_DECODE_ONLY` mode, where `num_tokens` should be padded to some value in `cudagraph_capture_sizes`.
1. We found that the length of `seq_lens_list` in the drafter's `attn_metadata` is 1 shorter than expected, which raises a kernel exception and crashes vLLM. E.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is padded correctly to [4, 8, 12, 20], but `seq_lens_list` = [5742, 4700, 7996] is not padded.
2. Though the length of `seq_lens_list` in the target's `attn_metadata` matches what `FULL_DECODE_ONLY` expects, some data at the end of the list is corrupted. E.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is padded correctly to [4, 8, 12, 20], but `seq_lens_list` = [5742, 4700, 7996, 5738] is corrupted at the end.
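Conceptually, both fixes amount to padding `seq_lens_list` out to the number of request slots in the captured graph (a sketch; the pad value and slot count here are illustrative):

```python
def pad_seq_lens(seq_lens: list[int], graph_num_reqs: int,
                 pad_value: int = 1) -> list[int]:
    """Pad (and drop any stale tail data) so len(seq_lens) matches the
    number of request slots in the captured graph."""
    padded = seq_lens[:graph_num_reqs]
    padded += [pad_value] * (graph_num_reqs - len(padded))
    return padded

# Example from the description: 3 real requests, 4 slots in the captured graph.
assert pad_seq_lens([5742, 4700, 7996], 4) == [5742, 4700, 7996, 1]
```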
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
This PR fixes a bug in the Xlite backend (https://atomgit.com/openeuler/GVirt/issues/3).
It adds support for the mrope (multimodal rotary position embedding) and deepstack features in the xlite backend. These features are necessary for running certain multimodal models that utilize them.
The main changes include:
- Updating `_build_model_config` to parse mrope and deepstack
configurations from the model's `hf_config`.
- Modifying `XliteWrapper.__call__` to handle `deepstack_input_embeds`
and mrope positions during the model forward pass.
- Replacing `ModelAttnMeta` with the newer `AttnMeta` to accommodate the
new metadata fields required by these features.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
online server config:
```
python -m vllm.entrypoints.openai.api_server \
--model /mnt/nvme0n1/models/checkpoint-8200 \
--additional-config='{"xlite_graph_config": {"enabled": true}}' \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 8192 \
--max-num-seqs=20 \
--block-size 128 \
--max-model-len 8192 \
--trust-remote-code \
--served-model-name Qwen3-VL-8B \
--host localhost \
--generation-config vllm \
--port 6777
```
test_config:
```
vllm bench serve \
--max-concurrency ${maxconcurrency} \
--num-prompts ${num_prompts} \
--host ${HOST} \
--port ${PORT} \
--model ${MODEL_NAME} \
--dataset-name random \
--backend openai-chat \
--random-input-len 512 \
--random-output-len 512 \
--random-range-ratio 0.2 \
--temperature 0.6 \
--metric-percentiles "50,90,99" \
--tokenizer ${TOKENIZER_PATH} \
--endpoint /v1/chat/completions \
--ignore-eos
```
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: LVYANGGUO <lvyangguo@huawei.com>
Co-authored-by: LVYANGGUO <lvyangguo@huawei.com>
### What this PR does / why we need it?
Optimize the performance of the Triton operator `_topk_log_softmax_kernel` in model_runner_v2 to 1.04x H100, roughly a 7% improvement over its original value (issue https://github.com/vllm-project/vllm-ascend/issues/5208).
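For reference, the operator's semantics, independent of the Triton optimization, are just a fused log-softmax plus top-k (an eager PyTorch equivalent, not the kernel itself):

```python
import torch

def topk_log_softmax(logits: torch.Tensor, k: int):
    """Top-k log-probabilities per row: log_softmax followed by topk."""
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.topk(k, dim=-1)  # (values, indices)
```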
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangx700 <wangxin700@huawei.com>
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.
The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.
This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.
### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Add an e2e test for the QuaRot model with eagle3 that runs both the QuaRot model and the float model and compares their acceptance rates. The PRs adapting the QuaRot model to eagle3 are #6914 and #7038.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>