xc-llm-ascend

Author	SHA1	Message	Date
Jing Wang	b6549b6e38	Add feature: priority Signed-off-by: Jing Wang <jingwang96@qq.com>	2026-05-13 06:16:25 +00:00
Jing Wang	d627a45881	fix multiproc executor determine kv cache memory & update Dockerfile Signed-off-by: Jing Wang <jingwang96@qq.com>	2026-05-09 07:10:37 +00:00
Jing Wang	6c097beaa5	adapt to vllm-ascend v0.18.0 Signed-off-by: Jing Wang <jingwang96@qq.com>	2026-05-09 07:10:12 +00:00
zzzzwwjj	e18643f8a4	[doc][0.18.0] v0.18.0 release note (#8383 ) ### What this PR does / why we need it? Added v0.18.0 release note. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: zzzzwwjj <1183291235@qq.com>	2026-04-30 18:44:08 +08:00
Nagisa125	600bf80c6d	[CI]Fix the error caused by layer_sharding in dsv32 (#8719 ) ### What this PR does / why we need it? This PR fixes the error in DSV32 mixed deployment caused by enabling layer_sharding. - Currently, mixed deployment no longer supports the enabling of layer_sharding. Therefore, it has been removed from the service-oriented configuration. - The error "RPC call to sample_tokens timed out" occurred because the dshm size limit was set too small. Therefore, it was increased to 512 Gi. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? The nightly test has passed. Signed-off-by: wyh145 <1987244901@qq.com>	2026-04-30 10:35:48 +08:00
ZYang6263	483f7d8188	[Doc][v0.18.0] Add GLM-5.1 to models tutotials (#8778 ) ### What this PR does / why we need it? Add a description of glm-5.1 in the document. Signed-off-by: ZYang6263 <zy626375@gmail.com>	2026-04-30 10:09:55 +08:00
LI SHENGYONG	0cb4bca1ff	[Doc] EPLB update the documentation (#8795 ) ### What this PR does / why we need it? 1. Update documentation：In the current version, we recommend using the following: policy of swift balancer(2). ### Does this PR introduce _any_ user-facing change? --additional-config '{ "eplb_config": { "dynamic_eplb": true, "expert_heat_collection_interval": 600, "algorithm_execution_interval": 50, "eplb_policy_type": 2, "num_redundant_experts": {ep_size} }}' ### How was this patch tested? Test in DSV3.1 EP32 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-04-29 17:15:29 +08:00
zhangxinyuehfad	bc5ca2c856	[0.18.0][Bugfix] Restore VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT to original value for nightly test (#8794 ) ### What this PR does / why we need it? PR #8618 renamed `VLLM_NIXL_ABORT_REQUEST_TIMEOUT` to `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT` and simultaneously reduced the timeout value from 300000 to 480 seconds in the nightly test configs. The 480s value is far too short for heavy multi-node workloads (DeepSeek V3/R1 under W8A8 + EP), causing [spurious abort-request timeouts](https://github.com/vllm-project/vllm-ascend/actions/runs/25067539406/job/73441223206) in CI. This PR restores the timeout value to the original 300000 to fix the nightly test failures introduced by #8618. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-04-29 14:31:12 +08:00
ZT-AIA	96b90ad625	[CI] repair bug custom op ci conftest.py (#8786 ) ### What this PR does / why we need it? repair bug custom op ci `conftest.py`：Some test cases fail or are skipped, leading to incorrect time retrieval. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-29 10:17:08 +08:00
zouyida2052	73f55bd6da	[Misc] add memfabric and memcache into requirements(#8748 ) ### What this PR does / why we need it?. add memcache and memfabric into dockerfile ### Does this PR introduce _any_ user-facing change? no --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-04-29 09:30:15 +08:00
SILONG ZENG	2e2aaa2fae	[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701 ) ### What this PR does / why we need it? This PR fixes various documentation issues and improves code examples throughout the project. Signed-off-by: MrZ20 <2609716663@qq.com>	2026-04-28 09:01:25 +08:00
pppeng	9a0b786f2b	[bugfix][0.18.0] Fix race in non-blocking num_accepted_tokens (#8764 ) ### What this PR does / why we need it? The same fix from https://github.com/vllm-project/vllm/pull/36013. In _update_states_after_model_execute, num_accepted_tokens is copied from GPU to pinned CPU memory with non_blocking=True. The CPU-side numpy view is later read in _build_attention_metadata during the next execute_model call. With async scheduling, _bookkeeping_sync deliberately avoids any CUDA synchronization (the whole point of async scheduling), so there is no guarantee the DMA has landed before the CPU read. Signed-off-by: ppppeng <zepengliu912@qq.com>	2026-04-27 23:28:52 +08:00
wangbj127	9fd01a52c0	[v0.18.0][BugFix] Fix DSV3.1 W4A8 TTFT degradation (#8674 ) ### What this PR does / why we need it? Fix TTFT degradation on Deepseek-V3.1-W4A8. Revert change of `balance_flag` in https://github.com/vllm-project/vllm-ascend/pull/7611. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.18.0 Signed-off-by: Wangbingjie <wangbj1207@126.com>	2026-04-27 23:27:34 +08:00
ZT-AIA	0cc76860d5	[CI ][Misc] Add timeout check for custom op CI and optimize test parameters (#8755 ) ### What this PR does / why we need it? This PR introduces a mechanism to track test duration in `conftest.py` and skip subsequent tests in a file if a certain number of tests exceed a timeout threshold. This is intended to prevent CI hangs or long-running nightly tests. Additionally, it reduces the parameter space for `test_fused_qkvzba_split_reshape_cat.py` to further optimize CI runtime. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-27 21:48:54 +08:00
hucong	8a671a109c	[CI][Cherry-pick] Relax TTFT benefits threshold from 0.4 to 0.5 to account for DP load imbalance (#8684 ) Cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8683 ### What this PR does / why we need it? This PR relaxes the TTFT threshold from `0.4` to `0.5` to improve robustness under Data Parallel (DP) load imbalance. #### Background The current assertion enforces: prefix75 < prefix0 * 0.4 #### ❌ Nightly Failure Cases (Observed) \| prefix0 \| threshold (0.4x) \| prefix75 \| delta \| \|--------\|------------------\|----------\|--------\| \| 4696.24 \| 1878.50 \| 1883.99 \| +5.49 \| \| 4696.20 \| 1878.48 \| 1896.01 \| +17.53 \| \| 4636.73 \| 1854.69 \| 1902.48 \| +47.79 \| \| 4655.17 \| 1862.07 \| 1913.54 \| +51.47 \| \| 4685.35 \| 1874.14 \| 1919.36 \| +45.22 \| \| 4660.33 \| 1864.13 \| 1915.41 \| +51.28 \| \| 4648.30 \| 1859.32 \| 1950.50 \| +91.18 \| \| 4655.30 \| 1862.12 \| 1962.32 \| +100.20 \| --- #### ✅ Nightly Passing Cases (Observed) \| prefix0 \| threshold (0.4x) \| prefix75 \| margin \| \|--------\|------------------\|----------\|---------\| \| 4685.64 \| 1874.26 \| 1864.46 \| -9.80 \| \| 5520.28 \| 2208.11 \| 1928.97 \| -279.14 \| \| 4639.23 \| 1855.69 \| 1846.86 \| -8.83 \| \| 4651.64 \| 1860.66 \| 1854.30 \| -6.36 \| \| 4640.39 \| 1856.15 \| 1840.32 \| -15.83 \| \| 4677.20 \| 1870.88 \| 1848.35 \| -22.53 \| --- #### Key Observations - Failures exceed the threshold by only ~5 ms to ~100 ms (~0.3%–5%) - Passing cases often have very tight margins (~5–10 ms) - There is clear overlap between pass and fail boundaries - Many failures are borderline violations, not real regressions --- #### Root Cause The instability is caused by Data Parallel (DP) load imbalance, which introduces systematic variance: - Uneven request distribution across workers - Queueing delays - Increased TTFT variance (especially for `prefix75`) --- #### Conclusion - The current threshold (`0.4x`) is too strict - Observed natural fluctuation: - Absolute: up to 
~100 ms - Relative: up to ~5% over threshold - Pass/fail boundary is currently too sensitive to runtime jitter --- #### Change We relax the threshold: 0.4 → 0.5 This adjustment: - Accounts for expected runtime variance - Reduces false negatives - Maintains a meaningful performance constraint Even with `0.5`, the requirement remains strict (`prefix75 < 50% of prefix0`) and does not mask real regressions. --- ### Does this PR introduce _any_ user-facing change? No. This change only affects internal test assertions and does not impact user-facing behavior or model performance. --- ### How was this patch tested? - Verified against existing TTFT test cases: - Previously failing cases (due to small variance) now pass - No regressions observed in other scenarios - Confirmed that failures were due to DP load imbalance rather than actual performance degradation - Ensured the updated threshold still enforces a meaningful constraint on TTFT Signed-off-by: underfituu <hzhucong@163.com>	2026-04-27 16:38:07 +08:00
ZT-AIA	2cee0c32e5	[CI] Repair custom op nightly (#8707 ) ### What this PR does / why we need it? #### Fixed: 1. The function name in test_moe_init_routing_custom.py is incorrect; it is not named as a test case function starting with 'test'. 2.In Night ops singlecard_ops add the printing of timestamps for use cases, making it easier to quickly locate issues after a timeout occurs. #### To be repaired: 1. The test_penality.py test case partially fails. It takes one hour. The owner has been notified to fix the case after the 5.1 holiday. ——Yang Cheng 3. The csrc/copy_and_expand_eagle_inputs operator invoked by test_copy_and_expand_eagle_inputs.py supports only 910b.——HF001 4. The test_causal_conv1d.py test case is incorrect. The triton operator `causal_conv1d_fn` invoked by the test_causal_conv1d.py test case uses `get_forward_context`, but the operator case does not use `set_forward_context` (which is normal in the model). ——Zeng Tian 5. The test_causal_conv1d.py case is incorrect. In this scenario, uboverflow occurs when the triton invoked ——Zeng Tian ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-25 19:05:33 +08:00
tanhaoan333	963b82f42d	[Doc]Update Qwen3-Omni-30B-A3B-Thinking.md (#8669 ) ### What this PR does / why we need it? apt-get install ffmpeg -y -> apt-get install -y ffmpeg Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-04-25 14:25:46 +08:00
ZT-AIA	81d0a37bf5	[CI] repair ci custom op (#8571 ) ### What this PR does / why we need it? After he completes the subsequent repairs, it can be restored. For now, let's skip test_copy_and_expand_eagle_inputs ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-24 17:06:25 +08:00
Lucky1	bd3774d601	[Doc][Misc] Improve documentation quality by revising specific content. (#8603 ) ### What this PR does / why we need it? To improve the quality of certain docs by revising specific content. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.19.0 - vLLM main: `6f786f2c50` --------- Signed-off-by: Lucky1 <144669645+verylucky01@users.noreply.github.com>	2026-04-24 15:40:41 +08:00
csoulnd	97dbcaf919	[BugFix][310P][v0.18.0] Use CPU generator cache for sampling (#8624 ) ### What this PR does / why we need it? This PR introduces a caching mechanism for CPU-based `torch.Generator` objects in the `_random_sample_310p` function to optimize sampling performance. It includes unit tests for cache persistence and state recovery. Feedback highlights a critical bug where keying the cache by batch index instead of generator ID can break RNG reproducibility during request re-scheduling, and notes a potential memory leak in the global cache. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested via new unit tests in `tests/ut/_310p/sample/test_sampler_310.py` verifying cache logic and error handling. --------- Signed-off-by: csoulnd <daidaicurry@foxmail.com>	2026-04-24 09:34:14 +08:00
Shaoxu Cheng	00ddacf4e7	[Doc][0.18.0] Add the 310p guide (#8639 ) ### What this PR does / why we need it? Add a detailed 310 deployment tutorial. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-04-23 23:50:40 +08:00
Wangbei25	571edc58fa	[Doc]Update DeepSeekOCR2.md for releases/v0.18.0 (#8604 ) ### What this PR does / why we need it? Update DeepSeekOCR2.md for releases/v0.18.0 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: v0.18.0 vLLM main: `bcf2be9612` --------- Signed-off-by: Wangbei25 <wangbei41@huawie.com> Signed-off-by: Wangbei25 <wangbei41@huawei.com> Co-authored-by: Wangbei25 <wangbei41@huawie.com>	2026-04-23 23:48:03 +08:00
pppeng	696dcc9265	[Bugfix][0.18.0] fix kernels in sample when mask is not static or draft_token_id is invalid (#8531 ) ### What this PR does / why we need it? The triton kernels in sample encounter problems in the following scenarios: 1. 【expand_kernel/ rejection_random_sample_kernel/ prepare_inputs_padded_kernel】: these three kernels use ‘tl.load(prt + offsets -1, mask)’ in their implementations, but the triton compiler reports that the masks in these scenarios are not static and contiguous. As a result, the compiler first accesses the memory and only then applies the mask. Therefore, I modified the code to ‘tl.load(prt +tl.maximum(offsets - 1, 0), mask)’ so that the load offset is never negative. 2. 【sample_recovered_tokens_kernel/ rejection_random_sample_kernel】: these kernels use draft_token_id as an address offset for the load operation. In the PD separation scenario, if the pad token is -1, illegal memory reads and writes can occur. Therefore, I modified the kernels so that they handle a -1 token correctly. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: ppppeng <zepengliu912@qq.com> Co-authored-by: zepengliu912@qq.com <root@localhost.localdomain>	2026-04-23 23:04:19 +08:00
aipaes	45f75b4178	[Doc][v0.18.0]Fix minimax2.5 readme (#8638 ) ### What this PR does / why we need it? Correct the descriptive errors in the document. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc test --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-04-23 22:56:24 +08:00
wangxiaoteng888	c47371c1af	[BugFix]Backport validate pd mode feature gates no fused mc2 v0.18.0 clean (#8583 ) ### What this PR does / why we need it? Backport validate pd mode feature gates no fused mc2 v0.18.0 clean backport #8582 --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-04-23 19:44:07 +08:00
zzzzwwjj	1fc7bc056d	[0.18.0][Doc] Add NPU soft partitioning + cudagraph.piecewise limitation (#8595 ) ### What this PR does / why we need it? Added NPU soft partitioning + cudagraph.piecewise limitation in graph mode user guide doc. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: zzzzwwjj <1183291235@qq.com>	2026-04-23 19:09:55 +08:00
linfeng-yuan	5c048a9b71	[Doc][releases/v0.18.0] fix documentation error or non-standard description (#8626 ) ### What this PR does / why we need it? fix documentation error or non-standard description in releases/v0.18.0 branch ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation check. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-04-23 18:55:44 +08:00
sunshine202600	786eaf8b07	[Doc][Misc] Improve readability and fix typos in documentation (#8633 ) ### What this PR does / why we need it? This PR improves the readability of the documentation by fixing typos, correcting command extensions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. Signed-off-by: sunshine202600 <sunshine202600@163.com>	2026-04-23 18:45:17 +08:00
jack	d81101acdd	[releases/v0.18.0][Platform][BugFix] Guard forced tool choice with empty content (#8400 ) ### What this PR does / why we need it? This backports the forced-tool-choice `content=None` guard to the `releases/v0.18.0` compatibility layer. Upstream vLLM still has forced named tool-choice branches that assert `content is not None` after reasoning extraction. Some reasoning parsers can legally consume the full output and return `(reasoning, None)`, which makes the assert reachable and can surface as a server-side failure. This PR follows the same compatibility-patch pattern used by: - `7314bbe2` fix(platform): reimplement MiniMax usage accounting patch (#7835) - `f83cb0e6` [Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710) The patch is intentionally narrow: - normalize `content=None` to `""` only for forced named tool choice - patch both chat-completions and responses parser entry points - keep the rest of upstream behavior unchanged Upstream tracking: - issue: vllm-project/vllm#40147 - PR: vllm-project/vllm#40148 ### Does this PR introduce _any_ user-facing change? Yes. Forced named tool choice becomes robust when the reasoning parser returns no post-reasoning content, avoiding an internal assertion failure and emitting an empty-argument function call instead. ### How was this patch tested? Unit tests: ```bash pytest -sv tests/ut/patch/platform/test_patch_tool_choice_none_content.py \ tests/ut/patch/platform/test_patch_glm_tool_call_parser.py \ tests/ut/patch/platform/test_patch_minimax_usage_accounting.py ``` Result: 22 passed. --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-04-23 16:46:10 +08:00
herizhen	ff76c6780e	[releases/v0.18.0][Doc][Misc] Modifying Configuration Parameters (#8618 ) ### What this PR does / why we need it? This PR renames the environment variable VLLM_NIXL_ABORT_REQUEST_TIMEOUT to VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT to align with the Mooncake connector naming convention. It also updates the documentation and test configurations to reflect this change and adjusts the suggested timeout value in the documentation to 480 seconds for consistency. ### Does this PR introduce _any_ user-facing change? Yes. The environment variable for configuring the abort request timeout has been renamed. Users should update their environment settings from VLLM_NIXL_ABORT_REQUEST_TIMEOUT to VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT. ### How was this patch tested? The changes were verified by updating the corresponding test configuration files and ensuring consistency across the documentation. --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	2026-04-23 16:23:31 +08:00
Frank Chen	ce92be29d2	[Doc] Clarify irqbalance service management (#8614 ) ### What this PR does / why we need it? This PR clarifies the CPU binding documentation for managing the `irqbalance` service. The previous wording only mentioned Ubuntu while the command shown is specific to systemd-based Linux distributions. This update describes the command as applicable to Ubuntu and other systemd-based distributions, and adds a note for non-systemd systems to use the distribution-specific service-management command. ### Does this PR introduce _any_ user-facing change? No. This is a documentation-only update and does not change vLLM or vllm-ascend runtime behavior. ### How was this patch tested? Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-04-23 16:10:07 +08:00
Frank Chen	a4ba82e138	[BugFix] Require kv producer for layer sharding (#8563 ) ### What this PR does / why we need it? This PR introduces stricter Ascend `additional_config.layer_sharding` validation to the 0.18 release branch, so that the option is only accepted on PD-disaggregated P nodes with `kv_role="kv_producer"`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E test --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-04-23 16:06:53 +08:00
aipaes	4a254ba59a	[Doc] [v0.18.0]Fix glm4.7 readme v18 (#8460 ) ### What this PR does / why we need it? update GLM4.7 doc. Fix configuration issues, including:VLLM_ASCEND_ENABLE_FLASHCOMM1、VLLM_ASCEND_BALANCE_SCHEDULING、VLLM_NIXL_ABORT_REQUEST_TIMEOUT etc. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc test --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-04-23 14:42:28 +08:00
liziyu	58c87bd15b	[BugFix][0.18.0] Remove unused layers assignment in mooncake connector (#8602 ) ### What this PR does / why we need it? Remove unused layers assignment in mooncake connector ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? by nightly Signed-off-by: liziyu <liziyu16@huawei.com>	2026-04-23 14:08:53 +08:00
pz1116	30d08ced2d	[Doc][0.18.0] Fix kv pool CLI flag typo and formatting (#8608 ) ### What this PR does / why we need it? Fix kv pool CLI flag typo and formatting ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>	2026-04-23 12:07:47 +08:00
vllm-ascend-ci	0c458aa6dc	[v0.18.0][Doc] Translated Doc files 2026-04-22 (#8565 ) ## Auto-Translation Summary Translated 43 file(s): - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/disaggregated_prefill.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/eplb_swift_balancer.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/npugraph_ex.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/patch.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/quantization.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/multi_node_test.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_ais_bench.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_evalscope.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_lm_eval.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_opencompass.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/faqs.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/installation.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_colocated_mooncake_multi_instance.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po</code> - 
<code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.1.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM5.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/PaddleOCR-VL.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen-VL-Dense.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-235B-A22B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po</code> - 
<code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po</code> --- [Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24767290887) Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com> Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>	2026-04-23 11:06:05 +08:00
Yang Yuxi	9e31e4f234	[Doc]change format (#8592 ) ### What this PR does / why we need it? change --compilation_config to --compilation-config change --max-model-len 133008 to --max-model-len 131072 for matching 128k ### Does this PR introduce _any_ user-facing change? No Signed-off-by: Yang Yuxi <907276627@qq.com>	2026-04-23 10:46:09 +08:00
wangx700	4020b3df60	[BugFix] fix tl.extract_slice and tl.insert_slice. (#8567 ) ### What this PR does / why we need it? fix tl.extract_slice and tl.insert_slice to extract_slice and insert_slice from torch_utils ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? --------- Signed-off-by: wangx700 <wangxin700@huawei.com>	2026-04-23 09:50:29 +08:00
liziyu	c3b1d409a9	[BugFix] [P/D] [CherryPick] 8540 In scenarios where TP is not equal, the KV cache at the MTP layer is not handled. (#8541 ) ### What this PR does / why we need it? Fix the issue where the Mooncake connector does not handle the MTP layer KV cache when TP is unbalanced. backport: #8540 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? by nightly Signed-off-by: liziyu <liziyu16@huawei.com>	2026-04-23 09:16:37 +08:00
LQLlulu	fcf4d477a7	[BugFix][0.18.0]dispatch_ffn_combine kernal rollback combine 、unpermute part and scale part (#8534 ) cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8539 ### What this PR does / why we need it? Due to end-to-end testing , three optimization points for the decode scenario have been reverted in dispatch_ffn_combine kernel. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: l00893928 <liuquanlu@huawei.com> Co-authored-by: l00893928 <liuquanlu@huawei.com>	2026-04-22 23:27:02 +08:00
pz1116	69a57bc9ec	[Doc][0.18.0] Fix typo in KV Cache Pool developer guide (#8575 ) ### What this PR does / why we need it? Fix typo in KV Cache Pool developer guide Signed-off-by: pz1116 <47019764+Pz1116@users.noreply.github.com>	2026-04-22 17:59:16 +08:00
Li Wang	c335710b82	[Misc] Bump mooncake version to v0.3.9 (#8445 ) ### What this PR does / why we need it? This PR updates the `MOONCAKE_TAG` version from `v0.3.8.post1` to `v0.3.9` across all Dockerfiles. Signed-off-by: wangli <wangli858794774@gmail.com>	2026-04-22 15:41:06 +08:00
liziyu	9e15ce7074	[Bugfix] [P/D] Fix connector with pcp dcp (#8538 ) ### What this PR does / why we need it? Fix an issue where a request never returns because a specific NPU on node D has no transmission tasks when DCP is enabled on node D. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? by nightly Signed-off-by: liziyu <liziyu16@huawei.com>	2026-04-21 22:23:21 +08:00
starmountain1997	2cb9f76a0f	[CI] Ds32 ep aime2025 (#8496 ) Backport of #7882 to releases/v0.18.0. Adds aime2025 benchmark test for DeepSeek-V3.2-W8A8 EP with disaggregated prefill on A3 (4-node, 16 NPUs per node, accuracy benchmark baseline 66.67%). Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-04-21 19:49:06 +08:00
herizhen	862bd8143c	[Releases/v0.18.0][Doc][Misc] Added version description in the writing template (#8451 ) ### What this PR does / why we need it? This PR updates the model deployment tutorial template to include a requirement for authors to add a comment when code examples contain version numbers. This ensures that users are prompted to use the version appropriate for their specific environment. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A (Documentation change) --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	2026-04-21 16:33:43 +08:00
1kzk	7850264324	[v0.18.0][BugFix] PIECEWISE mode also requires synchronization (#8469 ) ### What this PR does / why we need it? This PR enables synchronization for the `PIECEWISE` runtime mode in ACL graph replay. Previously, synchronization was only performed in `FULL` mode. However, `PIECEWISE` mode also requires this barrier to ensure that parameter updates are completed before the graph is replayed, preventing accuracy loss. The logic is also corrected to skip synchronization specifically for EAGLE draft models, as intended. Fixes # ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed. --------- Signed-off-by: 1zzk <785396250@qq.com>	2026-04-21 16:22:32 +08:00
yangjiuhua	b717dc17a3	[v0.18.0][Test][Misc] Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch (#8322 ) ### What this PR does / why we need it? Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch. Test glm5-w4a8 on version 0.18.0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: yangjiuhua <y00845194@china.huawei.com> Co-authored-by: yangjiuhua <y00845194@china.huawei.com>	2026-04-21 14:10:11 +08:00
Li Wang	36a0470de1	[Doc] Upgrade env `VLLM_ASCEND_ENABLE_FUSED_MC2` used in nightly test and tutorials (#8441 ) ### What this PR does / why we need it? The env `VLLM_ASCEND_ENABLE_FUSED_MC2` should only be enabled on the decoder node in the Prefill-Decode Disaggregation scenario. --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-04-20 22:39:23 +08:00
ZT-AIA	3db5048d74	[CI]repair ci for custom op (#8455 ) ### What this PR does / why we need it? repair ci for custom op nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-20 17:51:37 +08:00
zyz111222	dd7e08c6db	[Performance] Use forward_native for Conv3dLayer and add UT (#8375 ) What this PR does / why we need it? switch Ascend conv3d forward_oot to use forward_native and add ut Does this PR introduce any user-facing change? No How was this patch tested? by CI --------- Signed-off-by: zouyizhou <zouyizhou@huawei.com>	2026-04-20 17:20:40 +08:00

2892 Commits