When using AscendScheduler with prefix-cache enabled and chunked-prefill
disabled, there is an accuracy problem because mla_v1 has no branch to
handle this scenario. This PR fixes it.
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Change as little existing code as possible to add support for the v1
pooling task. Note that I moved `vllm.v1.worker.gpu_input_batch` down
into vllm-ascend: considering the frequent changes in the upstream
interfaces, keeping a local copy decouples us from them.
### How was this patch tested?
CI passed with newly added and existing tests. A simple test, adapted
from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, was
also run locally first, as shown below:
```python
import os
import torch
from vllm import LLM
os.environ["VLLM_USE_MODELSCOPE"] = "True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
### What this PR does / why we need it?
Add docs for Pangu MoE Pro on the 300I series.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This fixes the shape of block_table, which was broken by the hybrid KV
groups change introduced several weeks ago.
An error is raised when prefix-cache (eager or not) and AscendScheduler
are enabled at the same time; sending two identical requests reproduces
it.
v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test manually
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Support Pangu MoE w8a8c8.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new added test.
Signed-off-by: zhuyilin <809721801@qq.com>
### What this PR does / why we need it?
This PR introduces an expert rearrangement algorithm for the PanguProMoE
model. Unlike the original grouped top-k, it filters for the top experts
that are allocated more tokens, so fewer experts need to be loaded when
computing the grouped matmul (gmm).
We have tested this algorithm for PanguProMoE-72B on the 300I Duo and
800I A2 platforms. On 300I Duo, we find that setting `num_voted_experts`
to 5 achieves both good performance and accuracy, while on 800I A2 we
keep it at 8 to use the original Pangu grouped top-k.
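For illustration only (this is not the PR's implementation), the core idea of keeping just the most heavily loaded experts could be sketched as:
```python
import torch

def select_busiest_experts(topk_ids: torch.Tensor,
                           num_experts: int,
                           num_voted_experts: int) -> torch.Tensor:
    # topk_ids: [num_tokens, k] expert indices chosen by the router.
    # Count how many tokens each expert receives and keep only the
    # num_voted_experts busiest ones, so fewer expert weights are loaded
    # for the grouped matmul.
    counts = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    return torch.topk(counts, num_voted_experts).indices
```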
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
In this PR, we support the H2P communication optimization when running
PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather`
in place of `all_reduce` to improve performance.
Original layer:
input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm
--> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce
Now:
input_layernorm --> tp all_gather --> attn --> tp reduce_scatter -->
post_attention_layernorm --> all_rank all_gather --> moe/mlp -->
all_rank reduce_scatter
Besides, because `reduce_scatter` requires the number of tokens to be
divisible by the group size, we need to pad the sequences based on
`max_tokens_across_dp`, as sketched below.
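A minimal sketch of the padding step, assuming illustrative names (`max_tokens_across_dp`, `group_size`) rather than the exact variables in this PR:
```python
import torch

def pad_for_reduce_scatter(hidden_states: torch.Tensor,
                           max_tokens_across_dp: int,
                           group_size: int) -> torch.Tensor:
    # Pad the token dimension up to the smallest multiple of group_size
    # that covers max_tokens_across_dp, so every rank contributes an
    # equally sized shard to reduce_scatter/all_gather.
    target = ((max_tokens_across_dp + group_size - 1) // group_size) * group_size
    num_tokens = hidden_states.shape[0]
    if num_tokens < target:
        pad = hidden_states.new_zeros(target - num_tokens, *hidden_states.shape[1:])
        hidden_states = torch.cat([hidden_states, pad], dim=0)
    return hidden_states
```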
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This PR has been tested with both offline and online inference using
PanguProMoE-72B.
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
This PR fixes a bug that occurs when broadcast is used with the
cpu_group while running DP. The `broadcast310p` patch takes effect for
both the cpu_group and the device group, but we only need it for the
device group. A wrapper is therefore added so that the cpu_group uses
native torch broadcast, which resolves the bug.
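A minimal sketch of the wrapper idea, assuming the patched 310P broadcast is available as a callable (`broadcast_310p` below is a placeholder name, not the actual patch symbol):
```python
import torch.distributed as dist

_native_broadcast = dist.broadcast  # keep a reference before patching

def broadcast_wrapper(tensor, src, group=None, async_op=False):
    # Only route NPU (device) groups through the 310P-specific broadcast;
    # CPU (gloo) groups keep torch's native behavior.
    if dist.get_backend(group) == "gloo":
        return _native_broadcast(tensor, src=src, group=group, async_op=async_op)
    return broadcast_310p(tensor, src=src, group=group, async_op=async_op)  # placeholder
```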
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
Fix version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers
4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages),
Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`
Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642
### Does this PR introduce _any_ user-facing change?
Fix broken build
### How was this patch tested?
CI passed with existing tests.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test
accuracy on V1.
### Does this PR introduce _any_ user-facing change?
Support prompt logprobs output.
### How was this patch tested?
CI passed with the accuracy test.
Tested with lm_eval, which uses prompt logprobs as output to measure
accuracy:
```bash
VLLM_USE_V1=1 lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
--tasks ceval-valid_computer_network \
--batch_size 8
```
After this PR, the accuracy test result of `Qwen/Qwen2.5-7B-Instruct`
on V1 is:
```bash
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network| 2|none | 0|acc |↑ |0.7368|± |0.1038|
| | |none | 0|acc_norm|↑ |0.7368|± |0.1038|
```
Closes: https://github.com/vllm-project/vllm-ascend/issues/1043
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
- Fix vLLM EPLB break
e9fd658a73
by recovering load_weights back to [v0.9.1
version](07b8fae219)
temporarily.
- Fix transformers>=4.53.0 image processor break
Related: https://github.com/vllm-project/vllm-ascend/issues/1470
- Mirror torch_npu requirements to pyproject.toml
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add guidance on how to implement and register new models.
Modified based on PR
https://github.com/vllm-project/vllm-ascend/pull/1126, thanks for the
contribution of @linfeng-yuan.
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Add the release checklist issue template.
Every release manager should create and follow the checklist to do the
release step by step.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Add a static build_info py file to show SoC and sleep mode info. It
helps keep the code clean, and the error info will be friendlier for
users.
This PR also adds unit tests for vllm_ascend/utils.py.
This PR also adds the base test class for all UTs in tests/ut/base.py.
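A rough idea of what such a generated module could look like (the file path, constant names, and values below are illustrative assumptions, not the actual vllm-ascend file):
```python
# vllm_ascend/_build_info.py -- hypothetical module generated at build time.
# Runtime code can import these constants to print friendly error messages
# instead of probing the environment.
SOC_VERSION = "ASCEND910B1"   # illustrative value, baked in by the build
SLEEP_MODE_ENABLED = True     # illustrative value, baked in by the build
```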
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Sometimes the performance benchmark workflow may fail. This change adds
a notice when a run fails and avoids uploading the dirty data from the
failed run.
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add UT for parallel_state.py.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
`python -m unittest test_parallel_state.py`
---------
Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
After #1094, decode might be executed in non-compiled mode regardless of
`torchair_graph_config.enabled`, causing multistream MLA to fail, since
it assumes torchair-compiled mode for decode whenever
`torchair_graph_config.enabled == True`.
Augment that assumption to fix this.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested both offline and via the graph-mode MLA e2e test case.
---------
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.
Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.
BTW, in the original vLLM implementation, masks are constructed
internally using other args, so these lingering values do not surface.
However, on the Ascend platform—where split-fuse attention requires
externally supplied masks—these residual indices become critical and
lead to this elusive, hard-to-reproduce failure.
The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.
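A minimal sketch of the idea, assuming a `positions` tensor and a `total_num_scheduled_tokens` count as described above (names are illustrative, not the exact variables in `NPUModelRunner`):
```python
import torch

def reset_unused_positions(positions: torch.Tensor,
                           total_num_scheduled_tokens: int) -> None:
    # Entries beyond the scheduled tokens are padding left over from a
    # previous batch; reset them to 0 so GatherV3 only ever sees indices
    # inside the attention mask.
    positions[total_num_scheduled_tokens:].zero_()
```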
Closes: https://github.com/vllm-project/vllm-ascend/issues/1038
### Does this PR introduce _any_ user-facing change?
No
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Use the fused op `torch_npu.npu_top_k_top_p(logits, p, k)` when `p` and
`k` are not None, otherwise fall back to the original implementation.
The replacement takes place automatically when
`VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`.
This patch uses `npu_top_k_top_p`, which requires
torch_npu>=2.5.1.post1.dev20250619.
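A minimal sketch of the dispatch logic, assuming an `original_top_k_top_p` fallback helper (hypothetical name) and the environment flag described above:
```python
import os
import torch

def apply_top_k_top_p(logits: torch.Tensor, p, k):
    # Use the fused NPU kernel only when the optimization is enabled and
    # both p and k are provided; otherwise keep the original sampling path.
    if (os.environ.get("VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE", "0") == "1"
            and p is not None and k is not None):
        import torch_npu
        return torch_npu.npu_top_k_top_p(logits, p, k)
    return original_top_k_top_p(logits, p, k)  # hypothetical fallback helper
```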
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested by DeepSeek R1 and UT passed
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
This PR aims to address a long-standing **CI bug** and remove unused
code. The specific changes include:
1. **Fixing CI Bug**: Resolves the root cause of CI test failures or
instability. This often stems from incorrect environment configurations,
dependency version conflicts, or flawed test script logic. This fix
ensures the reliability and consistency of the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer utilized by the project. This file was likely legacy
code, experimental code, or its functionality has since been replaced by
other modules. Its removal helps reduce codebase complexity, improves
maintainability, and prevents potential confusion.
### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.
### How was this patch tested?
CI passed. Specifically:
1. **Existing CI Pipelines Passed**: After fixing the CI bug, all
existing CI tests and pipelines were verified to run correctly and pass
successfully.
2. **Code Cleanup Verified**: Following the removal of `patch_eagle.py`,
it was ensured that any related functional modules (if applicable)
continue to work as expected, without introducing new regressions. This
was typically verified by running the project's main test suite.
Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
This PR cleans up unused code in the LLM setup to make the code clearer.
1. Remove unused `self.xxx` properties.
2. Change `set_random_seed` to `seed_everything`.
3. Remove `set_custom_all_reduce`; it is only used for CUDA.
This is purely a code cleanup; no code logic changes.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
1. Format the developer guide content to make it clearer.
2. Add the patch doc to the developer guide.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix DP.
This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added an
arg `max_num_tokens_across_dp` when dp_size > 1.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Use eager mode to run the disaggregated prefill CI.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing tests.
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
`stateless_init_dp_group` in vllm now works on non-CUDA platforms, so
this patch is no longer needed and is removed.
It was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2); the corresponding vLLM upstream change was merged in
3e472d882a
(v0.8.0).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Support fused_moe_allgather_ep.
### How was this patch tested?
It was tested by UT.
Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- Add system package installation
- Add doc for running doctests
- Clean up all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change the schedule job interval from 4 to 12 hours
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Modify installation.md to add the pip extra index for torch-npu.
### How was this patch tested?
No need
---------
Signed-off-by: Icey <1790571317@qq.com>
Add a new FAQ: if users re-install vllm-ascend with pip, the `build`
folder should be removed first.
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Fix the issue of insufficient cached cosine and sine length in MLA's
TorchAir graph mode, which causes accuracy deviation during
long-sequence inference.
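As background, a rotary cos/sin cache is typically precomputed up to a maximum length; below is a minimal, illustrative sketch (not the actual TorchAir fix) of sizing it to the full `max_seq_len` so long sequences never index past the cached table:
```python
import torch

def build_rope_cache(max_seq_len: int, rotary_dim: int, base: float = 10000.0):
    # Precompute cos/sin for every position up to max_seq_len so decode in
    # graph mode never reads beyond the cached length.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    t = torch.arange(max_seq_len, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)       # [max_seq_len, rotary_dim // 2]
    return freqs.cos(), freqs.sin()
```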
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark
serving and got a score of 83.33 on the AIME2024 dataset with the
DP4TP4EP16 setting.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) tried to
fix this problem, but unfortunately the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why wasn't the ngram
problem caught when PR1109 was merged? A: The newly introduced problem
only appears when tp>1, and the CI cases all use tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
keep CI time down, including the Eagle speculative UTs, which left CI
unable to cover the Eagle function. I added them back
(`test_eagle_correctness.py`) in this PR.
3. Because of the gap described in 2, the current version of Eagle has a
problem. I located and fixed it: vLLM's `draft_model_runner.py` had
changed and vllm-ascend was not synchronized in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found that
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vLLM, so I removed them in this PR.
### Does this PR introduce _any_ user-facing change?
This PR fixes the ngram and Eagle spec decode functions in the V0
engine.
### How was this patch tested?
Tested by CI.
Signed-off-by: mengwei805 <mengwei25@huawei.com>
### What this PR does / why we need it?
This PR updates torch_npu to the newest release version,
2.5.1.post1.dev20250619.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests will guarantee the update.
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Change `not` to `no` in faqs.md.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local Test
Signed-off-by: xleoken <xleoken@163.com>