xc-llm-ascend

Author	SHA1	Message	Date
Shanshan Shen	70f076331f	[MM][Bugfix] Add error log for VL models when enabling FLASHCOMM (#4222 ) ### What this PR does / why we need it? Add error log for VL models when enabling `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` or `VLLM_ASCEND_ENABLE_FLASHCOMM=1` (for backward compatibility). This is a temporary fix for https://github.com/vllm-project/vllm-ascend/issues/4132. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-21 15:04:35 +08:00
Angazenn	9c6d0b422c	[v0.11.0-dev][misc]change default capture size for Qwen3-MoE when using full dp (#4205 ) ### What this PR does / why we need it? This dev version of #4199 . Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]`. However, this is not always the best choice on different situations. This PR aims to change the default setting when running Qwen3-MoE on full dp (`dp_size > 1` && `tp_size == 1`) setting, which is usually applied in Large-Scale EP. old : `[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]` new: `[1, 2, 5 ,10 ,15, 16 ,24 ,... , max_capture_size]` This is mainly because the performance of `_npu_paged_attention` op degrades dramatically on old settings. We hope to provide better performance if users do not set specific `cudagraph_capture_size`. ### Does this PR introduce _any_ user-facing change? The default `cudagraph_capture_size` is modified in above cases. However, if `cudagraph_capture_size` has already set by users, this PR won't have any influence on this. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-11-21 11:19:11 +08:00
rjg-lyh	ebd45b6596	[V0.11.0][Core] Restore scheduling logic under default configuration (#4094 ) ### What this PR does / why we need it? Cherry-pick #3967 from main branch. This PR reverts the changes introduced in PR #2894 Initially, due to performance issues with the older version of the chunked prefill ops, the default behavior was to use the Ascend scheduler to disable the chunked prefill feature. However, with the improvements in the performance of the new chunked prefill ops, this interception strategy has been removed. This change also aligns with the community's default configuration behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-11-10 20:02:23 +08:00
zouyida2052	90aca84e60	fix bug when max_seqs=14 in mtp=2 scenario and raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len (#3909 ) ### What this PR does / why we need it? 1. Revert [bugfix for mtp in fullgraph](`0948483642`) and support it when vllm supports 2. raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len 3. bugfix when max_num_seqs=14 in mtp=2 scenario --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-31 09:25:06 +08:00
zouyida2052	d9249c968e	bugfix for mtp in fullgraph (#3878 ) ### What this PR does / why we need it? bugfix for mtp in fullgraph ### Does this PR introduce _any_ user-facing change? no --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-29 23:52:20 +08:00
linfeng-yuan	4c9af353ee	Revert "[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 )" (#3586 ) ### What this PR does / why we need it? This reverts commit `bf87606932`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with `enable_shared_expert_dp: true` in eager mode as before. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-21 22:24:30 +08:00
Yizhou	ef3fabf399	[Chore] Prevents use of ASCEND_LAUNCH_BLOCKING with ACL Graph (#3574 ) ### What this PR does / why we need it? Adds a validation check to prevent running with an incompatible configuration. The `ASCEND_LAUNCH_BLOCKING=1` environment variable, used for debugging, enforces synchronous execution which is incompatible with ACL Graph. This change raises an explicit error to inform the user about the conflict and how to resolve it, preventing a more obscure failure later. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-21 20:17:33 +08:00
Shirley125	b4233a2ec3	[Bugfix] Route requests requiring KVC recomputation from the decode instance to the P instance (#3448 ) ### What this PR does / why we need it? This PR is aimed to fix the recomputing out of memory bug in decode instance. When recomputing happens in decode, kv cache usage may exceed the pre-allocated memory, and it will cause OOM. So we propose a new scheduling strategy, when decode instance cannot allocate new block for running requests, we will stop the request that will be preempted. These stopped request will be recognied by proxy, and they will be send to prefill instance again to calculate kvc and then direct to decode instance. This is a temporary plan to fix the bug. The long-term stratege is to use CPU offload in decode instance. ### Does this PR introduce _any_ user-facing change? An extra ascend configuration option -- recompute_scheduler_enable = True is added to enable this strategy. The default value is False ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>	2025-10-18 15:56:44 +08:00
zhaozx-cn	bf87606932	[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495 ) ### What this PR does / why we need it? shared expert dp for deepseek and deepseek_mtp, could be combined with sp to improve performance. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhaozx-cn <zhaozx2116@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-10-17 15:06:37 +08:00
realliujiaxu	f69a83b7ba	[Feat] Flash comm allgher ep (#3334 ) Support flash comm v1(Sequence Parallelism) for Allgather EP. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-10-15 19:36:32 +08:00
Mengqing Cao	8abe517870	[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432 ) ### What this PR does / why we need it? Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch. The final goal is to remove all the patches and align the code arch to vllm, thus we need to do the following work in next prs. TODO: - [x] remove patch on attention spec - [ ] refactor the kvcache creation logic ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? 1. CI passed with existing test. 2. Test pass with deepseek-v3.2-exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-15 17:48:58 +08:00
xuyexiong	02c26dcfc7	[Feat] Supports Aclgraph for bge-m3 (#3171 ) ### What this PR does / why we need it? [Feat] Supports Aclgraph for bge-m3 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` pytest -s tests/e2e/singlecard/test_embedding.py pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py ``` to start an online server with bs 10, each batch's seq length=8192, we set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked: ``` vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}' ``` For bs10, each batch's seq length=8192, QPS is improved from 85 to 104, which is a 22% improvement, lots of host bound is reduced. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com> Co-authored-by: wangyongjun <1104133197@qq.com>	2025-10-14 23:07:45 +08:00
panchao-hub	1756efa5fd	[Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125 ) ### What this PR does / why we need it? Adds support for capturing the Multi-Layer Attention (MLA) decode operation into an ACL graph. This improves performance by compiling the attention kernel for single-token decoding. Key changes include: - Implementing the graph capture logic for the MLA kernel, including workspace management and parameter updates. - Modifying the rotary embedding (RoPE) handling to use pre-allocated tensors, which is a requirement for graph capture. - Adding a `build_for_graph_capture` method to the MLA metadata builder to create dummy metadata during the graph compilation phase. Known issues: - Currently, MTP is not supported in FULL_DECEDE_ONLY mode -- we're working on a fix - We are preparing to remove update_mla_attn_params with auto_dispatch_capture ### Does this PR introduce _any_ user-facing change? compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: panchao-hub <315134829@qq.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-10 16:31:20 +08:00
wangxiyuan	f12f76d7ba	Drop 0.10.2 (#3284 ) Drop v0.10.2 support, we support vLLM 0.11.0rc3 now. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-09 10:28:38 +08:00
wangxiyuan	00ba071022	[Doc] Release note for v0.11.0rc0 (#3224 ) ### What this PR does / why we need it? Add release note for v0.11.0rc0 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-30 03:26:18 +08:00
wangxiyuan	81bd6e4c99	Add DeepSeek V3.2 support (#3270 ) ### What this PR does / why we need it? This PR added the initial DeepSeek V3.2 support with [vLLM v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0) (not released yet). We will complete vLLM adaptation as soon as possible. This feature will be ready in recent 1-2 days. Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 . ### Does this PR introduce _any_ user-facing change? Yes! ### How was this patch tested? CI passed and Run deepseek doc soon. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wxsIcey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-30 03:25:58 +08:00
Mengqing Cao	050d202bb9	[Quickfix] Fix dp+ep+tp error when sp chunked the hidden_states (#3246 ) ### What this PR does / why we need it? Fix dp+ep+tp inplace copy error when sp chunked the `hidden_states`. ### How was this patch tested? test locally with the following scripts ```bash python examples/offline_data_parallel.py \ --model="Qwen/Qwen3-30B-A3B" \ --dp-size=2 \ --tp-size=2 \ --enable-expert-parallel ``` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-29 09:12:49 +08:00
Ronald	621aa7d270	fix error async_scheduler can't be enabled (#3127 ) ### What this PR does / why we need it? PR #2894 make ascend_scheduler_config.enabled always be `True` for non-mla models，when `ascend_scheduler_config.enabled=True `, it will always initialize `AscendScheduler` which is a subclass of `Scheduler`, but when we enbale async_scheduling,we need to initialize `AsyncScheduler` in vllm, this will make async_scheduling can't be enabled. ### Does this PR introduce _any_ user-facing change? not-related ### How was this patch tested? when user set `async_scheduling`, it means user don't want to use `AscendScheduler`, so we shouldn't set `ascend_scheduler_config.enabled = True` - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-09-26 08:51:54 +08:00
wangxiyuan	2930e4a6bd	[CI] Upgrade vllm to newest commit (#3182 ) ### What this PR does / why we need it? Upgrade vLLM to newest commit - Fix the aclgraph doesn't work problem, caused by `24fab45d96` - Fix PoolerOutput import error, caused by `755ed7b05b` - Fix the aclgraph weight load error to keep the same with torchair fix. `4492e3a554` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All test should pass - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-26 06:18:15 +08:00
weijinqian0	6aa4253798	[Refactor] [SP]The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085 ) What this PR does / why we need it? there are two sets of sp implementations for moe and dense models. One is called sequence_parallelism, and the other is flashcomm_v1. We did the following things： Merge two sets of code with the same implementation into one. Remove the implementation of sequence_parallelism, as this solution cannot support aclgraph. Does this PR introduce any user-facing change? No How was this patch tested? e2e&ut - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-09-24 11:29:59 +08:00
Li Wang	02f89d166f	[CI] Update vllm version to 20250922(5aeb925) (#3091 ) ### What this PR does / why we need it? This pr bump vllm commit hash to `5aeb925452` fix issues: 1. https://github.com/vllm-project/vllm/pull/25345 has remove v0 metadata 2. https://github.com/vllm-project/vllm/pull/25332 3. https://github.com/vllm-project/vllm/pull/25334 4. https://github.com/vllm-project/vllm/pull/23558, note that this vllm commit update the model register logic, which will check all the model registered have the `vllm.model_executor.models` path , which breaks our custom registration of the deepseek_v3 model (it doesn't exist in the vllm model path). so I move deepseek_v3 model registy to deepseek_v2 to solve temporary ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-22 22:18:13 +08:00
Yizhou	338231acaf	[Feat][Graph] Support `FULL_DECODE_ONLY` mode for GQA/MHA models (#2128 ) Note: This depends on [vLLM #25161](https://github.com/vllm-project/vllm/pull/25161) and the torch\_npu release from September 30. ### What this PR does / why we need it? This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include: * Reduced dispatch latency: By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays. * Stabilized multi-device performance: Captureing the whole model as one static graph also mitigates the dispatch fluctuations across devices. * Stream/resource savings: Consolidating graph captures frees up streams, allowing more graphs to be captured. Known issues: 1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay — we’re working on a fix. There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups. This is essentially a port of #1503 and #1677, but includes two major changes: 1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo. 2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet. ### Does this PR introduce _any_ user-facing change? ```python compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ``` ### How was this patch tested? Tests included. - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-22 17:14:28 +08:00
Li Wang	12bcbd02bb	[CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907 ) ### What this PR does / why we need it? 1. This pr bump vllm commit to `6d8246aaff` 2. fix upstream changes https://github.com/vllm-project/vllm/pull/24548 abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable 3. fix metadata_builder changes introduced by https://github.com/vllm-project/vllm/pull/23693 4. fix `structured_outputs_config` changes introduced by https://github.com/vllm-project/vllm/pull/22772 5. fix `moe_config` changes introduced by https://github.com/vllm-project/vllm/pull/22537 Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> - vLLM version: v0.10.2 - vLLM main: `c60e6137f0` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-20 17:37:57 +08:00
Mengqing Cao	367edff5af	[HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007 ) ### What this PR does / why we need it? This pr fixes a few issues on prefill disaggregation: 1. Fix prefill disaggregation kvcache addr alignment issue, llmdatadist needs the addr of tensors to be aligned with 2M 2. Fix prefill disaggregation kvcache shape error, llmdatadist requires k/v tensors with shape [num_blocks, ...], however the implentment before this pr is [2, num_blocks, ...], which will break prefill disaggregation 3. Use hybrid kv cache only when running qwen3_next to fix accuracy issue on prefill disaggregation. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Tested locally by @liziyu179 - vLLM version: v0.10.2 - vLLM main: `4f02b77de4` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-18 21:43:22 +08:00
yiz-liu	88ca8a051c	[Feat][Graph] Support DeepSeek with ACL Graph (#2707 ) ### What this PR does / why we need it? In memory of #677 , a long overdue milestone. Now DeepSeek V3/R1 should be OK with ACL Graph. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Working on it. - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 17:50:17 +08:00
linfeng-yuan	1c5900327b	[refactor] refactor deepseek-related files (#2849 ) ### What this PR does / why we need it? This PR deletes ~2K lines of code about deepseek modeling. It falls back CustomDeepseekV2 modules to original vllm implementations and adapts some modifications in vllm about deepseek and moe. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with torchair graph mode and eager mode. - vLLM version: v0.10.2 - vLLM main: `759ef49b15` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 14:13:07 +08:00
wangxiyuan	c556038ef0	[New model] Qwen3-next support (#2917 ) ### What this PR does / why we need it? Add Qwen3-next support. ### Does this PR introduce _any_ user-facing change? Yes, users can use Qwen3 next. Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 the tutorial will be ready in [here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) ### How was this patch tested? Doc CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/2884 Co-Authored-By: Angazenn <supperccell@163.com> Co-Authored-By: zzzzwwjj <1183291235@qq.com> Co-Authored-By: MengqingCao <cmq0113@163.com> Co-Authored-By: linfeng-yuan <1102311262@qq.com> Co-Authored-By: hust17yixuan <303660421@qq.com> Co-Authored-By: SunnyLee219 <3294305115@qq.com> Co-Authored-By: maoxx241 <maoxx241@umn.edu> - vLLM version: v0.10.2 - vLLM main: `b834b4cbf1` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Angazenn <supperccell@163.com> Signed-off-by: Your Name <you@example.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: hust17yixuan <303660421@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Angazenn <supperccell@163.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: hust17yixuan <303660421@qq.com>	2025-09-16 01:17:42 +08:00
rjg-lyh	585a494baa	[Core] Disable the chunked prefill feature in Non-MLA LLMs (#2894 ) ### What this PR does / why we need it? This PR enforces the forcible disabling of the chunked prefill feature in Non-MLA models, as the performance of operators supporting this functionality is currently suboptimal. Unless the user has enabled chunked prefill in the ascend_scheduler_config, we would allow this feature. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Related: https://github.com/vllm-project/vllm-ascend/pull/2659 - vLLM version: main - vLLM main: `d21a36f5f9` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-09-12 23:17:09 +08:00
wangxiyuan	7d6d9449a8	[Misc] Move lora patch file into lora module (#2797 ) Cleanup useless file in patch module. Update the lora support list is OK in vLLM Ascend, no need to patch vLLM - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-08 21:42:12 +08:00
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces Oproj matrix tensor model parallel to achieve decreasing of memory consumption. It only support graph mode in pure DP scenario. In deepseek r1 w8a8 PD disagregated Decode instance, using pure DP, with oproj_tensor_parallel_size = 8, we have 1 ms TPOT increasing, saved 5.8 GB NPU memory per RANK. We got best performance when oproj_tensor_parallel_size=4 without TPOT increasing. performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| default value is None, once this value is set, the feature will be enabled, head num * head dim must be divisible by this value. \| example `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
无脸男	0c0789be74	[Feat] allow using aclgraph in ray backend (#2589 ) ### What this PR does / why we need it? Allow using aclgraph in ray backend, for tp + pp + aclgraph in multi machine ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `4ba0c587ba` Signed-off-by: withHades <244036962@qq.com>	2025-09-04 11:45:56 +08:00
linfeng-yuan	90a75a90a9	[bugfix] fix torchair runtime error caused by configuration mismtaches and file missing (#2532 ) ### What this PR does / why we need it? This PR ports #2312 #2506 #2531 to main branch. Original implementation of torchair caching forces users to make everything prepared, fix all the configuration and enable `use_cached_npu_graph`, and it might cause some problems confusing to understand and tackle for users. It is better to compile the graph twice instead of reusing the old kvcaches and cached torchair graph. And the extra duration time is acceptable. Additionally, this pr fixes a recompilation problem of torchair graph mode caused by `running_in_graph` variable in `AscendMLATorchairImpl`. ### Does this PR introduce _any_ user-facing change? If users want to enabling torchair.cache_compile with high compilation speed, it is recommended to enable both `use_cached_kv_cache_bytes` and `use_cached_graph` in `torchair_graph_config`. Without `use_cached_kv_cache_bytes`, we'll compile torchair computation graph twice to avoid runtime error caused by configuration mismtaches (the second compilation will be much faster). Additionally, we've made a change to how the TORCHAIR_CACHE_HOME enviroment variable is utilized to enhance safety and prevent accidental file deletion by adding a suffix directory. ### How was this patch tested? CI and e2e vllm serving pass. - vLLM version: v0.10.1.1 - vLLM main: `70549c1245` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-03 17:56:12 +08:00
Mengqing Cao	6c973361fc	[Bugfix] Fix aclgraph not enabled by default (#2590 ) ### What this PR does / why we need it? As vllm will set `cudagraph_mode` to `NONE` before `check_and_update_config` in post init of `VllmConfig` (`5da4f5d857/vllm/config/__init__.py (L3630)`), we always have `cudagraph_mode` isn't `None`, thus we must remove this check and add it when the related adaption in vllm is done. part of https://github.com/vllm-project/vllm-ascend/pull/2577, will add the e2e test on applying reply after the CI refactor is done ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `f48a9af892` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-28 14:08:31 +08:00
Mengqing Cao	a9e78a3299	[Aclgraph] Update compilation config in `check_and_update_config` (#2540 ) ### What this PR does / why we need it? This pr updates compilation config in `check_and_update_config`, we use `compilation_config.level` to update `compilation_config.cudagraph_mode` to ensure the config is correct. Add `compilation_config.cudagraph_num_of_warmups = 1` when V1 is enabled, cause this is also used in torchair graph mode. and this fixes https://github.com/vllm-project/vllm-ascend/issues/2523 fix the bug that the `aclgraphmode` always be `NONE` while running forward in aclgraph mode ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `f58675bfb3` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-27 09:30:25 +08:00
zhanghw0354	b3fdd78a6b	[Main][Refactor]Change ASCEND_QUATIZATION_METHOD to ASCEND_QUANTIZATION_METHOD (#2517 ) ### What this PR does / why we need it? The constant ASCEND_QUATIZATION_METHOD in vllm_ascend/utils.py is misspelled and should be corrected to ASCEND_QUANTIZATION_METHOD. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `c9abb10489` Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>	2025-08-26 09:06:16 +08:00
linfeng-yuan	0ca3f48c90	[2/N][refactor] torchair deepseek mla backend refactor (#2459 ) ### What this PR does / why we need it? This PR move current unified mla backend to torchair folder and remove torchair-related code in attention/mla_v1.py (1.3k -> 0.9k). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Running eager mode with mla backend, and torchair mode with code before [2445](https://github.com/vllm-project/vllm-ascend/pull/2445) - vLLM version: v0.10.0 - vLLM main: `f571ff8eb6` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-21 14:02:30 +08:00
Mengqing Cao	1327f9be1c	Fix some ci issue and refactor modelrunner (#2445 ) ### What this PR does / why we need it? Fix some ci issue and refactor modelrunner ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `4d9c61993a` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-20 09:01:04 +08:00
Shanshan Shen	83e0f41408	[3/N][Refactor] Move `torchair_attention` to `torchair` dir (#2017 ) ### What this PR does / why we need it? 1. Move `torchair_attention` to `torchair` dir. 2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend` to reduce duplicate methods. 3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce duplicate properties. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `0933f9d518` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-19 10:25:22 +08:00
Pleaplusone	3f4a358b14	[Bugfix] Fix custom op register issue (#2409 ) ### What this PR does / why we need it? Our current code register the custom ops inside the platform intialization phase. however, when a new process started by creating a worker, the former patch will lose it effect on the custom ops and lead to fallback to the native pass wrote in vllm. This PR move the patch code to the worker to make sure the custom op patch worker as our expected. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `8ea0c2753a` Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-19 09:09:43 +08:00
Shanshan Shen	103654ccd6	[Misc] Remove redundant imported `envs`, using `envs_ascend` instead (#2193 ) ### What this PR does / why we need it? Remove redundant imported `envs`, using `envs_ascend` instead. ```python import vllm.envs as envs_vllm import vllm_ascend.envs as envs_ascend ``` - vLLM version: v0.10.0 - vLLM main: `71683ca6f6` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-14 09:33:39 +08:00
yiz-liu	992271b027	[1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic (#2125 ) ### What this PR does / why we need it? This PR refactors the MoE (Mixture of Experts) communication logic by introducing a strategy pattern. It defines an abstract base class, `MoECommMethod`, which encapsulates different communication strategies for MoE layers. By decoupling the MoE implementation from any single communication method, this change makes it simpler to add, replace, or optimize communication strategies in the future. Plan / Roadmap 1. Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL Graph handling to cover all scenarios (this PR). 2. Implement `MC2CommImpl` and `AllToAllCommImpl` to optimize performance in specific scenarios. 3. Enable W8A8 / Int8 models to use `unified_fused_experts`. Other notes * Data-parallel (DP) communication currently does not work with vLLM's dispatch/combine mechanisms; an alternative approach is required to resolve this incompatibility. - vLLM version: v0.10.0 - vLLM main: `f7ad6a1eb3` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-12 21:10:20 +08:00
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1, and cleanup infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837), this patch depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ install depending: ``` pip3 install pyyaml pip3 install setuptools ``` install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
22dimensions	9e65da990e	[Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132 ) ### What this PR does / why we need it? cherry-pick #1501 from 0.9.1-dev to main Currently, Ray is not compatible with ACL Graph, so we need to fall back to eager mode when using the Ray backend. co-authored: Yizhou Liu <liu_yizhou@outlook.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:06:09 +08:00
wangxiyuan	af56ae3ed1	[1/4][Refactor] Refactor torchair worker (#1885 ) There is a lot torchair specified logic in common code. It results hard code maintenance. We will create a new torchair module to launch torchair related logic there. I plan to add 4 PR. 1. Refactor worker (this PR) - create torchair module and move torchair related code in worker to the new module 3. Refactor utils 4. Refactor model_runner 5. Refactor attention - vLLM version: v0.9.2 - vLLM main: `8188196a1c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-21 11:50:46 +08:00
Mengqing Cao	8cfd257992	[Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681 ) ### What this PR does / why we need it? Remove ETP/EP maintained in branch main. We drop this as there is no relevant scenarios to use ETP now, and we may subsequently advocate implementing expert tensor parallelism in vLLM to support scenarios where the expert is needed to be sliced This is a part of #1422 backport. Fixes https://github.com/vllm-project/vllm-ascend/issues/1396 https://github.com/vllm-project/vllm-ascend/issues/1154 ### Does this PR introduce _any_ user-facing change? We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in vllm instead. ### How was this patch tested? CI passed with new added and existing test. - vLLM version: v0.9.2 - vLLM main: `fe8a2c544a` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-21 09:08:04 +08:00
Mengqing Cao	574fe407eb	[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841 ) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-18 23:07:14 +08:00
Shanshan Shen	f96100fad5	[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805 ) ### What this PR does / why we need it? Remove V0 related codes of test, example, platform. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `235bfd5dfe` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-15 19:58:55 +08:00
wangxiyuan	7bdada58eb	[Misc] Remove VLLM_USE_V1 usage in code (#1764 ) We plan to remove V0 code from this version. The first step is to delete v0 usage. Related: https://github.com/vllm-project/vllm-ascend/issues/1620 - vLLM version: v0.9.2 - vLLM main: `61e20828da` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 11:52:16 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allows deepseek to use torchair. Now we can also use it with pangu. Besides, we add a support model list to control which type of models that can use torchair. ### How was this patch tested? We have test it with PanguProMoE on both 800IA2 and 300I Duo platforms, and model generates answer normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00

1 2 3

105 Commits