xc-llm-ascend

Author	SHA1	Message	Date
Mengqing Cao	4ff422c730	[CI][Bugfix] Quickfix for DPMetaData (#3234 ) ### What this PR does / why we need it? Fix `dpmetadata` and `Qwen3MoeSparseMoeBlock` break introduced by `26a7a33b88 (diff-c1550d0a38469d039370567d8981969530cbfffc7302cd1778e7c2c8a9322dea)` NOTE: we maintain a different sp in vllm-ascend with vllm, thus we can just use `cu_tokens_across_sp(1)` as `cu_tokens_across_dp_cpu` close https://github.com/vllm-project/vllm-ascend/issues/3236, https://github.com/vllm-project/vllm-ascend/issues/3239 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-28 21:11:22 +08:00
fan2956	f2d8493221	[BugFix] Fix ascend scheduler assert error (#3191 ) ### What this PR does / why we need it? Running multimodal model with ascend scheduler may cause assert error 【assert (request.num_tokens - request.num_computed_tokens) == 1】 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `17b4c6685c` --------- Signed-off-by: fan2956 <zhoufan53@huawei.com>	2025-09-28 18:22:08 +08:00
Icey	68c5401ad6	[Eagle] Fix attn_mask index out of range in high concurrency situations (#3187 ) ### What this PR does / why we need it? - Fixes the bug that Multiple calls (maybe >100) to eagle3-qwen3-8b often incurs "attn_mask index out of range" error ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? ``` python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --served-model-name Eagle3 --port 8000 --model Qwen/Qwen3-8B --seed 42 -tp 1 --speculative_config '{"model": "Tengyunw/qwen3_8b_eagle3", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}' ``` Co-authored-by: liuruijin17 [ricklrj@outlook.com](mailto:ricklrj@outlook.com) - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Signed-off-by: Icey <1790571317@qq.com>	2025-09-28 18:09:26 +08:00
lilinsiman	1705501ae2	[BugFix] Fix ACLgraph bug in Qwen3_32b_int8 case (#3204 ) ### What this PR does / why we need it? 1. Solved the issue where sizes capture failed for the Qwen3-32b-int8 model when aclgraph, dp1, and tp4 were enabled. 2. Added the exception thrown when sizes capture fails and provided a solution 3. Add this common problem to the FAQ doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-09-28 17:44:04 +08:00
Zetong Li	a86ece5e39	[Bugfix][LoRA] Fix forward error and shape mismatch when using LoRA (#3153 ) ### What this PR does / why we need it? Relying on #3044, this PR aims to further fix: 1. The forward error that occurred when `LogitsProcessorWithLoRA` calls `AscendLogitsProcessor.forward`. Since `LogitsProcessorWithLoRA` bypasses the MRO to call it, `super().forward(...)` in `AscendLogitsProcessor.forward` will raise an error. This PR fixes it by directly invoking `LogitsProcessor.forward(self, ...)`; 2. The shape mismatch in `add_lora_logits` in punica_npu.py. The `lora_a_stacked` and `lora_b_stacked` are organized as [num_loras, 1, lora_rank, hidden_size] and [num_loras, 1, vocab_size, lora_rank] shapes respectively, but they were misunderstood in #1583: the last two dimensions were assumed to be in reverse order, which causes errors in `bgmv_shrink` and `bgmv_expand`. This PR fixes it by reverting it to the previous version to align with the implementation in punica_cpu.py in vllm. ### Dependencies This PR depends on changes introduced by #3044 (LoRA support for `AscendQKVParallelLinear` and `AscendMergedQKVParallelLinear` layers). ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? The LoRA-related tests, e.g., test_ilama_lora.py and test_ilama_lora_tp2.py, use ilama-3.2-1B, and this model is regarded as `TransformersForCausalLM`, whose `embedding_modules` attribute lacks `lm_head`. However, `LlamaForCausalLM` and most other models include both `embed_tokens` and `lm_head` in `embedding_modules`. This attribute contributes to `supported_lora_modules` when using LoRA in vllm. Therefore, without `lm_head` in `embedding_modules`, the current tests using ilama-3.2-1B are unable to find the above errors, since the replacement of `lm_head` by `LogitsProcessorWithLoRA` is skipped. Simply using Meta-Llama-3.1-8B-Instruct can reproduce the above errors and check whether these fixes work. In addition, more comprehensive tests for LoRA are needed. - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: Zetong Li <slippersss@126.com>	2025-09-28 17:30:50 +08:00
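The first fix in #3153 can be illustrated with a minimal plain-Python sketch. The class names below (`BaseProcessor`, `PlatformProcessor`, `LoRAWrapper`) are hypothetical stand-ins, not the vLLM or vllm-ascend classes. The sketch only shows why zero-argument `super()` fails when a method is invoked on an object that is not an instance of the defining class, and why calling the base method explicitly avoids that.

```python
class BaseProcessor:
    def forward(self, x):
        return x + 1


class PlatformProcessor(BaseProcessor):
    def forward_with_super(self, x):
        # Zero-argument super() requires `self` to be an instance of
        # PlatformProcessor (or a subclass of it).
        return super().forward(x) * 2

    def forward_explicit(self, x):
        # Calling the base method directly places no requirement on `self`.
        return BaseProcessor.forward(self, x) * 2


class LoRAWrapper:
    """Hypothetical wrapper that is NOT a subclass of PlatformProcessor."""

    def call_with_super(self, x):
        # Borrow the method while passing a foreign `self` (bypasses the MRO).
        return PlatformProcessor.forward_with_super(self, x)

    def call_explicit(self, x):
        return PlatformProcessor.forward_explicit(self, x)


def super_call_fails(wrapper, x):
    """Return True if the super()-based path raises TypeError."""
    try:
        wrapper.call_with_super(x)
    except TypeError:
        return True
    return False
```

With these definitions, `super_call_fails(LoRAWrapper(), 1)` is true, while `LoRAWrapper().call_explicit(1)` runs normally.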
Peipei	3d21ed9ee8	[Bugfix]Fix quant_config input parameter bug in qwenvl series (#3220 ) ### What this PR does / why we need it? Fix quant_config input parameter bug in qwenvl series. Currently, non-instantiated variables should be passed. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: booker123456 <945658361@qq.com>	2025-09-28 14:08:24 +08:00
Yikun Jiang	96089b5155	Add vLLM 0.11.0 release hourly job (#3215 ) ### What this PR does / why we need it? Add vLLM 0.11.0 release hourly job to monitor release branch changes ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-27 23:15:41 +08:00
Wang Kunpeng	859e861d92	[main][quantization] Support deepseek w4a8 per-channel quantization (#3011 ) ### What this PR does / why we need it? 1.Support deepseek w4a8 per-channel quantization 2.The eager mode supports converting weights to the NZ format ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? #### How to get weights using Modelslim ##### Installation steps git clone https://gitcode.com/Ascend/msit.git cd msit/msmodelslim bash install.sh ##### Generate w4a8 per-channel weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-09-27 21:01:16 +08:00
wangxiyuan	e9359bd8fa	[CI] Pin vLLM to releases/v0.11.0 (#3211 ) ### What this PR does / why we need it? - Pin vLLM commit to releases/v0.11.0 branch. - Fix the break change by vLLM commit `d4d9899860` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `17b4c6685c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-27 10:41:48 +08:00
yupeng	9caf6fbaf5	[Bugfix][LoRA] Fix LoRA bug after supporting Qwen3-Next (#3044 ) ### What this PR does / why we need it? LoRA e2e test uses ilama-3.2-1B model. It uses transformers.py model files. Its self-attention layer names end with "\.attn", not "\.self_attn". There are some other model attention layer names end with "*.attn", such as baichuan.py, bert.py. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_ilama_lora.py pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py - vLLM version: v0.10.2 - vLLM main: `17b4c6685c` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-09-26 11:12:45 +08:00
XiaoxinWang	8406aafaff	Add e2e test related to weight updates in RL scenarios. (#2954 ) ### What this PR does / why we need it? Add e2e test related to weight updates in RL scenarios. Due to CI issues, the newly added Python test files cannot locate the correct path. As a temporary solution, use absolute paths to add test cases. - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>	2025-09-26 11:07:10 +08:00
realliujiaxu	d8a9cb8458	[Bugfix] fix bug when tp=1 (#3193 ) ### What this PR does / why we need it? Addresses a bug in DenseOptimRowParallelOp that occurs when tensor parallelism is not used ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458`	2025-09-26 10:55:32 +08:00
zouyida2052	b72e3327a6	bugfix for mtp>1 (#3174 ) ### What this PR does / why we need it? fix bugs when mtp>1, and reorder input batch when mtp is not accepted. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? by ci - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-09-26 09:04:16 +08:00
无脸男	69509bcdd6	[bugfix] fix oom in aclgraph (#3158 ) ### What this PR does / why we need it? fix oom in aclgraph. 1. In the current token dispatch implementation, tensors are mounted on class instances to facilitate parameter passing between different methods. This approach prevents automatic recycling of these tensors. In some cases, it may lead to an out-of-memory error. To address this issue, we manually set these tensors to None to release the corresponding memory. 2. The `profile_run` method is designed to accurately estimate the maximum NPU memory usage during vLLM inference. However, in certain scenarios, MoE models perform inference via MC2, which includes communication and consumes additional NPU memory. This leads to inaccurate estimation by the profile run. We address this by actively triggering MC2 during the profile run for initialization. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Signed-off-by: WithHades <244036962@qq.com>	2025-09-26 08:57:47 +08:00
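The first point of #3158 (buffers kept alive because they are stored as instance attributes) can be sketched in plain Python. `Buffer` and `Dispatcher` are hypothetical stand-ins for a device tensor and the token dispatcher; a weak reference is used only to observe whether the buffer has been freed.

```python
import gc
import weakref


class Buffer:
    """Stand-in for a large device tensor."""

    def __init__(self, size):
        self.data = bytearray(size)


class Dispatcher:
    """Hypothetical dispatcher that passes state between methods via attributes."""

    def __init__(self):
        self.scratch = None

    def dispatch(self, size):
        # The buffer is attached to the instance so combine() can reach it.
        self.scratch = Buffer(size)
        # Return a weak reference so callers can check liveness.
        return weakref.ref(self.scratch)

    def combine(self, release=True):
        result = len(self.scratch.data)
        if release:
            # Drop the only strong reference so the memory can be reclaimed.
            self.scratch = None
        return result
```

As long as `scratch` stays set, the buffer lives as long as the dispatcher does; setting it to `None` after the last use lets it be reclaimed.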
Ronald	621aa7d270	fix error async_scheduler can't be enabled (#3127 ) ### What this PR does / why we need it? PR #2894 makes `ascend_scheduler_config.enabled` always `True` for non-MLA models. When `ascend_scheduler_config.enabled=True`, vllm-ascend always initializes `AscendScheduler`, which is a subclass of `Scheduler`. However, when async_scheduling is enabled, `AsyncScheduler` from vllm must be initialized instead, so async_scheduling can never be enabled. ### Does this PR introduce _any_ user-facing change? not-related ### How was this patch tested? When a user sets `async_scheduling`, it means the user doesn't want to use `AscendScheduler`, so we shouldn't set `ascend_scheduler_config.enabled = True` - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-09-26 08:51:54 +08:00
florenceCH	14497b748d	Remove qwen3 moe MC2 cumsum & cast (#3126 ) What this PR does / why we need it? The Qwen3 moe MC2 graph currently has two redundant computational operator implementations. After npu_moe_distribute_dispatch_v2, the cumsum and cast operations have been added. By using expert_token_nums_type=0 and not converting weight_scale to float32, these two operators can be eliminated, thereby improving inference performance. Does this PR introduce any user-facing change? No How was this patch tested? No need - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: florenceCH <gaoxiang120@huawei.com> Co-authored-by: florenceCH <gaoxiang120@huawei.com>	2025-09-26 08:51:30 +08:00
wangxiyuan	2930e4a6bd	[CI] Upgrade vllm to newest commit (#3182 ) ### What this PR does / why we need it? Upgrade vLLM to newest commit - Fix the aclgraph doesn't work problem, caused by `24fab45d96` - Fix PoolerOutput import error, caused by `755ed7b05b` - Fix the aclgraph weight load error to keep the same with torchair fix. `4492e3a554` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All test should pass - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-26 06:18:15 +08:00
wangxiyuan	0794f64a18	Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist (#3194 ) …to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)" ### What this PR does / why we need it? This reverts commit `6995a7bc5b`. We'll add it back once the issue is fixed. related issue: https://github.com/vllm-project/vllm-ascend/issues/3195 ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458`	2025-09-26 06:17:36 +08:00
Peipei	31dda3f557	[Model]Add support for qwen3_vl and qwen3_vl_moe (#3103 ) ### What this PR does / why we need it? This PR is for the adaptation and optimization of qwen3_vl and qwen3_vl_moe on the Ascend platform. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `b1068903fd` --------- Signed-off-by: booker123456 <945658361@qq.com>	2025-09-25 18:50:12 +08:00
wangxiyuan	f7a3815bff	[CI] Do not drop ready label when PR is merge conflict (#3173 ) ### What this PR does / why we need it? `ready` label now is used for trigger full e2e test now. If a PR is ready and merge conflict then, no need to drop the ready label. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just a github action change. No need for function test. - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-25 18:45:19 +08:00
offline893	5d13bbe796	[BugFix]Modify eplb feature guide. (#3183 ) ### What this PR does / why we need it? Revise the EPLB feature guide content. Add EPLB params to the ascend config. ### Does this PR introduce any user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Co-authored-by: offline0806 <3337230449@qq.com>	2025-09-25 17:01:51 +08:00
MengLong Chen	07f4710216	[BugFix] Fix dummy_run memory explosion in eager mode (#3132 ) ### What this PR does / why we need it? It is a quick bugfix for the memory explosion issue that requires further refactoring. The dummy_run in eager mode may lead to OOM and the reason is that `hidden_states` were not released in time. The PR temporarily resolves the issue by manually clearing the cache, and further refactoring will be conducted subsequently. Before the modification, the dummy_run's memory showed an accumulation issue. <img width="1796" height="207" alt="image" src="https://github.com/user-attachments/assets/05e2b04c-2f99-4085-9eda-c78b7d9a57b0" /> After modification, it can be observed that the memory is released promptly. And it was verified that the model responded normally after a single data input. - vLLM version: v0.10.2 - vLLM main: `b1068903fd` --------- Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2025-09-25 16:09:44 +08:00
leo-pony	72f64c10b7	[bugFix] Correct the vllm interface e2e test Base container image name (#3179 ) ### What this PR does / why we need it? Correct the vllm interface e2e test Base container image name ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Tests in vllm ci pipeline - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-09-25 16:03:09 +08:00
Icey	2a9d02e080	[Bugfix] eagle and eagle3 spec decode failures and enable e2e test (#2979 ) ### What this PR does / why we need it? - Fix the bug https://github.com/vllm-project/vllm-ascend/issues/2978 - Enable e2e test, - Adapt to scenarios where Speculative tokens are greater than 2, - Fix the bug that causes Eagle3 inference failures under high concurrency and improve the acceptance rate of draft models, by https://github.com/vllm-project/vllm-ascend/pull/2794 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with new added/existing test. Co-authored-by: hukongyi [hukongyi@cmbchina.com](mailto:hukongyi@cmbchina.com) Co-authored-by: guanyuzhu [zhuguanyu@huawei.com](mailto:zhuguanyu@huawei.com) Co-authored-by: liumail680 [liumail680@163.com](mailto:liumail680@163.com) - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-25 14:39:12 +08:00
wangxiyuan	ac1c2cd9ac	[CI] Upgrade vllm version - 0925 (#3167 ) Upgrade vLLM to newest commit. 1. Remove the useless func get_state_cls, it has been removed from vLLM already. `e6750d0b18` 2. Fix ut broken by `6160ba4151` - vLLM version: v0.10.2 - vLLM main: `b1068903fd` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-25 14:20:10 +08:00
mfyCn-1204	33c118c80e	[core]vllm-ascend support msMonitor tool (#3123 ) ### What this PR does / why we need it? vllm-ascend supports the [msMonitor](https://gitcode.com/Ascend/mstt/tree/master/msmonitor) tool to collect performance data of vllm-ascend ### Does this PR introduce _any_ user-facing change? 1. Add the env variable MSMONITOR_USE_DAEMON; 2. Users can enable the msMonitor tool by setting MSMONITOR_USE_DAEMON=1 before running a vllm-ascend model; 3. MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both be set ### How was this patch tested? 1. Run a vllm-ascend model with MSMONITOR_USE_DAEMON unset or set to 0: the model runs successfully; 2. Run a vllm-ascend model with MSMONITOR_USE_DAEMON=1, then run the msMonitor tool to collect profile data; 3. Run a vllm-ascend model with both MSMONITOR_USE_DAEMON=1 and VLLM_TORCH_PROFILER_DIR set: an error is raised - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: mei-feiyao <1332490378@qq.com>	2025-09-25 14:15:02 +08:00
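The mutual-exclusion rule described in #3123 can be sketched as a small validation function. The function name `resolve_profiling_mode`, the returned mode strings, and the exact parsing of the variables are assumptions for illustration; only the two environment variable names and the "cannot both be set" rule come from the commit message.

```python
def resolve_profiling_mode(env):
    """Pick a profiling mode from an environment mapping (hypothetical helper)."""
    # Assumption: "1" enables the daemon; unset or "0" disables it.
    use_daemon = env.get("MSMONITOR_USE_DAEMON", "0") == "1"
    profiler_dir = env.get("VLLM_TORCH_PROFILER_DIR")

    if use_daemon and profiler_dir:
        # The two collection paths are declared mutually exclusive.
        raise ValueError(
            "MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both be set"
        )
    if use_daemon:
        return "msmonitor"
    if profiler_dir:
        return "torch_profiler"
    return "none"
```

In practice the mapping would be `os.environ`; taking it as a parameter keeps the check easy to test.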
whx	c814b32b90	[Quant][GLM] Adapt glm quant. (#3147 ) adapt glm quant - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-25 11:13:29 +08:00
wangxiyuan	a055183821	[CI] Upgrade vLLM version (#3139 ) Upgrade vLLM version to the newest commit. - Fix the break change introduced by `969b4da3a6` - Add a patch to quick fix torchair `de94289a98` - fix the ut error introduced by `de94289a98` Close: https://github.com/vllm-project/vllm-ascend/issues/3138 - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-25 07:36:51 +08:00
liziyu	464270e4ca	Remove useless PD check in deepseek (#3161 ) ### What this PR does / why we need it? Remove useless PD check in deepseek ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-09-24 23:25:47 +08:00
zzhxxx	4ee58e213b	[BugFix] explicitly setting the tensor shape of otp output (#3027 ) When MTP and oprojTP are enabled, it triggers the recompilation of the torchair graph, leading to a decrease in performance, and this PR fixes this issue. - vLLM version: v0.10.2 - vLLM main: `486c5599e3` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2025-09-24 18:44:15 +08:00
leo-pony	360a736dfa	Add OOT platform E2E test case to be run in the vllm buildkite pipeline (#3154 ) ### What this PR does / why we need it? Add OOT platform E2E test case to be run in the vllm buildkite pipeline. Note: added test case is not run in vllm-ascend CI. ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-09-24 17:55:58 +08:00
clrs97	cd1ffbb6cd	[1/N][Feat] Cut down memory usage for o_proj in DeepSeek (#2931 ) ### What this PR does / why we need it? To cut down the memory usage of large weight matrices, we often rely on various linear operations: - `ReplicatedLinear`: Stores the entire matrix, consuming excessive memory. - `RowParallelLinear`: Requires an `all_reduce` to merge answer, introducing additional communication overhead and potential accuracy loss. Each token is handled across multiple devices rather than a single device, which is undesirable in SP scenario. - ... Furthermore, in multi-way Data Parallelism (DP) configurations, layers typically store redundant weight copies. This PR introduces a shared-weight plugin for layers inheriting from `LinearBase`. It offers the following advantages: - It evenly distributes a set of layers with identical structures across devices. Each layer retains its complete weights, eliminating redundant memory usage. - It supports asynchronous broadcasting to prefetch weights for upcoming layers. - It preserves the custom `process_weights_after_loading()` method to make keeping NZ format possible. - It is compatible with any linear class that inherits from `LinearBase`, thereby preserving all the features of the original linear implementation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM main: `f4a948f33f` - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: clrs97 <524936896@qq.com> Co-authored-by: CalvinXKY <kyxiezju@163.com>	2025-09-24 17:16:41 +08:00
Clorist33	302494c1fe	[EPLB] ut for EPLB (#3035 ) ## UT for EPLB Co-authored-by Skywalker-EP 173723846@qq.com Co-authored-by offline 0806@qq.com Co-authored-by dsxsteven@sina.com ## UT Description ### 1. Module Description - Module: EPLB ### 2. Covered Source Files - vllm_ascend/eplb/adaptor/abstract_adaptor.py - vllm_ascend/eplb/core/eplb_device_transfer_loader.py - vllm_ascend/eplb/core/eplb_utils.py - vllm_ascend/eplb/core/policy/policy_abstract.py - vllm_ascend/eplb/core/policy/policy_dynamic_ep.py - vllm_ascend/eplb/core/policy/policy_dynamic_ep_v2.py - vllm_ascend/eplb/core/policy/policy_factory.py ### 3. Testing Method - Framework: pytest - Test Data: mock data - Test Type: unit test ### 4. Coverage - Statement Coverage: 90% - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Signed-off-by: tanqingshan <50050625@china.huawei.com> Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: tanqingshan (A) <t50050625@china.huawei.com> Co-authored-by: tanqingshan <50050625@china.huawei.com> Co-authored-by: daishixun <dsxsteven@sina.com> Co-authored-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2025-09-24 17:14:38 +08:00
Csrayz	80524f5711	[CORE] concurrent partial prefills (#2372 ) # What this PR does / why we need it? When processing a mix of large and small requests, the TTFT of responses is significantly reduced. Please refer to https://github.com/vllm-project/vllm/pull/10235, which achieves the same effect by simply limiting the number of prompt fills for long requests. This solution can be applied to both AscendScheduler (V0) and vLLM Scheduler (V1). Tests show that TTFT can be significantly improved when handling such mixed requests. However, this capability is currently missing when Ascend Scheduler is enabled. This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card. Regarding dataset selection, the sharegpt_clean dataset is used, with its content concatenated and cropped. Small requests with token=50 and medium requests with token=10240 were constructed (there were also large requests with token=102400, but these were ignored because when using the Prefill First scheduling strategy, max_num_batched_tokens will not be set to such a large value). When loading vLLM, set max_num_batched_tokens=22000. This length can accommodate two medium-sized requests and some short requests, reflecting an extreme scenario where the budget is almost entirely occupied by longer requests. Next, we mix 990 small requests and 100 medium requests into one type of load scenario (hereinafter referred to as 10%), and similarly generate load scenarios with 5% and 1% medium requests. Performance tests were conducted separately for enabling vLLMScheduler, AscendScheduler, and AscendScheduler (long prompt concurrency set to 1). - vLLM version: v0.10.2 - vLLM main: `1dfea5f4a9` --------- Signed-off-by: Csrayz <jover@cmbchina.com>	2025-09-24 17:12:55 +08:00
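The idea in #2372 of capping how many long prompts are prefilled concurrently can be sketched as a single scheduling step. This is a simplified illustration, not the AscendScheduler code: it schedules whole requests only (no chunking), and the names `schedule_step`, `long_threshold`, and `max_long_prefills` are hypothetical.

```python
def schedule_step(waiting, token_budget, long_threshold, max_long_prefills):
    """Select requests for one batch.

    waiting: list of (req_id, num_prompt_tokens) in arrival order.
    Returns (scheduled_ids, deferred_requests).
    """
    scheduled = []
    deferred = []
    long_count = 0
    remaining = token_budget

    for req_id, num_tokens in waiting:
        is_long = num_tokens > long_threshold
        if is_long and long_count >= max_long_prefills:
            # Cap reached: leave room in the budget for short requests.
            deferred.append((req_id, num_tokens))
            continue
        if num_tokens > remaining:
            # Does not fit in what is left of the token budget.
            deferred.append((req_id, num_tokens))
            continue
        scheduled.append(req_id)
        remaining -= num_tokens
        if is_long:
            long_count += 1

    return scheduled, deferred
```

Using the request sizes from the benchmark description (10240-token medium requests, 50-token small requests, a 22000-token budget) and an assumed threshold of 1024 tokens, a cap of 1 admits one medium request plus the small ones and defers the second medium request to a later step.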
Mengqing Cao	2d885869c5	[KVCache][Bugfix] Fix kv cache initialization error of attention layer (#3113 ) ### What this PR does / why we need it? Fixes #3096 1. Fix kv cache initialization error of attention layer. There are some models with layer names like `attn.attn` instead of `self_attn`, but the initialization of kv cache tensors only checks for `self_attn` and `attn.attn`, which leads to the error `AssertionError: Some layers are not correctly initialized` 2. Set the default value of input arg `sampling_metadata` in `compute_logits` for the modeling files in vllm-ascend, thus fixing the error `Qwen3NextForCausalLM.compute_logits() missing 1 required positional argument: 'sampling_metadata'` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? test locally with internlm - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-24 11:32:34 +08:00
weijinqian0	6aa4253798	[Refactor] [SP]The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085 ) What this PR does / why we need it? There are two sets of SP implementations for MoE and dense models. One is called sequence_parallelism, and the other is flashcomm_v1. We did the following: 1. Merge the two sets of code with the same implementation into one. 2. Remove the implementation of sequence_parallelism, as this solution cannot support aclgraph. Does this PR introduce any user-facing change? No How was this patch tested? e2e&ut - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-09-24 11:29:59 +08:00
Icey	e7618d9414	[2/N][Refactor][Qwen3-Next] remove redundant methods and patch methods in Qwen3NextGatedDeltaNet (#3082 ) ### What this PR does / why we need it? remove redundant methods and patch methods in Qwen3NextGatedDeltaNet involved causal_conv1d_fn, causal_conv1d_update_npu, fused_gdn_gating, fused_reccrrent_gated_delta_rule, torch_chunk_gated_delta_rule, RMSNormGated ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? ``` def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, enforce_eager=True, trust_remote_code=True, max_model_len=256, gpu_memory_utilization=0.7, block_size=64, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` CI passed with new added/existing test. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-24 11:25:42 +08:00
baxingpiaochong	eb205d9f35	[P/D][BugFix]Mooncake timeout release bug fix (#2899 ) ### What this PR does / why we need it? In the P node timeout release mechanism during PD separation, the req_id that requires timeout release is transmitted from the scheduler to the worker. If the KV cache between PDs is transferred too quickly, the P node's req_id may be released twice. The first release is when the D node notifies the P node that the KV cache has been pulled, and the second release is when the scheduler transmits the timeout release to the worker. To address this bug, an intermediate component is introduced to manage the release of req_ids. Pull kv and forward2 may occur one after the other in timing. The previous timeout defaulted to forward2 being before pull_kv. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-09-24 11:22:46 +08:00
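The double-release problem described in #2899 comes down to making the release of a `req_id` idempotent across two paths (the pull-completed notification and the timeout). A minimal sketch of such an intermediate component follows; the class name `ReleaseTracker` and its methods are hypothetical and are not taken from the Mooncake connector.

```python
import threading


class ReleaseTracker:
    """Hypothetical component that ensures each request is released at most once."""

    def __init__(self):
        self._pending = set()
        # The two release paths may run on different threads.
        self._lock = threading.Lock()

    def register(self, req_id):
        """Record that req_id holds resources awaiting release."""
        with self._lock:
            self._pending.add(req_id)

    def release(self, req_id):
        """Return True only for the first release of a registered request."""
        with self._lock:
            if req_id in self._pending:
                self._pending.remove(req_id)
                return True
            return False
```

Whichever path calls `release` first gets `True` and frees the KV cache blocks; the later call gets `False` and does nothing.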
Song Zhixin	6995a7bc5b	[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788 ) ### What this PR does / why we need it? When we copy the sampled valid token ids from device to host, avoid using tolist which would trigger a CUDA wise stream sync if the source is on device. We change it to use non-blocking copy followed by an explicit CUDA event sync. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Bring up vLLM server ```bash VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-log-requests -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000 ``` ## Before: ![76218085a0cde9b2a73214e35fb7fc08](https://github.com/user-attachments/assets/38cbd02d-d380-47f8-a111-4bd859102eb1) ## After ![6c2111136673332244d3ce11060f4048](https://github.com/user-attachments/assets/957f9bf1-ec50-4f49-9318-f4876b3e3691) As shown in the figure, the TTFT decreased - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: jesse <szxfml@gmail.com>	2025-09-24 11:21:58 +08:00
Peipei	c4b976af1a	[Model][VLM][Patch]Modify ascend affinity _merge_multimodal_embeddings (#3071 ) ### What this PR does / why we need it? This PR aims to address the incompatibility of the `.masked_scatter_` operation in the current `_merge_multimodal_embeddings` function on Ascend. For now, it reverts to the previous version of the CPU operation, which can be executed asynchronously on the device side to enhance performance. - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: booker123456 <945658361@qq.com>	2025-09-24 10:25:28 +08:00
weiguihua2	b1380f3b87	[Doc] modify the version compatibility between vllm and vllm-ascend (#3130 ) ### What this PR does / why we need it? modify the version compatibility between vllm and vllm-ascend, the main branch of vllm-ascend corresponds to the v0.10.2 tag of vllm. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-09-23 20:31:49 +08:00
linfeng-yuan	d01fd1d1c3	[misc][torchair] fix bugs around `deepseek mtp`, `enable_shared_expert_dp` and `use_cached_kv_cache_bytes` (#3074 ) ### What this PR does / why we need it? This miscellaneous PR contains several small fixes: 1) fix initialization and forward bugs of DeepseekMTPLayer with `shared_expert_dp` enabled. 2) fix a tensor shape mismatch after o_proj caused by a workaround change in NPUModelRunner. 3) avoid an unnecessary decline of kv_cache memory (default: 64MB) with `use_cached_kv_cache_bytes` disabled. 4) fall back `fused_moe_state` from `MC2` to `All2All` since the padding logic of `mc2_mask` is incompatible with input hidden_states when `shared_expert_dp` is enabled. Once this PR is merged, users can launch disaggregated_prefill deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as on the `v0.9.1-dev` branch. The remaining problem of kv_cache tokens decline compared to `v0.9.1-dev` will be resolved by https://github.com/vllm-project/vllm-ascend/pull/3073. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving about deepseek_mtp with torchair graph mode and `enable_shared_expert_dp` with eager mode. Large ep deployments are also tested with this PR. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-23 14:52:42 +08:00
lidenghui1110	0f3939e5a9	[Feature]cpu offload connector (#1659 ) This PR implements cpu offload connector to enable NPU kv cache offload to host DRAM. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: lidenghui <lidenghui1110@gmail.com> Signed-off-by: AlvisGong <gwly0401@163.com> Signed-off-by: CalvinXKY <kyxiezju@163.com> Co-authored-by: AlvisGong <gwly0401@163.com>	2025-09-23 14:25:05 +08:00
Li Wang	96eb1ed408	[CI] Bump vLLM commit hash to 0923(f225ea7) (#3110 ) ### What this PR does / why we need it? Bump vLLM commit hash to `f225ea7dd9` ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-23 14:13:25 +08:00
Jianwei Mao	d586255678	fix wrong --num-gpus parameter requirements, and avoid ambiguity (#3116 ) fix the problem of https://github.com/vllm-project/vllm-ascend/issues/3114 - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: Jianwei Mao <maojianwei2012@126.com>	2025-09-23 11:58:44 +08:00
Yizhou	39a85c49fa	[Refactor] Rename cudagraph_support to aclgraph_support (#3104 ) ### What this PR does / why we need it? Updates the `cudagraph_support` attribute to `aclgraph_support` to use terminology appropriate for the Ascend platform (ACL graphs instead of CUDA graphs). This change also explicitly disables graph support for the MLA attention backend. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-23 11:30:31 +08:00
wyu0-0	d2399ab97b	Fix VLLM_ASCEND_LLMDD_RPC_PORT renaming (#3108 ) ### What this PR does / why we need it? This PR implements the renaming of the environment variable VLLM_LLMDD_RPC_PORT to VLLM_ASCEND_LLMDD_RPC_PORT, as proposed and tracked in [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450). The renaming is intended to align the variable naming convention with other Ascend-specific environment variables in the vllm-ascend codebase, enhancing consistency and clarity for developers and users working with Ascend-based deployments. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` Signed-off-by: wyu0-0 <woshilynn@163.com>	2025-09-23 10:33:04 +08:00
Mercykid-bash	29c173ab48	FlashLB algorithm (#3042 ) ## Purpose This Pull Request enhances the EPLB (Expert Parallelism Load Balancing) system by introducing a novel balancing algorithm: FlashLB. ## Motivation 1. The default algorithm adopts a two-stage greedy strategy: a. Replica allotment: Determine the number of expert replicas by minimizing the maximum load per replica (Min Max Replica, MMR). b. Replica placement: Distribute replicas across devices by repeatedly assigning the heaviest replica to the least loaded device (Longest Processing Time First, LPT). However, this sequential process lacks inter-stage collaborative optimization, often leading to suboptimal load balancing. For example, in the simple case shown in the figure below: given 8 logical experts with hotness values of 600, 560, 120, 120, 20, 10, 10, 10, and 2 replicas allocated per device across 8 devices, the EPLB algorithm yields a maximum per-device hotness of 232, while our proposed FlashLB algorithm can reduce this value to 205. 2. The default algorithm relies on the averaged expert hotness over a fixed time window for optimization. While this provides a coarse approximation of the hotness distribution, it fails to capture oscillatory deviations and temporal correlations of expert hotness observed across iterations in real-world scenarios, limiting optimization quality. 3. The default algorithm periodically regenerates the expert placement table. However, it generates the table for each individual layer, and the new table does not account for correlations with the previous one; these two factors collectively lead to nearly full-scale expert reassignment. ## FlashLB Algorithm Principle 1. Joint Optimization FlashLB achieves joint optimization of replica allotment and placement through group-based decision-making. 
Each group gradually determines the replica count and placement for a subset of experts, ensuring that the expected inter-device load balance (considering both deployed and pending expert replicas) is holistically optimized. To attain superior load balancing, FlashLB employs tree search to expand the solution space while integrating pruning and precompilation techniques for acceleration, thereby delivering load balancing that is both high-quality and practically efficient. 2. Multi-Shot Enhancement FlashLB partitions each profiling interval (e.g., 1024 iterations) into consecutive smaller sub-intervals (e.g., 16 iterations), each capturing independent hotness measurements. It then performs multi-shot optimization to co-optimize these sub-intervals simultaneously, enabling adaptation to time-variant expert hotness while enhancing robustness. 3. Incremental Adjustment To reduce the overhead of frequent expert re-deployment, FlashLB introduces an incremental adjustment scheme operating at both inter-layer and intra-layer levels: a. Inter-Layer: Hotness variations are tracked at the layer level. Only layers with fluctuations exceeding a predefined threshold trigger re-computation of expert placement, avoiding unnecessary redeployment for stable layers; b. Intra-Layer (Optional): A lightweight incremental LPT algorithm (LPT-Incremental) is applied. Instead of recomputing full placement for all experts in a layer, it selectively adjusts only the hottest experts or those with replica count changes, further reducing migration overhead. This incremental strategy significantly reduces adjustment costs while maintaining balanced performance across layers and devices. 
## Co-author: Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: sdmyzlp <lrwei2@petalmail.com> Signed-off-by: Che Ruan <cr623@ic.ac.uk> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com> Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Lucas Kabela <lucaskabela@meta.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: tangtianyi <tangtianyi4@huawei.com> Signed-off-by: Angazenn <supperccell@163.com> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: rjg-lyh <1318825571@qq.com> Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Signed-off-by: fems14 <1804143737@qq.com> Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com> Co-authored-by: Che Ruan <cr623@ic.ac.uk> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: zhanghw0354 <zhanghaiwencmss@139.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com> Co-authored-by: Lucas Kabela <lucasakabela@gmail.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Icey <1790571317@qq.com> 
Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: Angazenn <supperccell@163.com> Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: rjg-lyh <83491835+rjg-lyh@users.noreply.github.com> Co-authored-by: weichen <132029610+Pr0Wh1teGivee@users.noreply.github.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com> Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>	2025-09-23 10:27:14 +08:00
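
The default two-stage strategy described in the FlashLB commit message above (replica allotment by Min Max Replica, then placement by Longest Processing Time First) can be sketched as follows. This is a minimal illustration under stated assumptions, not the vllm-ascend implementation: the function names `allot_replicas` and `place_replicas` are hypothetical, and the sketch ignores constraints a real balancer may apply, such as keeping replicas of one expert on different devices. With the hotness values quoted in the commit message it arrives at the same maximum per-device hotness of 232.

```python
import heapq


def allot_replicas(hotness, total_replicas):
    """MMR stage: start with one replica per expert, then repeatedly add a
    replica to the expert whose per-replica load is currently the highest."""
    counts = [1] * len(hotness)
    # heapq is a min-heap, so loads are negated to pop the maximum first.
    heap = [(-h, i) for i, h in enumerate(hotness)]
    heapq.heapify(heap)
    for _ in range(total_replicas - len(hotness)):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-hotness[i] / counts[i], i))
    return counts


def place_replicas(hotness, counts, num_devices, slots_per_device):
    """LPT stage: assign the heaviest remaining replica to the least loaded
    device that still has a free slot. Assumes enough slots for all replicas."""
    replicas = sorted(
        (hotness[i] / counts[i] for i in range(len(hotness)) for _ in range(counts[i])),
        reverse=True,
    )
    loads = [0.0] * num_devices
    free = [slots_per_device] * num_devices
    for load in replicas:
        dev = min((d for d in range(num_devices) if free[d] > 0), key=lambda d: loads[d])
        loads[dev] += load
        free[dev] -= 1
    return loads


# Example from the commit message: 8 experts, 8 devices, 2 replicas per device.
hotness = [600, 560, 120, 120, 20, 10, 10, 10]
counts = allot_replicas(hotness, total_replicas=16)
loads = place_replicas(hotness, counts, num_devices=8, slots_per_device=2)
print(counts, max(loads))  # [5, 5, 1, 1, 1, 1, 1, 1] 232.0
```

Because the replica counts are fixed before placement starts, the second stage cannot revisit them, which is the lack of inter-stage coordination the commit message gives as motivation for FlashLB.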
hucong	8dd53c8860	[Bugfix][PD] Auto-clear producer KV cache if no pull notification (#2174 ) ### What this PR does / why we need it? This PR addresses a critical issue where Node D (decode) failures cause Node P (prefill) to hang because it cannot release its KV cache. Trigger Scenarios: 1. Node D fails mid-inference (e.g., network disconnection) 2. Node D rejects requests at a certain stage (e.g., via API server) 3. Load-test script termination causes Node P or D to abort queued requests Root Cause Analysis: 1. Currently, Node D sends a "KV cache pull complete, release approved" message to Node P 2. This message is transmitted via the worker connector. If the PD connection breaks or requests are rejected upstream, Node D cannot send the message 3. Node P will never release the KV cache without receiving this message Solution: Following the vLLM community's approach (NIXL connector timeout mechanism), we're implementing: - A timeout mechanism with comprehensive warnings - Updated README documentation - Reference: vLLM's optimization PR [#20139](https://github.com/vllm-project/vllm/pull/20139) ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? None - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: underfituu <hzhucong@163.com>	2025-09-23 09:53:34 +08:00
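
The timeout idea in the commit message above can be pictured with a small sketch. It is not the code from this PR: the class `ProducerKVCacheTracker`, its method names, and the injectable `clock` are hypothetical and chosen only for illustration. The producer side (Node P) records a deadline for each finished request and frees the KV cache either when the pull notification arrives or once the deadline has passed.

```python
import time


class ProducerKVCacheTracker:
    """Hypothetical tracker for KV cache blocks awaiting a pull notification."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the behavior can be tested
        self.deadlines = {}  # request_id -> time after which the cache is freed

    def on_request_finished(self, request_id):
        # Keep the KV cache alive until the consumer pulls it or the timeout hits.
        self.deadlines[request_id] = self.clock() + self.timeout_s

    def on_pull_notification(self, request_id):
        # Normal path: the consumer confirmed the pull, so release immediately.
        # Returns False if the request was already released (e.g. by timeout).
        return self.deadlines.pop(request_id, None) is not None

    def collect_expired(self):
        # Fallback path: release requests whose notification never arrived.
        now = self.clock()
        expired = [rid for rid, deadline in self.deadlines.items() if deadline <= now]
        for rid in expired:
            del self.deadlines[rid]
        return expired
```

A scheduler loop could call `collect_expired()` periodically and log a warning for each returned request before freeing its blocks.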
yupeng	704467cd9a	[Bugfix][LoRA] Fix bug introduced by upstream vllm#25249 (#3095 ) ### What this PR does / why we need it? Fix the impact to LoRA that https://github.com/vllm-project/vllm/pull/25249 brought. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_ilama_lora.py pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-09-22 22:26:01 +08:00

1240 Commits