xc-llm-ascend

Author	SHA1	Message	Date
zhangxinyuehfad	e26fe1caf1	[TEST] Speed up DS V2 accuracy test and turn up accuracy baseline (#3047 ) ### What this PR does / why we need it? 1. update expected accuracy for DeepSeek-V2-Lite 2. add batch size ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Accuracy CI passed - vLLM version: v0.10.2 - vLLM main: `838d7116ba` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-09-20 00:40:33 +08:00
zhangxinyuehfad	a22b532d38	[Fixbug] Fix shape not match when sliding_window and dynamic batch_size (#2830 ) ### What this PR does / why we need it? Fix the shape mismatch seen when testing the accuracy of LLM-Research/Phi-4-mini-instruct. ### Does this PR introduce _any_ user-facing change? Before this fix, users could not set a dynamic batch_size or use lm_eval to test accuracy with sliding_window models. ### How was this patch tested? The accuracy of LLM-Research/Phi-4-mini-instruct is OK: ``` vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto \|Tasks\|Version\| Filter \|n-shot\| Metric \| \|Value \| \|Stderr\| \|-----\|------:\|----------------\|-----:\|-----------\|---\|-----:\|---\|-----:\| \|gsm8k\| 3\|flexible-extract\| 5\|exact_match\|↑ \|0.8105\|± \|0.0108\| \| \| \|strict-match \| 5\|exact_match\|↑ \|0.8097\|± \|0.0108\| ``` - vLLM version: v0.10.2 - vLLM main: `3c96e7b8a1` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-09-19 22:35:14 +08:00
zhanghw0354	cf549b976d	[Test]Add unit test for compilation/acl_graph.py (#3039 ) ### What this PR does / why we need it? According to issue [#1298 ](https://github.com/vllm-project/vllm-ascend/issues/1298) ,this pull request adds unit test code for compilation/acl_graph.py. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.2 - vLLM main: `f2718d2948` --------- Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>	2025-09-19 21:31:17 +08:00
22dimensions	0942d9aaab	[3/N][Refactor][Quantization]remove packed_modules_mapping from models (#3021 ) ### What this PR does / why we need it? Some custom models in vllm-ascend define packed_modules_mapping, which prevents them from sharing the same model class as the vLLM community. This PR moves these custom packed_modules_mapping definitions to quant utils.py. After this PR, some custom models can be removed. ### Does this PR introduce _any_ user-facing change? tested by CI ### How was this patch tested? tested by CI - vLLM version: v0.10.2 - vLLM main: `5089fd749c` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-09-19 20:50:14 +08:00
Yikun Jiang	4ba56716f9	Increase doctest timeout to 300s and time print (#3041 ) ### What this PR does / why we need it? Increase doctest timeout to 300s and time print, according to time print in https://github.com/vllm-project/vllm-ascend/pull/3045 , most of time consumed in `Graph capturing`, so I think it's fine to increase doctest timeout This PR also add time log for each task. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh` - CI passed - vLLM version: v0.10.2 - vLLM main: `a684c0124c` Closes: https://github.com/vllm-project/vllm-ascend/issues/3045 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-19 20:26:00 +08:00
Shanshan Shen	8326f15ecf	[CustomOp] Register AscendSharedFusedMoE custom op (#2980 ) ### What this PR does / why we need it? Register `AscendSharedFusedMoE` custom op. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `DeepSeek-V2-Lite` is a MoE model with shared experts. Test: ```bash vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \ --trust-remote-code \ --enforce-eager \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.95 curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite", "messages": [ {"role": "user", "content": "介绍一下联通公司？"} ], "stream": false, "max_tokens": 100 }' ``` Output: ```bash 中国联合网络通信集团有限公司（简称“中国联通”）于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成，在国内31个省（自治区、直辖市）和境外多个国家和地区设有分支机构，是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业，连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务，移动通信业务，国内 ``` - vLLM version: v0.10.2 - vLLM main: `486c5599e3` --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: shen-shanshan <467638484@qq.com>	2025-09-19 19:05:01 +08:00
sdmyzlp	05a700d370	[Bugfix] Fix async copy bug under single expert scenario (#3005 ) Add the missing barrier when no implicit synchronization by `repeat_interleave` is available. Otherwise, the `non_blocking=True` copy of `output_splits` and `input_splits` from the NPU may fail to complete before the later `async_all_to_all` uses them. ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `ef7eefe17a` Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-09-19 14:05:36 +08:00
xuyexiong	2a87b4cecb	[Bugfix] Fix specdecoding in chunkedprefill scenario (#3025 ) ### What this PR does / why we need it? The speculative decoding phase of chunked prefill took an incorrect path; speculative decoding should always use the TND layout. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `6d8246aaff` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-19 14:05:08 +08:00
Song Zhixin	833cd1b698	[BugFix] Async scheduling and PP compatibility with DP (#2796 ) ### What this PR does / why we need it? based on the https://github.com/vllm-project/vllm/pull/23770, fix Async scheduling and PP compatibility with DP, also fixes issue with finished requests not being processed in async scheduling and PP cases, and possible worker race conditions. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `544fe76b95` --------- Signed-off-by: jesse <szxfml@gmail.com>	2025-09-19 11:29:50 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlapping with the routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
zhangxinyuehfad	0c04bf1e36	[Fixbug] Fix accuracy for DeepSeek-V2-Lite (#3016 ) ### What this PR does / why we need it? Fix accuracy for DeepSeek-V2-Lite ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: `66072b36db` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-09-18 23:58:23 +08:00
Mengqing Cao	367edff5af	[HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007 ) ### What this PR does / why we need it? This PR fixes a few issues with prefill disaggregation: 1. Fix the prefill disaggregation kvcache addr alignment issue; llmdatadist needs the addr of tensors to be aligned to 2M. 2. Fix the prefill disaggregation kvcache shape error; llmdatadist requires k/v tensors with shape [num_blocks, ...], but the implementation before this PR used [2, num_blocks, ...], which breaks prefill disaggregation. 3. Use hybrid kv cache only when running qwen3_next to fix an accuracy issue with prefill disaggregation. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Tested locally by @liziyu179 - vLLM version: v0.10.2 - vLLM main: `4f02b77de4` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-18 21:43:22 +08:00
Icey	acb46f303f	Fix VocabParallelEmbedding UT (#2722 ) ### What this PR does / why we need it? Fix VocabParallelEmbedding UT ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: `f592b3174b` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-18 19:54:01 +08:00
Li Wang	01592515b8	[Bugfix] Fix sleep mode level 2 (#1376 ) ### What this PR does / why we need it? For sleep mode level 2, we discard both the model weights and the kv_cache. The problem is that when we discard the weights, we also discard some tensors that represent model state, namely `model.named_buffers()`, such as `running_mean / running_var` in BatchNorm and the rope cos-sin cache. If we update the weights but forget to update the buffers as well, this leads to unknown issues. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `5963b98b46` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-18 19:51:52 +08:00
LeeWenquan	f4e3d22432	Remove chunked_prefill_for_mla and fix ring_mla bug (#2781 ) ### What this PR does / why we need it? Remove the chunked_prefill_for_mla branch in mla, and change the dtype of prefill_mask to avoid an accuracy problem. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `ef7eefe17a` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-09-18 19:43:26 +08:00
linfeng-yuan	79a910ef47	[bugfix][torchair] fix multistream_moe problems in torchair graph mode (#2681 ) This pr fixes two problems while `multistream_moe` enabled in torchair graph mode: 1. check `TorchairAscendW8A8DynamicFusedMoEMethod` instead of incorrect `AscendW8A8DynamicFusedMoEMethod` 2. mc2_mask should be chunked no matter `replace_allreduce` is True or False in forward function of `TorchairAscendFusedMoE` - vLLM version: v0.10.2 - vLLM main: `0fb2551c23` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-18 17:35:04 +08:00
Li Wang	4267f5d55f	[Doc] Add multi-node ray backend tutorial (#2376 ) ### What this PR does / why we need it? Add multi-node ray backend tutorial for Qwen235B-A3B ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f4cd80f944` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-18 15:30:18 +08:00
realliujiaxu	af2a886814	refactor linear (#2867 ) ### What this PR does / why we need it? The current linear.py has the following issues: - There is redundant conditional logic in the `comm_group` and `forward` selection for classes such as `AscendMergedColumnParallelLinear`. - Inconsistent comm_group selection logic exists among `AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and `AscendQKVParallelLinear`. To address these two issues, this PR encapsulates `comm_group` and `forward` into classes and extracts the classes selection logic into common functions. For future additions of custom communication groups or forward methods, it will only be necessary to extend `CustomColumnParallelOp` or `CustomRowParallelOp` and add new selection logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `dd39baf717` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: weijinqian0 <weijinqian@huawei.com>	2025-09-18 14:09:19 +08:00
panchao-hub	a7f8ed38ed	[Bugfix]:replace npu_incre_flash_attention with npu_fused_infer_atten… (#2901 ) ### What this PR does / why we need it? [Bugfix]:replace npu_incre_flash_attention with npu_fused_infer_attention_score in order to be able to tiling update ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `2b85697031` Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com>	2025-09-18 14:06:08 +08:00
xuyexiong	6681dde902	[Feat][Graph] Support MTP for ACL Graph (#2932 ) ### What this PR does / why we need it? This PR depends on the merge of #2707 and has adapted the aclgraph functionality to support MTP. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `2b85697031` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-18 14:05:33 +08:00
Chao Lei	cef43b524e	[Feat] A Connector that supports Mooncake store (#2913 ) ### What this PR does / why we need it? Added a new connector for Mooncake store integration to enable kvcache reuse in scenarios with system prompts or multi-turn dialogues. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `5963b98b46` --------- Signed-off-by: LCAIZJ <leichao139636@163.com> Signed-off-by: fems14 <1804143737@qq.com> Co-authored-by: fems14 <1804143737@qq.com> Co-authored-by: Dreamerleader <2270923832@qq.com> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: lizy124 <1950471827@qq.com> Co-authored-by: zouyida2052 <zouyida2002@gmail.com>	2025-09-18 14:04:45 +08:00
realliujiaxu	723d460894	[Bugfix] fix kv nz accuracy bug (#2988 ) when `enable_kv_nz` is true, output of Deepseek R1 is invalid. - vLLM version: v0.10.2 - vLLM main: `2b85697031` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-09-17 21:10:25 +08:00
linfeng-yuan	8bcc0ccd57	[bugfix] fix shared expert dp with hybrid kvcache (#2964 ) ### What this PR does / why we need it? https://github.com/vllm-project/vllm-ascend/pull/2849 moves the implementation of `shared_expert_dp` to torchair deepseek_modeling. However, calling `set_forward_context` with `enforce_eager` and `shared_expert_dp` falls back to the implementation in model_runner_v1.py and sets the global attn_metadata as a dictionary. This leads to a RuntimeError when attn_metadata is retrieved from the forward context and used in torchair_deepseek_v2.py. This PR fixes the problem by introducing the transformation of attn_metadata in this file. Note that current E2E testing lacks a deepseek case with `shared_expert_dp`. We need to add an ST with `shared_expert_dp` to the testing workflow. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? e2e vllm serving with `enable_shared_expert_dp: true` passed. - vLLM version: v0.10.2 - vLLM main: `de3e53a75b` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-17 20:01:47 +08:00
1Fire4	1f6465c399	Add an option of enable frozen parameter (#2869 ) ### What this PR does / why we need it? Add an option of enable frozen parameter ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>	2025-09-17 12:00:44 +08:00
offline893	76844eec78	Dynamic Expert Load Balance with Zero-like-overhead (#2956 ) ### Motivation Currently, dynamic expert balancing stops the world. Asynchronous expert load balancing would be better, avoiding the following problems: Host-bound latency: EPLB involves many CPU operations, such as the eplb algorithm, creating p2p ops, and log2phy expert conversion, which take a long CPU time, about 1s. Communication latency: Transfer time is costly without nvlink. Because the weight of an expert may be transferred to multiple new positions, one expert needs N send/recv operations, resulting in long latency. We measured that batch_isend_irecv costs more than 100ms to transmit the weights of 16 experts on an Ascend A2 server. SwiftBalancer no longer stops the world: in our test on NPU it costs 1~2ms per layer while improving decode latency by 5ms-8ms with ep_size = 64. The following updates have been made: 1. Expert distribution recording with lower cost. 2. Async CPU computing for the eplb algo and other Python operators. 3. A new eplb algo with less expert rebalancing and almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the vLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: 1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor shape and reduces the cost of HBM recording and the add operator. 2. Do an all-gather of the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume. 3. Wake up the eplb worker process with the expert distribution when num_iterations is reached. Run the eplb algorithm in the eplb worker. 4. Generate p2p send/recv ops and other operators, such as log2phy, which cost a long CPU time. 5. Launch ibatch_send_recv in async_stream before forward. 6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `567939953b` --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>	2025-09-17 10:36:43 +08:00
xuyexiong	ae758dda05	[Bugfix] Fix mtp torchair in pd Disaggregation scenario (#2951 ) ### What this PR does / why we need it? 1. In memory of #2509, fix mtp torchair in the pd disaggregation scenario. 2. Fix the mla bug in the SpecDecoding scenario, since num_decodes != num_decode_tokens. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `5206ab20ba` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-17 09:07:58 +08:00
rjg-lyh	6b7117dbb7	[main] addrmsnorm + quant fusion optim in Dense Models (#2772 ) ### What this PR does / why we need it? This PR fused addrmsnorm op and w8a8 quant op to get better perf. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.2 - vLLM main: `0faf3cc3e8` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-09-16 22:31:38 +08:00
yiz-liu	88ca8a051c	[Feat][Graph] Support DeepSeek with ACL Graph (#2707 ) ### What this PR does / why we need it? In memory of #677 , a long overdue milestone. Now DeepSeek V3/R1 should be OK with ACL Graph. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Working on it. - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 17:50:17 +08:00
dependabot[bot]	3e60aa5483	Bump actions/setup-python from 5.4.0 to 6.0.0 (#2926 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.4.0 to 6.0.0. - vLLM version: v0.10.2 - vLLM main: `3f3313981c` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-09-16 14:15:10 +08:00
linfeng-yuan	1c5900327b	[refactor] refactor deepseek-related files (#2849 ) ### What this PR does / why we need it? This PR deletes ~2K lines of code about deepseek modeling. It falls back CustomDeepseekV2 modules to original vllm implementations and adapts some modifications in vllm about deepseek and moe. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with torchair graph mode and eager mode. - vLLM version: v0.10.2 - vLLM main: `759ef49b15` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 14:13:07 +08:00
weichen	18ca7861f6	[Main] [Refactor] Enable MoECommMethod in Eager Mode (#2791 ) ### What this PR does / why we need it? 1. Replace prepare/finalize operation in fused_moe.py by moe_comm_method.prepare()/finalize() 2. Replace unified_fused_experts by moe_comm_method.fused_experts() in fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py 3. Add calling _select_moe_comm_method in spec-decode proposers. 4. Currently, w4a8_dynamic does not support gatherep, use all2allv instead. 5. Remove redundant code. ### Does this PR introduce _any_ user-facing change? AllgatherEP switch is disabled in aclgraph/eager mode, just follow the rules in modelrunner_v1._select_moe_comm_method() ### How was this patch tested? e2e & ut - vLLM version: v0.10.2 - vLLM main: `7f6f2c1182` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-16 11:06:00 +08:00
Yikun Jiang	0aba644633	Update max_tokens and prompt in qwen3 online doc (#2945 ) ### What this PR does / why we need it? Update max_tokens and prompt in qwen3 online doc Before: ``` "'max_tokens' or 'max_completion_tokens' is too large: 4096. This model's maximum context length is 4096 tokens and your request has 18 input tokens (4096 > 4096 - 18). None" ``` After: ``` curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct", "messages": [ {"role": "user", "content": "Who are you?"} ], "temperature": 0.6, "top_p": 0.95, "top_k": 20, "max_tokens": 32 }' .{"id":"chatcmpl-8ddbd65c9ddc405397219a6792feb9a0","object":"chat.completion","created":1757985049,"model":"/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to assist you in generating various","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":44,"completion_tokens":32,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Manually test on my local env - CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-16 09:27:50 +08:00
wangxiyuan	048bfd5553	[Release] Add release note for v0.10.2rc1 (#2921 ) Add release note for v0.10.2rc1 - vLLM version: v0.10.2 - vLLM main: `b834b4cbf1` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-16 01:20:05 +08:00
wangxiyuan	c556038ef0	[New model] Qwen3-next support (#2917 ) ### What this PR does / why we need it? Add Qwen3-next support. ### Does this PR introduce _any_ user-facing change? Yes, users can use Qwen3 next. Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 the tutorial will be ready in [here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) ### How was this patch tested? Doc CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/2884 Co-Authored-By: Angazenn <supperccell@163.com> Co-Authored-By: zzzzwwjj <1183291235@qq.com> Co-Authored-By: MengqingCao <cmq0113@163.com> Co-Authored-By: linfeng-yuan <1102311262@qq.com> Co-Authored-By: hust17yixuan <303660421@qq.com> Co-Authored-By: SunnyLee219 <3294305115@qq.com> Co-Authored-By: maoxx241 <maoxx241@umn.edu> - vLLM version: v0.10.2 - vLLM main: `b834b4cbf1` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Angazenn <supperccell@163.com> Signed-off-by: Your Name <you@example.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: hust17yixuan <303660421@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Angazenn <supperccell@163.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: hust17yixuan <303660421@qq.com>	2025-09-16 01:17:42 +08:00
Yikun Jiang	b5ccef6115	[Doc] Add doc for Qwen3 Next (#2916 ) ### What this PR does / why we need it? Add doc for Qwen3 Next ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doc CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/2884 - vLLM version: v0.10.2 - vLLM main: `01413e0cf5` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-16 01:16:06 +08:00
liziyu	aa3c4563ce	fix all cards super_pod_id same on A3 & proxy support min_tokens (#2939 ) ### What this PR does / why we need it? fix all cards super_pod_id same on A3 & proxy support min_tokens ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? 2*A3 gen ranktable before: "prefill_device_list": [ { "server_id": "xxx", "device_id": "0", "device_ip": "xxx", "super_pod_id": "0", "super_device_id": "106758159", "cluster_id": "1" }, { "server_id": "xxx", "device_id": "1", "device_ip": "xxx", "super_pod_id": "0", "super_device_id": "106758159", "cluster_id": "2" }... after: "prefill_device_list": [ { "server_id": "xxx", "device_id": "0", "device_ip": "xxx", "super_pod_id": "0", "super_device_id": "104857600", "cluster_id": "1" }, { "server_id": "xxx", "device_id": "1", "device_ip": "xxx", "super_pod_id": "0", "super_device_id": "104923137", "cluster_id": "2" }... --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-09-16 01:09:18 +08:00
wangxiyuan	382c29f3e1	[BugFix] Fix world size bug in model_runner (#2915 ) - Fix world size bug in model_runner to make sure ep>16 runs with MC2 - enable e2e test for vl Co-Authored-By: whx-sjtu <2952154980@qq.com> Co-Authored-By: Icey <1790571317@qq.com> - vLLM version: v0.10.2 - vLLM main: `3e903b6cb4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-14 12:20:25 +08:00
fan2956	c5a502fd2e	main add ascend scheduler support multimodal (#2844 ) ### What this PR does / why we need it? On main, AscendScheduler does not support multimodal models because it lacks scheduled_encoder_inputs, which is needed for multimodal inference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: main@93e28e6862669e3b5cf47cea9f782a65ec47e155 - vLLM version: v0.10.2rc2 - vLLM main: `15b8fef453` --------- Signed-off-by: fan2956 <zhoufan53@huawei.com> Co-authored-by: zhoufan2956 <zhoufan2956@163.com>	2025-09-14 09:38:51 +08:00
Yikun Jiang	0747a6e68c	Bump vLLM version to v0.10.2 (#2914 ) ### What this PR does / why we need it? Bump vLLM version to v0.10.2 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2rc3 - vLLM main: `15b8fef453` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-14 06:57:59 +08:00
Yikun Jiang	f97a64ba7f	Bump vLLM version to v0.10.2rc3 (#2911 ) ### What this PR does / why we need it? Bump vLLM version to v0.10.2rc3 https://github.com/vllm-project/vllm/compare/v0.10.2rc2...v0.10.2rc3 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2rc2 - vLLM main: `15b8fef453` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 19:15:48 +08:00
Yikun Jiang	8ece6956e7	Revert "Upgrade CANN version to 8.3.rc1.alpha001 (#2903 )" (#2909 ) ### What this PR does / why we need it? This reverts commit `339fceb89c`. ### Does this PR introduce _any_ user-facing change? Yes, use 8.2rc1 image by default ### How was this patch tested? CI passed - vLLM version: v0.10.2rc2 - vLLM main: `cfa3234a5b` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 16:21:54 +08:00
zxr2333	0a27705917	fix mooncake connector adxl hostname usage (#2824 ) ### What this PR does / why we need it? This PR is used to adapt the hostname format for Mooncake when using adxl. When Mooncake uses adxl, it is necessary to set ```USE_ASCEND_DIRECT``` to True in the file ```/Mooncake/mooncake-common/common.cmake``` during compilation. The mooncake_connector obtains this config by calling ```vllm_config.kv_transfer_config.get_from_extra_config```, determines whether Mooncake is using adxl, and selects the corresponding hostname format. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: main - vLLM main: `d21a36f5f9` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-09-13 14:38:48 +08:00
Yikun Jiang	d2250c80b5	Enable push trigger for image job (#2906 ) ### What this PR does / why we need it? Enable push trigger for image job ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Followup on https://github.com/vllm-project/vllm-ascend/pull/2864 - vLLM version: v0.10.2rc2 - vLLM main: `89e08d6d18` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 12:31:36 +08:00
Yikun Jiang	339fceb89c	Upgrade CANN version to 8.3.rc1.alpha001 (#2903 ) ### What this PR does / why we need it? Upgrade CANN version to 8.3.rc1.alpha001 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2rc2 - vLLM main: `89e08d6d18` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 12:10:21 +08:00
Jiawei Li	e57cca971c	Fix the bugs about operator registration by PyTorch Dispatcher (#2786 ) Background: There are two principles about operator registration in PyTorch - The same namespace can be only registered once by `TORCH_LIBRARY` - The operator signatures can be only registered once by `def` Considering that all custom operators defined in the current repo are only used by Ascend, instead of defining a common operator schema by vLLM, all accelerators then follow this operator schema and complete the implementation based on their respective hardware, which is conducive to functional abstraction. Therefore, we can rename the operator registration namespace to an Ascend-specific namespace(_C_ascend). Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742 - vLLM version: main - vLLM main: `f592b3174b` Signed-off-by: FFFrog <ljw1101.vip@gmail.com>	2025-09-13 11:58:52 +08:00
Yikun Jiang	138e932630	Bump vLLM version to v0.10.2rc2 (#2902 ) ### What this PR does / why we need it? Upgrade vLLM version to 0.10.2rc2 ### Does this PR introduce _any_ user-facing change? Yes, image will use 0.10.2rc2 vLLM ### How was this patch tested? - vLLM version: main - vLLM main: `f17c075884` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 11:39:48 +08:00
rjg-lyh	585a494baa	[Core] Disable the chunked prefill feature in Non-MLA LLMs (#2894 ) ### What this PR does / why we need it? This PR forcibly disables the chunked prefill feature in Non-MLA models, as the performance of the operators supporting it is currently suboptimal. The feature is allowed only when the user has enabled chunked prefill in the ascend_scheduler_config. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Related: https://github.com/vllm-project/vllm-ascend/pull/2659 - vLLM version: main - vLLM main: `d21a36f5f9` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-09-12 23:17:09 +08:00
Yikun Jiang	756b8a1946	Revert "[Feat] Unquantized linear nz support (#2619 )" (#2896 ) ### What this PR does / why we need it? This reverts commit `7b2ecc1e9a`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: main - vLLM main: `64d90c3e4f` Closes: https://github.com/vllm-project/vllm-ascend/issues/2890 Closes: https://github.com/vllm-project/vllm-ascend/issues/2887 Closes: https://github.com/vllm-project/vllm-ascend/issues/2885 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-12 20:51:12 +08:00
rjg-lyh	fc2bcbe21c	[Ops] Fix bug in register_custom_ops without forward_context (#2883 ) ### What this PR does / why we need it? This PR fixed the bug in register_custom_ops without forward_context. We set try-except to consider this situation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: `7920de0a2a` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-09-12 16:58:08 +08:00
Yikun Jiang	6d8bc38c7b	Enable label-based image test and use free runner to run lint (#2864 ) ### What this PR does / why we need it? - Enable label-based image test and use free runner to run lint - soft revert `26f388ba08` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: main - vLLM main: `404c85ca72` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-12 10:49:42 +08:00

1023 Commits