xc-llm-ascend

Author	SHA1	Message	Date
Shanshan Shen	6c478531f8	[CustomOp] Register AscendApplyRotaryEmb CustomOp and remove related patch (#4667 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with 
“TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-23 10:04:37 +08:00
zhangxinyuehfad	61efaffcaf	[Bugfix] Implement multimodal_cpu_fields in model runner (#5196 ) ### What this PR does / why we need it? Related to https://github.com/vllm-project/vllm-ascend/issues/4084 Implement multimodal_cpu_fields in model runner - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-22 18:39:45 +08:00
Shanshan Shen	b84ad8c5d8	[CustomOp] Register AscendMMEncoderAttention CustomOp and remove related patch (#4750 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/30125, register `AscendMMEncoderAttention` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ✅ Run Qwen2.5-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ✅ Run Qwen3-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is 
written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-22 14:32:53 +08:00
Ascendyh	b2c121637f	[task] Add fused gdn gating triton kernel (#4304 ) ### What this PR does / why we need it? This commit introduces a Triton-based fused GDN gating kernel for Ascend NPU, aimed at improving performance in the Gated Delta Net workflow. ### Does this PR introduce _any_ user-facing change? It only adds and refactors internal Triton kernels and wrappers for Ascend. These are backend implementation details. There are no new APIs, flags, CLI options, or behavior changes visible to end users. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com>	2025-12-22 14:09:19 +08:00
XiaoxinWang	0cc3fc357f	[perf] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818 ) ### What this PR does / why we need it? Add the fused_sigmoid_gating_delta_rule_update op for qwen3_next, which fuses fused_gdn_gating and fused_recurrent_gated_delta_rule into one op. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-19 16:34:11 +08:00
ZT-AIA	39fb9e7c83	qwen3_next add triton ops : fused_qkvzba_split_reshape (#4788 ) ### What this PR does / why we need it? add triton ops fused_qkvzba_split_reshape_cat for qwen3_next GatedDeltaNet ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2025-12-18 11:31:04 +08:00
Canlin Guo	bb3a826e08	[Refactor] Remove the process patches of Qwen2.5-VL and Qwen2.5-Omni (#5035 ) ### What this PR does / why we need it? Related to #4084. Previously, we temporarily added patches so that `set_forward_context` was replaced by `set_ascend_forward_context` in the functions `_process_image_input` and `_process_video_input` of the `Qwen2.5-VL` and `Qwen2.5-Omni` models. After removing these patches, I hit an `AttributeError` because `ForwardContext` is missing `prefetch_mlp_enabled`, so we need to add a defensive check for `prefetch_mlp_enabled`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` vllm serve Qwen/Qwen2.5-VL-7B-Instruct \ --max-model-len 30000 \ --max-num-batched-tokens 50000 \ --max-num-seqs 30 \ --no-enable-prefix-caching \ --trust-remote-code \ --dtype bfloat16 ``` ``` {"id":"chatcmpl-b66d8acb76905c49","object":"chat.completion","created":1765796863,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration reads \"TONGYI Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":88,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-16 11:43:52 +08:00
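The defensive check described in #5035 can be illustrated with a minimal sketch. This is not the project's actual code: the helper `is_prefetch_mlp_enabled` and the `SimpleNamespace` stand-ins are hypothetical, and only the attribute name `prefetch_mlp_enabled` comes from the commit message. The idea is to read the attribute with a default instead of assuming every forward context defines it.

```python
from types import SimpleNamespace


def is_prefetch_mlp_enabled(forward_context: object) -> bool:
    """Return the flag if the context defines it, otherwise False.

    Hypothetical helper: getattr with a default avoids the AttributeError
    raised when a context object lacks the Ascend-specific attribute.
    """
    return bool(getattr(forward_context, "prefetch_mlp_enabled", False))


# Hypothetical stand-ins for a context without and with the attribute.
plain_context = SimpleNamespace()
ascend_context = SimpleNamespace(prefetch_mlp_enabled=True)

print(is_prefetch_mlp_enabled(plain_context))   # False
print(is_prefetch_mlp_enabled(ascend_context))  # True
```

Defaulting to `False` is an assumption made for this sketch; it treats a missing flag as "feature off", which is the conservative choice.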
realliujiaxu	9e24bdd44c	[Feat] Refactor rejection sampler (#4975 ) ### What this PR does / why we need it? Currently, we are using `AscendRejectionSampler`, which extends `RejectionSampler`, in spec decoding. `AscendRejectionSampler` overrides `forward` of `RejectionSampler` only to replace the `rejection_sample` func. As a result, a lot of `RejectionSampler` code cannot be reused, for example: - https://github.com/vllm-project/vllm/pull/19482 - https://github.com/vllm-project/vllm/pull/26060 - https://github.com/vllm-project/vllm/pull/29223 #### Proposed Change: - Delete `AscendRejectionSampler` and use `RejectionSampler` directly in model runner. - Patch `RejectionSampler.expand_batch_to_tokens` and `RejectionSampler.rejection_sample`; a better way may be to make them custom ops. - Modify `NPUModelRunner` following https://github.com/vllm-project/vllm/pull/26060 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - [x] test logits processor for spec decoding - [x] test logprobs for spec decoding - [x] test logprobs for spec decoding + async scheduling (test with https://github.com/vllm-project/vllm-ascend/pull/4893/) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-12-16 11:32:26 +08:00
drslark	8fb0ef5ffa	[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#4932 ) ### What this PR does / why we need it? Fixes an accuracy bug of Qwen3-next-MTP during batched inference. It is described in https://github.com/vllm-project/vllm-ascend/issues/4930. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com>	2025-12-15 13:22:30 +08:00
QilaiZhang	78bf211539	[OPS] support triton causal_conv1d_fn ops (#4119 ) ### What this PR does / why we need it? Support triton causal_conv1d_fn ops. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QilaiZhang <245706640@qq.com>	2025-12-11 15:52:39 +08:00
wangxiyuan	3362be7f86	Update patch doc (#4869 ) Update patch doc. After this PR is merged, all the new patch PR should update this doc as well. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-10 23:27:45 +08:00
drslark	0fb1dc43a1	[BugFix][main] Adapted Qwen3-Next-MTP to chunked prefill (#4770 ) ### What this PR does / why we need it? The pad `-1` modification is from https://github.com/vllm-project/vllm/pull/25743. It still has bugs for batched chunked prefill. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 22:54:24 +08:00
lianyibo	e32014ac1d	[Model] Support pooling models (#3122 ) ### What this PR does / why we need it? Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend, this pr covered the three model types of embed (cls_token, mean_token, lasttoken). After this [commit](`17373dcd93`), vllm has provided support for adapting pooling models on the v1 engine. This PR includes corresponding adaptations on the vllm-ascend side. Fixes #1960 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lianyibo <lianyibo1@kunlunit.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-12-10 11:37:57 +08:00
Shanshan Shen	fb15fec662	[MM][Patch] Remove patch for cos/sin cache (#4672 ) ### What this PR does / why we need it? Remove patch for https://github.com/vllm-project/vllm/pull/28798. - vLLM version: v0.12.0 Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-04 22:30:06 +08:00
wangxiyuan	3f4c0ea0a0	upgrade vLLM to 0.12.0 tag (#4647 ) Upgrade vLLM to v0.12.0 tag - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-03 23:43:05 +08:00
LeeWenquan	38bd95229f	[Model] Add qwen3Next support in Main (#4596 ) ### What this PR does / why we need it? Add Qwen3Next support in main ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-12-03 14:17:37 +08:00
wangxiyuan	7f2673ea2d	upgrade vLLM to main (#4608 ) 1. fix https://github.com/vllm-project/vllm/pull/28542 The model structure modifications we are involved in are: - Qwen2.5-VL (some patches still exist) - Qwen2-VL - Qwen2 - DeepSeek series - Qwen-moe series 2. fix https://github.com/vllm-project/vllm/pull/29121 the output token type changed from np to `list[list[int]]` 3. fix https://github.com/vllm-project/vllm/pull/29262 the `xformers` backend for multimodal has been deprecated 4. fix https://github.com/vllm-project/vllm/pull/29342 5. fix https://github.com/vllm-project/vllm/pull/28579 6. fix https://github.com/vllm-project/vllm/pull/28718 7. fix https://github.com/vllm-project/vllm/issues/28665 8. fix https://github.com/vllm-project/vllm/pull/26847 vllm introduced the `optimization-level`, some default configs have been changed, and the param `--enforce-eager` has been deprecated 9. fix http://github.com/vllm-project/vllm/pull/29223 it returns a tuple for the sampler. 10. fix https://github.com/vllm-project/vllm/pull/29471 we'll remove the related patch to avoid this kind of error. Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-12-02 22:10:52 +08:00
Shanshan Shen	6b9a997076	[MM][Model] Remove Qwen3-VL modeling files (#4577 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove Qwen3-VL modeling files. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-12-02 07:33:17 +08:00
Ting FU	e8e20c0bbf	[BugFix] Fix Qwen2.5_Omni vision customized op attr err (#4568 ) The Qwen2.5_Omni vision tower uses AscendRMSNorm, which contains a property function. It would be overridden by set_forward_context(), so patch Qwen2_5OmniThinkerForConditionalGeneration with customized _process_image_input() and _process_video_input() functions to fix it. ### What this PR does / why we need it? Fix the Qwen2.5_Omni model image/video inference issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: Ting FU <futing10@huawei.com>	2025-12-01 09:18:55 +08:00
Shanshan Shen	2a19215e5f	[MM][Model] Remove Qwen2-VL modeling files (#4534 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove Qwen2-VL modeling files. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-29 18:07:01 +08:00
shiyuan680	1c4a0468ee	【OPS】qwen3-next support triton chunk_gated_delta_rule ops (#4070 ) ### What this PR does / why we need it? qwen3-next supports triton chunk_gated_delta_rule ops ### co-owners @OsirisDuan - vLLM version: v0.11.2 Signed-off-by: shiyuan680 <917935075@qq.com>	2025-11-28 20:55:43 +08:00
Shanshan Shen	e52ebf8674	[MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention (#4349 ) ### What this PR does / why we need it? - [x] Patch `Qwen2_5_VisionAttention` with `AscendQwen2_5_VisionAttention`. - [x] Replace `AscendQwen2_5_VisionTransformer` with `Qwen2_5_VisionTransformer` in vllm. - [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of `Qwen2_5_VisionAttention`. - [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative form to intervals and move it to cpu (compatible with npu FA). - [x] Remove Qwen2.5-VL modeling files. - [x] Remove Qwen2.5-VL (without padding) modeling files. - [x] Remove related UT. - [x] Make `set_forward_context` pluggable when getting MM embedding. Find more details at https://github.com/vllm-project/vllm/pull/29388. - [x] Simplify padding logic for FA. - [x] Add patch for https://github.com/vllm-project/vllm/pull/28798. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - [x] Functional test (eager mode) - [x] Functional test (graph mode) - [x] Benchmark - vLLM version: v0.11.2 --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-28 14:23:00 +08:00
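The `cu_seqlens` conversion listed in #4349 can be sketched in plain Python. This is an illustration only, under the assumption that "intervals" means per-sequence lengths; the function name `cu_seqlens_to_lengths` is hypothetical, and the real change operates on tensors and also moves the result to the CPU, which is not shown here.

```python
def cu_seqlens_to_lengths(cu_seqlens: list[int]) -> list[int]:
    """Convert cumulative sequence offsets into per-sequence lengths.

    cu_seqlens holds prefix sums starting at 0, so each length is the
    difference between two consecutive entries.
    """
    return [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]


# Three packed sequences of 4, 6 and 6 tokens.
print(cu_seqlens_to_lengths([0, 4, 10, 16]))  # [4, 6, 6]
```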
shiyuan680	d5f77f14d0	mkdir triton package and move triton files (#4420 ) ### What this PR does / why we need it? mkdir triton package and move triton files - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: shiyuan680 <917935075@qq.com>	2025-11-26 11:06:12 +08:00
wangxiyuan	98031653df	[misc] Remove useless patch_logits (#4252 ) Torch-npu 2.7.1 has fixed the device check bug. This patch can be removed now. - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-25 21:25:54 +08:00
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377 ) There is a lot of hack code for v0.11.0, which makes the code hard to upgrade to a newer vLLM version. Since v0.11.0 will be released soon, let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
anon189Ty	5c9f4a40c6	[Feat] Support running MTP in full graph mode (#3892 ) ### What this PR does / why we need it? Currently, the MTP model still runs in eager mode when full graph mode is enabled. This PR adapts the MTP to full graph capture and execution. When the graph mode is set to "FULL_DECODE_ONLY", the MTP will run in full graph mode to improve performance. The changes, which cover both the disable_padded_drafter_batch=True and False cases, include: 1. Add _mtp_graph_params in acl_graph.py to isolate the data of the main model from the data of MTP. 2. Pad some metadata in mla_v1.py when in full graph mode. 3. Fix the essential data addresses that will be used in model.forward. 4. Adapt to the aclgraph capture framework: 1). Rebuild the MTP model with ACLGraphWrapper. 2). Add common attn metadata when starting capture in MTP dummy_run. 3). Add common attn metadata update in MTP. 4). Adapt the data update when num_speculative_tokens > 1. 5. Add a patch of MTP to adapt to vllm v0.11.0. Existing Issues: 1. When disable_padded_drafter_batch=True and running in FullGraph mode, the data of the first-round requests in MTP is abnormal. We need to identify the cause subsequently. 2. When disable_padded_drafter_batch=False and running in FullGraph mode, the acceptance rate of the second and third tokens will decrease (For example, if we set the num_speculative_tokens=3, the acceptance rate of first token is 90%, the second is only 50% lower than 60%, the third is only 20% lower than 30%). The reason is that the data processed after the model runs does not match. This is a problem from another PR. It works fine in eager and PIECEWISE mode, but has problems in FullGraph mode. Once we have a solution, we will submit a bugfix. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-11-20 20:34:54 +08:00
22dimensions	c272747d13	Upgrade to 0.11.1 newest vllm commit (#3982 ) ### What this PR does / why we need it? adapt vllm-ascend main branch with vllm releases/v0.11.1 fix `forward context not set` in test_vlm.py caused by: https://github.com/vllm-project/vllm/pull/23207 fix import `cdiv round` failed caused by: https://github.com/vllm-project/vllm/pull/27188 fix import `init_cached_hf_modules` failed caused by: https://github.com/vllm-project/vllm/pull/27567 adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel` caused by: https://github.com/vllm-project/vllm/pull/27654 - remove unused code in sigmoid_gating.py - `class FusedRecurrentFunction` , `fused_recurrent_gated_delta_rule`, `fused_recurrent_gated_delta_rule_fwd` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-11-12 23:01:19 +08:00
whx	f6149f3894	[Model][3/N] Refactor sfa into mla and remove deepseek_v3_2.py (#3769 ) This is the follow-up PR to PR #3189, which continues to refactor sfa into mla and finally removes deepseek_v3_2.py. This is the last PR of the deepseek modeling refactoring. After this, all deepseek-related model code is removed from vllm_ascend. Furthermore, after this PR deepseek v3.2 can run chunked prefill with correct accuracy. - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-30 17:06:38 +08:00
Icey	d9cdc65854	Upgrade to new vllm commit (#3719 ) ### What this PR does / why we need it? Upgrade to new vllm commit: `c9461e05a4` - Fix many imports, caused by https://github.com/vllm-project/vllm/pull/26908 - Fix import ```sha256```, caused by https://github.com/vllm-project/vllm/pull/27169 - Remove ```SchedulerConfig.send_delta_data```, caused by https://github.com/vllm-project/vllm/pull/27142 - Fix ```FusedMoE``` because of dual stream execution, caused by https://github.com/vllm-project/vllm/pull/26440 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-10-25 15:36:32 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is step 1 of refactoring the code to adapt to vllm main, and this PR is aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. bunches of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import, caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` import, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` import, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured output break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? 
### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
fems14	2bcadcb9d5	【main】patch sched_yield (#3648 ) ### What this PR does / why we need it? On Arm systems, os.sched_yield() does not take effect, causing the GIL (Global Interpreter Lock) to remain unrelinquished and resulting in CPU bound issues. This PR applies a patch to sched_yield in vLLM, making the process execute time.sleep(0) instead to release the GIL. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 00:06:45 +08:00
whx	72695c97d0	[BugFix][main] Fix quantization related mtp bug with patch (#3620 ) vLLM 0.11.0 didn't include PR (https://github.com/vllm-project/vllm/pull/25805), so the prefix of mtp's SharedHead is missing. This PR fixes this bug with a patch to vllm's deepseek_mtp. main also needs this bugfix to support vllm's v0.11.0 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-23 09:54:31 +08:00
wangxiyuan	13e8e75143	[Refactor] refactor patch module (#3555 ) ### What this PR does / why we need it? we notice that `patch_main` is never used. Usually the patch is for all version. And if it's for specified version, we can use `vllm_version_is` instead. So let's remove the useless sub folder in patch module to make it clear. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-21 20:19:46 +08:00
Wang Kunpeng	4b3bd4f397	[main][bugfix] bugfix for minicpm models (#3527 ) ### What this PR does / why we need it? bugfix for minicpm-2b and minicpm3-4b - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-10-19 11:00:55 +08:00
Mengqing Cao	8abe517870	[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432 ) ### What this PR does / why we need it? Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch. The final goal is to remove all the patches and align the code arch to vllm, thus we need to do the following work in next prs. TODO: - [x] remove patch on attention spec - [ ] refactor the kvcache creation logic ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? 1. CI passed with existing test. 2. Test pass with deepseek-v3.2-exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-15 17:48:58 +08:00
xuyexiong	02c26dcfc7	[Feat] Supports Aclgraph for bge-m3 (#3171 ) ### What this PR does / why we need it? [Feat] Supports Aclgraph for bge-m3 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` pytest -s tests/e2e/singlecard/test_embedding.py pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py ``` to start an online server with bs 10, each batch's seq length=8192, we set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked: ``` vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}' ``` For bs10, each batch's seq length=8192, QPS is improved from 85 to 104, which is a 22% improvement, lots of host bound is reduced. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com> Co-authored-by: wangyongjun <1104133197@qq.com>	2025-10-14 23:07:45 +08:00
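The numbers quoted in #3171 can be checked with a few lines of arithmetic. The sketch below only restates figures from the commit message (batch size 10, sequence length 8192, QPS 85 before and 104 after); the helper name `relative_improvement` is illustrative.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the fractional gain of `after` over `before`."""
    return (after - before) / before


# Token budget so that 10 sequences of 8192 tokens fit in one batch.
batch_size = 10
seq_length = 8192
max_num_batched_tokens = batch_size * seq_length
print(max_num_batched_tokens)  # 81920

# QPS gain reported for the aclgraph path.
print(f"{relative_improvement(85, 104):.1%}")  # 22.4%
```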
Mengqing Cao	223cc34085	[KVCache] Refactor KVCache as page_size_bytes is ineffective (#3438 ) ### What this PR does / why we need it? Refactor KVCache as page_size_bytes is ineffective. 1. Currently the `AttentionSpec` is patched, but the `page_size_bytes` is still using that in vLLM in runtime, thus the patch is not working actually. Thus this pr removes the patch on `AttentionSpec`, and will do the final fix in vLLM. 2. Use `MLAAttentionSpec` instead of `FullAttentionSpec` to reduce `page_size_bytes` of spec, so that num_blocks in spec could double ### How was this patch tested? Test pass with Qwen3-Next and DeepSeek-V3.2-Exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-14 21:28:41 +08:00
linfeng-yuan	e4acb2dfc7	[feat] support customized and separated hccl_buffer_size for process group initialization (#3073 ) ### What this PR does / why we need it? Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to perform mc2 operators (dispatch and combine) while running moe models with large `ep_size` and `batch_size`. This environment variable not only affects allocated VRAM for the mc2 group, but also increases VRAM allocation for dp, tp & ep groups, leading to significant kvcache and free_memory drops. This PR automatically calculates and sets `hccl_buffer_size` for each process group (except the mc2 group) separately when users set `HCCL_BUFFSIZE` for the mc2 group. This can significantly reduce the buffer_size wasted on dp, tp & ep groups. Note that current mc2 operators can only perform communication space partitioning based on the `HCCL_BUFFSIZE` configuration. Once they support `hccl_buffer_size` configuration with `pg_options` while initializing the process group, we'll calculate the required buffer size and users will no longer need to set `HCCL_BUFFSIZE` themselves. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We performed E2E serving with deepseek_r1 initializing DP/TP/EP/MC2 process group and observed significant kv_cache and free_memory increase! - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-11 15:55:22 +08:00
Peipei	8c1a4dedf3	[Bugfix]modify the enable range of _merge_multimodal_embeddings patch (#3360 ) ### What this PR does / why we need it? Modify the enable range of the _merge_multimodal_embeddings patch. The current patch is only enabled for offline inference on the platform. For online serving, because of the added worker sub-process, it is not enabled within the sub-process. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: booker123456 <945658361@qq.com>	2025-10-11 08:37:07 +08:00
wangxiyuan	f12f76d7ba	Drop 0.10.2 (#3284 ) Drop v0.10.2 support, we support vLLM 0.11.0rc3 now. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-09 10:28:38 +08:00
wangxiyuan	81bd6e4c99	Add DeepSeek V3.2 support (#3270 ) ### What this PR does / why we need it? This PR adds initial DeepSeek V3.2 support with [vLLM v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0) (not released yet). We will complete the vLLM adaptation as soon as possible. This feature will be ready in the next 1-2 days. Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 . ### Does this PR introduce _any_ user-facing change? Yes! ### How was this patch tested? CI passed and Run deepseek doc soon. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wxsIcey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-30 03:25:58 +08:00
wangxiyuan	2930e4a6bd	[CI] Upgrade vllm to newest commit (#3182 ) ### What this PR does / why we need it? Upgrade vLLM to the newest commit - Fix aclgraph not working, caused by `24fab45d96` - Fix the PoolerOutput import error, caused by `755ed7b05b` - Fix the aclgraph weight load error, consistent with the torchair fix. `4492e3a554` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All tests should pass - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-26 06:18:15 +08:00
wangxiyuan	a055183821	[CI] Upgrade vLLM version (#3139 ) Upgrade vLLM version to the newest commit. - Fix the breaking change introduced by `969b4da3a6` - Add a patch to quickly fix torchair `de94289a98` - Fix the UT error introduced by `de94289a98` Close: https://github.com/vllm-project/vllm-ascend/issues/3138 - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-25 07:36:51 +08:00
Icey	e7618d9414	[2/N][Refactor][Qwen3-Next] remove redundant methods and patch methods in Qwen3NextGatedDeltaNet (#3082 ) ### What this PR does / why we need it? Remove redundant methods and patch methods in Qwen3NextGatedDeltaNet, involving causal_conv1d_fn, causal_conv1d_update_npu, fused_gdn_gating, fused_recurrent_gated_delta_rule, torch_chunk_gated_delta_rule, and RMSNormGated. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? ``` def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, enforce_eager=True, trust_remote_code=True, max_model_len=256, gpu_memory_utilization=0.7, block_size=64, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` CI passed with newly added and existing tests. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-24 11:25:42 +08:00
Li Wang	12bcbd02bb	[CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907 ) ### What this PR does / why we need it? 1. This PR bumps the vLLM commit to `6d8246aaff` 2. Fix the upstream changes from https://github.com/vllm-project/vllm/pull/24548 about multi-modal kwargs, keeping both vLLM main and `v0.10.2` compatible 3. Fix the metadata_builder changes introduced by https://github.com/vllm-project/vllm/pull/23693 4. Fix the `structured_outputs_config` changes introduced by https://github.com/vllm-project/vllm/pull/22772 5. Fix the `moe_config` changes introduced by https://github.com/vllm-project/vllm/pull/22537 Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> - vLLM version: v0.10.2 - vLLM main: `c60e6137f0` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-20 17:37:57 +08:00
Shanshan Shen	8326f15ecf	[CustomOp] Register AscendSharedFusedMoE custom op (#2980 ) ### What this PR does / why we need it? Register `AscendSharedFusedMoE` custom op. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `DeepSeek-V2-Lite` is a MoE model with shared experts. Test: ```bash vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \ --trust-remote-code \ --enforce-eager \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.95 curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite", "messages": [ {"role": "user", "content": "介绍一下联通公司？"} ], "stream": false, "max_tokens": 100 }' ``` Output: ```bash 中国联合网络通信集团有限公司（简称“中国联通”）于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成，在国内31个省（自治区、直辖市）和境外多个国家和地区设有分支机构，是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业，连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务，移动通信业务，国内 ``` - vLLM version: v0.10.2 - vLLM main: `486c5599e3` --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: shen-shanshan <467638484@qq.com>	2025-09-19 19:05:01 +08:00
linfeng-yuan	1c5900327b	[refactor] refactor deepseek-related files (#2849 ) ### What this PR does / why we need it? This PR deletes ~2K lines of code about deepseek modeling. It falls back CustomDeepseekV2 modules to original vllm implementations and adapts some modifications in vllm about deepseek and moe. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving with torchair graph mode and eager mode. - vLLM version: v0.10.2 - vLLM main: `759ef49b15` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 14:13:07 +08:00
wangxiyuan	7d6d9449a8	[Misc] Move lora patch file into lora module (#2797 ) Clean up a useless file in the patch module. Updating the LoRA support list inside vLLM Ascend is sufficient, so there is no need to patch vLLM. - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-08 21:42:12 +08:00
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces tensor model parallelism for the o_proj matrix to reduce memory consumption. It only supports graph mode in the pure DP scenario. In a deepseek r1 w8a8 PD-disaggregated Decode instance using pure DP with oproj_tensor_parallel_size = 8, TPOT increased by 1 ms and 5.8 GB of NPU memory was saved per rank. We got the best performance with oproj_tensor_parallel_size=4, with no TPOT increase. Performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| Default is None. Once this value is set, the feature is enabled; head num * head dim must be divisible by this value. \| Example: `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
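The constraint on `oproj_tensor_parallel_size` described in the commit above (head num * head dim must be divisible by the value, and `None` leaves the feature off) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `oproj_rows_per_rank` and the example sizes are hypothetical and are not taken from the vllm-ascend code base.

```python
from typing import Optional


def oproj_rows_per_rank(num_heads: int, head_dim: int,
                        oproj_tp_size: Optional[int]) -> Optional[int]:
    """Hypothetical helper: rows of o_proj held by each rank after the split.

    Returns None when the option is unset, mirroring the documented
    default of None (feature disabled).
    """
    if oproj_tp_size is None:
        return None
    if oproj_tp_size <= 0:
        raise ValueError("oproj_tensor_parallel_size must be a positive int")

    # The row dimension of o_proj is head num * head dim.
    rows = num_heads * head_dim
    if rows % oproj_tp_size != 0:
        raise ValueError(
            f"head num * head dim ({rows}) must be divisible by "
            f"oproj_tensor_parallel_size ({oproj_tp_size})")
    return rows // oproj_tp_size


if __name__ == "__main__":
    # Arbitrary example sizes, chosen only to exercise the check.
    print(oproj_rows_per_rank(128, 128, 8))
    print(oproj_rows_per_rank(128, 128, None))
```

A check like this would typically run once at configuration time, so that an incompatible value fails fast instead of surfacing later as a shape mismatch.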
yupeng	9f1e054fe3	[Bugfix][LoRA][Operator] Fix LoRA custom operators accuracy issue (#2672 ) ### What this PR does / why we need it? Fix the LoRA accuracy issue introduced by the custom AscendC operators "bgmv_shrink, sgmv_shrink, bgmv_expand, sgmv_expand". The bug details are: - In the kernel function, if you want to call the GlobalTensor.GetSize method, you must pass the second parameter, bufferSize, when you first call GlobalTensor.SetGlobalBuffer. - Otherwise, the GlobalTensor.GetSize method will return a random value. - You can refer to [this doc](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha002/apiref/ascendcopapi/atlasascendc_api_07_00024.html). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_ilama_lora.py pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py - vLLM version: v0.10.1.1 - vLLM main: `a344a5aa0a` --------- Signed-off-by: paulyu12 <paulyu0307@gmail.com> Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: paulyu12 <paulyu0307@gmail.com>	2025-09-02 11:46:59 +08:00

138 Commits