xc-llm-ascend

Author	SHA1	Message	Date
LuLina	2be0fe2691	[Feat] Add Euler xlite graph wrapper support (#4526 ) ### What this PR does / why we need it? This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations of the transformer network on Ascend hardware. For details about xlite, please refer to the following link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md The latest performance comparison data between xlite and the default aclgraph mode is as follows: ## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`c4a71fc6`) - xlite-full: main(`c4a71fc6`) + xlite-full - xlite-decode-only: main(`c4a71fc6`) + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph ### Does this PR introduce _any_ user-facing change? Enable the xlite graph mode by setting xlite_graph_config: --additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lulina <lina.lulina@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 08:27:46 +08:00
Yizhou	8fdb689a32	[BugFix] Refactor ACL graph size adjustment for speculative decoding (#4640 ) ### What this PR does / why we need it? Move the logic for adjusting ACL graph capture sizes for speculative decoding from the generic utility module into a dedicated method within the compilation configuration. This change improves code organization and encapsulation by making the compilation configuration responsible for managing its own state. The model runner now triggers this adjustment directly, providing the necessary context. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-07 17:32:45 +08:00
liziyu	688b1332da	[P/D] check kv extra config and del hccl backend (#4547 ) ### What this PR does / why we need it? check kv extra config & del hccl backend - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-07 15:19:42 +08:00
ZYang6263	b91a5f0968	Support DeepSeekV3.2 with MLAPO operator (#4753 ) ### What this PR does / why we need it? This PR adds support for the optimized MLAPO operator in DSV3.2 and this operator provides an optimized implementation that avoids redundant q_down recomputation. The operator implementation and optimizations were introduced in PR [#4707](https://github.com/vllm-project/vllm-ascend/pull/4707). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: ZYang6263 <zy626375@gmail.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-07 12:40:24 +08:00
AlvisGong	a5163c8c36	[Feat]enable sfa cp for dsv3.2 (#4702 ) ### What this PR does / why we need it? RFC: https://github.com/vllm-project/vllm/issues/30055 ### How was this patch tested? 1. enable flashcomm1 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 2. enable sfa-cp --additional-config '{ "enable_sfa_cp": true }' \ - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: AlvisGong <gwly0401@163.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: hwhaokun <haokun0405@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 19:46:41 +08:00
GuoRen868	4bd1030842	[Kernel] add custom op DispatchGmmCombineDecode (#4139 ) #### What this PR does / why we need it? Add the custom opapi DispatchGmmCombineDecode for A3, including the kernel implementation, Python API, and pytest. vLLM version: v0.11.0 vLLM main: `24d6314718` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com> Co-authored-by: wangqiankun <wangqiankun13@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:33:14 +08:00
zhaomingyu13	cb42564942	[BugFix] Fix eagle3 accuracy problem when enforce_eager=True (#4521 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3 }, enforce_eager=True, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:31:26 +08:00
Ronald	3480094d7c	support async mtp (#4511 ) ### What this PR does / why we need it? This PR adds async_scheduling support for MTP, following vLLM PR https://github.com/vllm-project/vllm/pull/24799, and fixes some synchronization problems in vllm-ascend. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:15:57 +08:00
Zhu Yi Lin	f067623afd	[Bugfix] fix mtp and eagle aclgraph bug (#4710 ) ### What this PR does / why we need it? fix mtp and eagle aclgraph bug - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 11:22:57 +08:00
h1074112368	74033999ed	mlapo add qdown output (#4707 ) ### What this PR does / why we need it? This PR adds support for a qdown output to the mlapo operation. ### Does this PR introduce _any_ user-facing change? The mlapo operation gains an enable_inner_out input. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: h1074112368 <h1074112368@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 11:18:53 +08:00
zzzzwwjj	8378f56f53	rm vanilla attn (#4558 ) ### What this PR does / why we need it? Remove unused vanilla attn code. ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 10:53:55 +08:00
Wang Yixuan	e0c5073956	[Bugfix]fix bmm_transpose ops for cann version (#4653 ) ### What this PR does / why we need it? After the CANN upgrade, this custom op cannot be used on newer CANN versions. On those versions the op starts with redundant vector cores although it only uses cube cores, which causes misalignment when copying data from UB memory to global memory. This PR restricts the op to cube cores only. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: hust17yixuan <303660421@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 10:52:46 +08:00
weijinqian0	a78f49ea57	[Refactor] 1/N Refactor attention_v1 & extract attention_cp (#4628 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The functions related to CP differ significantly from those of normal Attention, but the coupling is quite severe. Steps: Isolate PCP and DCP (1) Forward class extraction (100%) (2) Metadata coupling processing (3) Builder processing - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-06 09:33:28 +08:00
mazhixin000	3740b3edfc	【main】[Doc]add 2P1D instruction for single node (#4716 ) ### What this PR does / why we need it? Add the description for 2P1D, keeping it consistent with the content in the dev branch. ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: mazhixin000 <mazhixinkorea@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 18:35:18 +08:00
Li Wang	4b016b98a2	[CI] Fix unit test fault `no space left` (#4728 ) ### What this PR does / why we need it? Using an ARM-based github_hosted node to temporarily resolve `no space left` issues when installing vllm in UT. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-05 17:21:30 +08:00
wangxiaoteng888	41fbc5ebc9	[P/D][main] Clean connector history information (#4650 ) ### What this PR does / why we need it? Clean connector history information when the node restarts. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 16:22:23 +08:00
欧派果奶我还要	a336543977	[Bugfix] fix quant_apply_mlp w1_scale type error & fix getting num_local_expert (#4632 ) ### What this PR does / why we need it? Fix bugs introduced by `bc67696a02` 1. fix getting num_local_expert error in vllm_adaptor 2. fix w1_scale type error in moe_mlp.quant_apply_mlp.npu_dequant_swiglu_quant in w4a8 quantized scenario - vLLM version: v0.12.0 --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 16:04:24 +08:00
whx	a7f91079b8	[BugFix][Triton] Fix ub overflow bug of sample_recover_tokens_kernel (#4673 ) ### What this PR does / why we need it? The original `sample_recover_tokens_kernel` of the reject sampler didn't tile the vocab size dim, which causes a UB overflow for models with a large vocab size such as DeepSeek. This PR adds tiling to the vocab size dim to avoid this problem. Note that we currently just use an empirical `SUB_BLOCK_SIZE` of `4*1024` for functionality. If this kernel becomes a performance bottleneck in the future, we can use triton autotune to optimize it. In addition, we have to disable multibuffer for this kernel due to some accuracy issues. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: whx-sjtu <2952154980@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-05 15:16:19 +08:00
Chen Chen	7f33838e6e	Update comment doc (#4731 ) ### What this PR does / why we need it? Translate remaining Chinese comments in the `dispatch_ffn_combine` code to English and update the installation guide to remind users to initialize submodules when building from source. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: mojave2 <chenchen145@huawei.com> Signed-off-by: Chen Chen <0109chenchen@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 15:07:31 +08:00
LookAround0301	b32ef53b3b	[long_seq] remove long_seq env (#4660 ) ### What this PR does / why we need it? remove env VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL - vLLM version: v0.12.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 10:31:49 +08:00
wangxiyuan	ea54388e19	Drop ascend scheduler (#4623 ) It's safe to drop ascend scheduler now. The related test and doc has been removed already - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 09:03:45 +08:00
wangxiyuan	00b4fb80de	[Doc] Update vLLM version in doc (#4691 ) Correct vLLM version in doc - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-05 08:59:41 +08:00
Li Wang	cd8e5be7c7	[Bugfix] Quick hot fix for nightly CI (#4727 ) Quick fix for multi-node tests --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-04 23:51:16 +08:00
Chen Chen	ad0607f900	add `dispatch_gmm_combine` kernel (#3532 ) ### What this PR does / why we need it? This PR introduces the Ascend implementation of the `dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime, together with follow‑up fixes to ensure the kernel builds and runs correctly in CI. - Add full host and device implementation of the `dispatch_ffn_combine` kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE routing helpers, and kernel utilities for quantized FFN dispatch. - Integrate the new kernel with the PyTorch binding (csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend runtime (vllm_ascend/ascend_forward_context.py, vllm_ascend/worker/model_runner_v1.py). - Extend fused MoE communication and token dispatch support in `vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new dispatch path. - Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py to support the new FFN dispatch flow. - Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake configuration, and include/namespace usage in the new kernel files. - Add an end‑to‑end nightly test `tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper utilities in `vllm_ascend/utils.py` to validate the new kernel. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 --------- Signed-off-by: mojave2 <chenchen145@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-04 23:00:59 +08:00
Li Wang	752a55473c	[Misc] Upgrade vllm vllm commit to 2025_12_04 (#4690 ) ### What this PR does / why we need it? As title shows, upgrade vllm commit hash to `ad32e3e` - vLLM version: v0.12.0 --------- Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-04 22:31:45 +08:00
Li Wang	283bc5c7ba	[Nightly] Optimize nightly CI (#4509 ) ### What this PR does / why we need it? 1. Optimize multi-node waiting logic 2. Remove the `tee` pipeline for logs, which will lead to hang issue ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-04 22:31:07 +08:00
Shanshan Shen	fb15fec662	[MM][Patch] Remove patch for cos/sin cache (#4672 ) ### What this PR does / why we need it? Remove patch for https://github.com/vllm-project/vllm/pull/28798. - vLLM version: v0.12.0 Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-04 22:30:06 +08:00
1092626063	b3e1377a92	【fix】ops gatingtopk fix nightly ci error (#4340 ) ### What this PR does / why we need it? This pr https://github.com/vllm-project/vllm-ascend/pull/2958 is supporting gatingtopk operator generalization, but caused nightly ci error. Now we add check logits for ops gatingtopk, and fix nightly ci. - vLLM version: v0.12.0 Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-04 20:09:21 +08:00
wangxiyuan	da84eb2f40	Remove ascend scheduler ut (#4684 ) ### What this PR does / why we need it? 1. Remove ascend scheduler ut 2. Remove models ut 3. move mla to ops 4. skip the failed ut - vLLM version: v0.12.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-04 14:10:28 +08:00
Icey	178ca1607e	Adopt inductor fusion and define quantization fusion pass (#4168 ) ### What this PR does / why we need it? The main goal of this PR is to alleviate the high maintenance burden from model duplication when optimizing models. Some of our optimized models diverge only a little from vLLM's modeling but require rewriting several parts of the original, which brings a non-negligible maintenance burden to vllm-ascend. To solve that, we propose to leverage `torch.compile` and the `inductor pattern matcher` to automatically fuse the patterns we want to merge. For more details, refer to the RFC https://github.com/vllm-project/vllm-ascend/issues/4239 This PR integrates `AddRMSNorm` and the `Quant` operator, which can improve the inference speed of models using `w8a8` quantization. ### Does this PR introduce _any_ user-facing change? Yes, add new additional_config ### How was this patch tested? ```python def main(): prompts = [ "The president of the United States is Mr.", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="/root/.cache/modelscope/hub/models/vllm-ascend/Qwen3-8B-W8A8", # enforce_eager=True, tensor_parallel_size=1, trust_remote_code=True, gpu_memory_utilization=0.7, quantization="ascend", ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` ```text Prompt: 'The president of the United States is Mr.', Generated text: ' Trump. The president of the United States is Mr. Biden. Which of the following statements is correct? \n\nA. Mr. Trump is Mr. Biden. \nB. Mr. Trump is not Mr. Biden. \nC. The president of the United States is not Mr. Trump. \nD. The president of the United States is not Mr. Biden.\n\nThe question presents a contradiction: it states that "The president of the United States is Mr. Trump" and "The president of' ``` - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-04 10:29:48 +08:00
Yikun Jiang	c4a71fc6d5	Remove cancel for main to main check (#4685 ) ### What this PR does / why we need it? Remove cancel for main to main check. Another choice we set timeout to 4h but I think 2h to get results is more important. Related: https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_test_full_vllm_main.yaml ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Can be merged directly if lint passed - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-04 09:10:27 +08:00
wangxiyuan	3f4c0ea0a0	upgrade vLLM to 0.12.0 tag (#4647 ) Upgrade vLLM to v0.12.0 tag - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-03 23:43:05 +08:00
amy-why-3459	26e8e58cea	[Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#4176 ) ### What this PR does / why we need it? Support Encoder separation for Encode-Prefill-Decode Disaggregation - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>	2025-12-03 20:48:45 +08:00
wangxiyuan	6ece6660ec	fix custom ops env set error (#4675 ) Move Custom ops register to correct place to make CI happy - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-03 19:27:38 +08:00
Shanshan Shen	a1c0667392	[Misc] Add cann custom ops to `.gitignore` (#4670 ) ### What this PR does / why we need it? Add `CANN-custom_ops--linux.aarch64.run` (generated during installing vllm-ascend) to `.gitignore`. - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-12-03 18:29:10 +08:00
Icey	6ac5730640	[CI] Fix ut ci: no space on the device (#4662 ) ### What this PR does / why we need it? The current ut ci is encountering a disk space shortage issue, see https://github.com/vllm-project/vllm-ascend/actions/runs/19884417749/job/56990694594?pr=4168 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-03 17:35:06 +08:00
XiaoxinWang	15dc01f050	[Fix] Fix FIA `query` and `query_start_loc` shape mismatch error (#4518 ) ### What this PR does / why we need it? The FIA operator requires query.shape[0] to match actual_seq_len[-1], but in graph mode and multi-DP scenarios the query is padded to the size of num_input_token. This leads to validation errors during tiling in the operator. However, since the padding is applied at the end of the query, it does not affect the actual execution result of the operator, and the precision remains unaffected. Our modification pads both actual_seq_len and actual_seq_len_kv to resolve the validation issue in the operator. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-03 17:33:31 +08:00
ZYang6263	7271f0d536	[Feat] MTP support DeepSeekV3.2 (#4465 ) ### What this PR does / why we need it? Currently, MTP does not support the DeepSeekV3.2 model. In this PR, we have enabled this feature. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-12-03 14:24:33 +08:00
LeeWenquan	38bd95229f	[Model] Add qwen3Next support in Main (#4596 ) ### What this PR does / why we need it? Add Qwen3Next support in main ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-12-03 14:17:37 +08:00
wangxiyuan	3f81c4bb25	fix typo (#4657 ) typo fix for release title - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-03 11:56:47 +08:00
wangxiyuan	9a73c22b1c	[Doc] add release note for v0.11.0rc3 (#4646 ) Add release note for 0.11.0rc3. We'll release it today. - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-03 11:49:44 +08:00
Song Mingyang	18b90b501d	[kernel] add AscendC op: lightning_indexer and sparse_flash_attention (#4625 ) ### What this PR does / why we need it? Provide high-performance AscendC operators lightning_indexer and sparse_flash_attention to boost the execution performance of the DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to vllm-ascend framework. ### Does this PR introduce _any_ user-facing change? No (only underlying operator optimizations, with no user-facing changes) ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: MingYang119 <songmingyang@huawei.com>	2025-12-03 09:53:10 +08:00
wangxiyuan	7f2673ea2d	upgrade vLLM to main (#4608 ) 1. fix https://github.com/vllm-project/vllm/pull/28542 The model structure modifications we are involved in are: - Qwen2.5-VL (some patches still exist) - Qwen2-VL - Qwen2 - DeepSeek series - Qwen-moe series 2. fix https://github.com/vllm-project/vllm/pull/29121 the output token type has changed from np to `list[list[int]]` 3. fix https://github.com/vllm-project/vllm/pull/29262 the `xformers` backend for multimodal has now been deprecated 4. fix https://github.com/vllm-project/vllm/pull/29342 5. fix https://github.com/vllm-project/vllm/pull/28579 6. fix https://github.com/vllm-project/vllm/pull/28718 7. fix https://github.com/vllm-project/vllm/issues/28665 8. fix https://github.com/vllm-project/vllm/pull/26847 vllm introduced the `optimization-level`, some default configs have been changed, and the param `--enforce-eager` has been deprecated 9. fix http://github.com/vllm-project/vllm/pull/29223 it returns a tuple for the sampler. 10. fix https://github.com/vllm-project/vllm/pull/29471 we'll remove the related patch to avoid this kind of error. Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-12-02 22:10:52 +08:00
Chenxi Qian	4588cdac02	[Bugfix] fix custom op GmmSwigluQuantWeightNzTensorList (#4593 ) ### What this PR does / why we need it? 1. Fixes the environment path used to locate custom op shared libraries. 2. Uses empty tensor initialization for op outputs instead of zero-initialization for better efficiency. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>	2025-12-02 22:02:04 +08:00
1092626063	b84c9afbf5	【doc fix】doc fix: deepseekv3.1 (#4645 ) ### What this PR does / why we need it? Fix the deepseekv3.1 doc to recommend that developers use Mooncake instead of LLMDatadist ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: AiChiMomo <1092626063@qq.com>	2025-12-02 21:49:13 +08:00
FuNanyang	1b5513aa91	[performance] Enhance performance after enabling min_p (#4529 ) ### What this PR does / why we need it? When min_p post-processing parameters are enabled, the original vllm implementation introduces the aclnInIndexPutImpl operator, which performs poorly on NPU. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Profiling was collected with min_p enabled; performance improved greatly. - vLLM version: v0.11.2 --------- Signed-off-by: funanyang <985619145@qq.com>	2025-12-02 20:35:51 +08:00
1092626063	eabedf43aa	[Doc] Refactor the DeepSeek-V3.1 tutorial. (#4399 ) ### What this PR does / why we need it? Refactor the DeepSeek-V3.1 tutorial. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-02 18:46:30 +08:00
wangxiyuan	874097a1de	clean up model module (#4611 ) Model module is useless now. Let's remove it totally. - vLLM version: v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-02 17:35:47 +08:00
whx	96b2cdf6d8	[Ops][Triton] Add a triton kernel supporting partial rope. (#4413 ) ### What this PR does / why we need it? This PR adds a triton rope kernel which supports scenarios of `rope_dim != head_dim`. This can save the split op before rope and the concat op after rope. Profiling shows improvement. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? I will add related ut after CI is integrated with triton. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-12-02 17:10:19 +08:00
yeyifan	8907010815	[Doc] Add tutorial for Qwen3-Coder-30B-A3B (#4391 ) ### What this PR does / why we need it? Add tutorial for Qwen3-Coder-30B-A3B - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: nsdie <yeyifan@huawei.com> Signed-off-by: herizhen <you@example.com> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: weijinqian0 <1184188277@qq.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: herizhen <you@example.com> Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: XiaoxinWang <963372609@qq.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-02 16:03:37 +08:00
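
Several entries above (#4526 for xlite, #4702 for sfa-cp) enable features through a JSON string passed to `--additional-config`. The sketch below shows one way to build those strings without hand-quoting. The helper name `additional_config_flag` is hypothetical and not part of vllm-ascend; only the config keys themselves (`xlite_graph_config`, `enabled`, `full_mode`, `enable_sfa_cp`) are taken from the commit messages.

```python
import json
import shlex


def additional_config_flag(config: dict) -> str:
    """Render a dict as a shell-quoted --additional-config argument.

    Hypothetical helper for illustration; it only formats a string.
    """
    return "--additional-config=" + shlex.quote(json.dumps(config))


# xlite graph mode for decode only, as described in #4526.
xlite_decode_only = {"xlite_graph_config": {"enabled": True}}
# xlite graph mode for prefill and decode, as described in #4526.
xlite_full = {"xlite_graph_config": {"enabled": True, "full_mode": True}}
# sfa-cp switch, as described in #4702.
sfa_cp = {"enable_sfa_cp": True}

for cfg in (xlite_decode_only, xlite_full, sfa_cp):
    print(additional_config_flag(cfg))
```

Serializing with `json.dumps` keeps the booleans as lowercase `true`, matching the form shown in the commit messages, and `shlex.quote` wraps the result in single quotes so the shell passes it through unchanged.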

1528 Commits