xc-llm-ascend

Author	SHA1	Message	Date
weiguihua2	49e346c6a6	[UT]add pcp aclgraph ut (#4804 ) ### What this PR does / why we need it? add pcp aclgraph ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-09 17:27:40 +08:00
Wang Yixuan	c68dfa70ac	[Bugfix]fix bmm_transpose ops in dsv32 (#4791 ) ### What this PR does / why we need it? bmm transpose ops can't be used in cp, so add judgement in the modeling ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-09 16:55:09 +08:00
Li Wang	c8b671c498	[CI] Increase HCCL_BUFFSIZE for A3 (#4838 ) ### What this PR does / why we need it? Unified increase HCCL_BUFFSIZE for A3 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-09 16:39:50 +08:00
wangqiankun13	9567e5dd8c	[kernel] Adapt DispatchGmmCombineDecode operator to parameters of small operators (#4790 ) ### What this PR does / why we need it? This PR adapts the DispatchGmmCombineDecode operator to the parameters of small operators. 1. This operator no longer requires permuting the weights and scales of GMM1. 2. This operator no longer requires transposing the weights of GMM2. Therefore, this operator and the small operator can use the same parameters (weights and scales), which is beneficial for model adaptation. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2025-12-09 16:17:06 +08:00
dsxsteven	9a885d08d0	[Feat] Multi-stream for eplb heat collection and aggregation (#4214 ) ### What this PR does / why we need it? This PR optimizes multistream for eplb heat collection and aggregation - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 --------- Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-09 16:16:55 +08:00
baxingpiaochong	dda027e680	[KVPOOl]Support pp (#4761 ) ### What this PR does / why we need it? Support pp for kv pool - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-12-09 16:15:26 +08:00
Li Wang	9038865261	[CI] Optimize CI time (#4821 ) ### What this PR does / why we need it? Considering that long queues severely impact the developer experience, we have decided to make the following changes: 1. Changes will use the self_hosted runner 2. e2e-2card will use the A3 node. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-09 16:09:37 +08:00
linfeng-yuan	56f01820e8	[Docs]fix the configuration conflicts in documentation (#4823 ) ### What this PR does / why we need it? Fix configuration error in our documentations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? NA. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-09 15:37:38 +08:00
Li Wang	1c70f5c922	[CI] Skip `test_suffix_correctness` (#4820 ) ### What this PR does / why we need it? Currently, suffix decoding has a known correctness issue; see https://github.com/vllm-project/vllm-ascend/actions/runs/20033509824/job/57457565620?pr=4781 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-09 11:48:13 +08:00
Canlin Guo	2b819bb35b	[Bugfix] Add the check for a null VllmConfig (#4749 ) ### What this PR does / why we need it? In vllm-omni, we create the empty `VllmConfig`, which raised the null error in [`vllm-ascend/vllm_ascend/utils.py`](`a7f91079b8/vllm_ascend/utils.py (L833)`). More details are [here](https://github.com/vllm-project/vllm-omni/issues/208). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-12-09 09:21:17 +08:00
Mengqing Cao	7e70da9fb7	Revert "[Kernel] add custom moe ops for prefill" (#4806 ) Reverts vllm-project/vllm-ascend#4194 as it broke CI in https://github.com/vllm-project/vllm-ascend/actions/runs/20030369087/job/57437687382?pr=4791 Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 23:20:32 +08:00
ZYang6263	432b861cae	Fix incorrect MLAPO weight release in PD mixex scenarios. (#4774 ) ### What this PR does / why we need it? Fix incorrect MLAPO weight release in PD mixed scenarios. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: ZYang6263 <zy626375@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 23:17:45 +08:00
lhp-deep	b230e7e987	[MOE]move weight transpose to wakeup for RL secnarios (#4626 ) ### What this PR does / why we need it? In reinforcement learning scenarios, the current inference applies a transpose operation to the weights. For a cleaner architecture, the weight transpose module was moved to wakeup. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lhp-deep <liuhaopeng1@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-08 20:34:52 +08:00
Mengqing Cao	58db21f56a	[DP] Fix dp padding logic in dummyrun (#4705 ) ### What this PR does / why we need it? Fix dp padding logic in dummyrun. After https://github.com/vllm-project/vllm/pull/28579, `num_tokens` will be padded in `CudagraphDispatcher`, thus we also need to do the pad in the dummy_run. ### How was this patch tested? Test locally with the following scripts ```bash VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server \ --model wemaster/deepseek_mtp_main_random_bf16 \ --trust-remote-code \ --data-parallel-size 4 \ --tensor-parallel-size 1 \ --compilation-config '{"cudagraph_capture_sizes":[96],"cudagraph_mode":"FULL_DECODE_ONLY"}' \ --enable-expert-parallel ``` ```bash vllm bench serve --model wemaster/deepseek_mtp_main_random_bf16 --endpoint /v1/completions --dataset-name random --random-input 512 --random-output 100 --num-prompts 48 --request-rate 1 --ready-check-timeout-sec 0 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-08 20:32:35 +08:00
xuyexiong	193dc1703f	[Doc] Add Qwen3-235B tutorial (#4358 ) ### What this PR does / why we need it? Add Qwen3-235B tutorial including the following examples - Single-node Online Deployment for 128k context inference - Multi-node Deployment with MP - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 20:06:46 +08:00
shaopeng-666	9766cf9128	fix qwen3vl mrope op (#4484 ) ### What this PR does / why we need it? The Qwen2.5-VL mrope precision problem will be solved once this PR is merged. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on G8600 with the textVQA dataset - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 19:19:17 +08:00
dependabot[bot]	3c3c9a5386	Bump actions/checkout from 6.0.0 to 6.0.1 (#4772 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.0 to 6.0.1. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 19:15:40 +08:00
shiro-zzzz	0617d7d394	[Kernel] add custom moe ops for prefill (#4194 ) ### What this PR does / why we need it? 1. Add the implementation of normal Aclnn operators: MoeCombineNormal, MoeDispatchNormal, NotifyDispatch, and DispatchLayout. - MoeCombineNormal: Implements the combine logic within MoE operations. - MoeDispatchNormal: Implements the dispatch logic within MoE operations. - NotifyDispatch: Exchanges topk_idx information among different ranks to calculate the device memory required for the dispatch stage. - DispatchLayout: Used to calculate information related to the device memory layout for the dispatch stage. 2. Provide PyTorch interfaces for normal operators (get_dispatch_layout, dispatch_prefill, and combine_prefill) to be used for MoE communication during the prefill stage in vLLM. - get_dispatch_layout: Calculates information related to the device memory layout for the dispatch operator, and is called before dispatch_prefill. - dispatch_prefill: Initiates the dispatch operation. - combine_prefill: Initiates the combine operation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The functionality has already been validated using the local Qwen model. Test cases will be added after support for multi-NPU use cases in the CI pipeline is finalized. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: shiro-zzzz <zhangdianhao@huawei.com>	2025-12-08 19:11:58 +08:00
zengzengran	f0876b5d88	[Bugfix] Fix Dcp dimension mismatch when enable Mlapo (#4687 ) ### What this PR does / why we need it? When both Mlapo and DCP are enabled, Mlapo uses its own mla_preprocess logic and does not perform the additional all_gather operations on the DCP group, which leads to a dimension mismatch during the subsequent forward process. ### Does this PR introduce _any_ user-facing change? N/A - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zengran <zengran2@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 17:19:58 +08:00
LuLina	afe00505de	[Fix] skip xlite e2e test (#4786 ) ### What this PR does / why we need it? Due to the differences in operators used and execution order between xlite and eager modes, there will be slight precision discrepancies. This patch skips the xlite e2e tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.12.0 vLLM main: `ad32e3e19c` Signed-off-by: lulina <lina.lulina@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 16:48:15 +08:00
dsxsteven	96ea0e078f	[EPLB] Add log Info for moe_load Imbalance Ratio (#4482 ) ### What this PR does / why we need it? Add log Info for MOE_load Imbalance Ratio ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 --------- Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-08 14:28:13 +08:00
ZYang6263	a433f3280a	[Op] DeepSeekV3.2 support bmm_transpose operator (#4631 ) ### What this PR does / why we need it? DeepSeekV3.2 support bmm_transpose operator. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: ZYang6263 <zy626375@gmail.com> Signed-off-by: ZYang6263 <50876451+ZYang6263@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 14:03:38 +08:00
wangxiyuan	0b65ac6c4b	remove useless patch (#4699 ) patach_config is useless now. Let's remove it - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-08 11:02:42 +08:00
zzhxxx	866347a621	Deepseek Mtp model uses the lm_head and embedding from the main model (#2790 ) ### What this PR does / why we need it? In the Deepseek technical report, it is mentioned that the embedding and lmhead layers of the MTP layer are shared with the main model, but the current implementation independently loads the complete embedding and lmhead. In the Deepseek-R1 model, their weight sizes are 129280*7168 in fp16 format, which is 1.72G. This PR fixes the MTP layer to use the lmhead and embedding of the main model, saving 3.45G of GPU memory in the pure DP scenario. The current process will first create temporary spaces for the embedding and lmhead in the mtp layer, then I will call torch.equal to determine if the two matrices are the same. If they are the same, they will be reused, and the previous tensor will be released. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 10:33:29 +08:00
fluctlux	9fbcfa36af	[CI] Fix ngram & suffix test oom (#4755 ) ### What this PR does / why we need it? Avoid oom during CI by using `with VllmRunner` instead of `LLM()`, and enable `test_ngram_correctness` ### How was this patch tested? CI passed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 09:26:29 +08:00
Ronald	916a9a1913	fix synchronize error of exceeds_max_model_len d2h copy (#4708 ) ### What this PR does / why we need it? A d2h copy in the MTP propose method blocks CPU operations, which causes a host-bound issue. This PR refactors it to use a CPU tensor instead. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vllm main f5d3d93c40417c296c20dc301100e55708a17f3f - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 09:07:59 +08:00
LuLina	2be0fe2691	[Feat] Add Euler xlite graph wrapper support (#4526 ) ### What this PR does / why we need it? This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations of the transformer network on Ascend hardware. For details about xlite, please refer to the following link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md The latest performance comparison data between xlite and the default aclgraph mode is as follows: ## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`c4a71fc6`) - xlite-full: main(`c4a71fc6`) + xlite-full - xlite-decode-only: main(`c4a71fc6`) + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph ### Does this PR introduce _any_ user-facing change? Enable the xlite graph mode by setting xlite_graph_config: --additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lulina <lina.lulina@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 08:27:46 +08:00
Yizhou	8fdb689a32	[BugFix] Refactor ACL graph size adjustment for speculative decoding (#4640 ) ### What this PR does / why we need it? Move the logic for adjusting ACL graph capture sizes for speculative decoding from the generic utility module into a dedicated method within the compilation configuration. This change improves code organization and encapsulation by making the compilation configuration responsible for managing its own state. The model runner now triggers this adjustment directly, providing the necessary context. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-07 17:32:45 +08:00
liziyu	688b1332da	[P/D] check kv extra config and del hccl backend (#4547 ) ### What this PR does / why we need it? check kv extra config & del hccl backend - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-07 15:19:42 +08:00
ZYang6263	b91a5f0968	Support DeepSeekV3.2 with MLAPO operator (#4753 ) ### What this PR does / why we need it? This PR adds support for the optimized MLAPO operator in DSV3.2 and this operator provides an optimized implementation that avoids redundant q_down recomputation. The operator implementation and optimizations were introduced in PR [#4707](https://github.com/vllm-project/vllm-ascend/pull/4707). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: ZYang6263 <zy626375@gmail.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-07 12:40:24 +08:00
AlvisGong	a5163c8c36	[Feat]enable sfa cp for dsv3.2 (#4702 ) ### What this PR does / why we need it? RFC: https://github.com/vllm-project/vllm/issues/30055 ### How was this patch tested? 1. enable flashcomm1: export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 2. enable sfa-cp: --additional-config '{ "enable_sfa_cp": true }' - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: AlvisGong <gwly0401@163.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: hwhaokun <haokun0405@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 19:46:41 +08:00
GuoRen868	4bd1030842	[Kernel] add custom op DispatchGmmCombineDecode (#4139 ) #### What this PR does / why we need it? Add custom op API DispatchGmmCombineDecode for A3, including the kernel implementation, Python API, and pytest. vLLM version: v0.11.0 vLLM main: `24d6314718` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com> Co-authored-by: wangqiankun <wangqiankun13@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:33:14 +08:00
zhaomingyu13	cb42564942	[BugFix] Fix eagle3 accuracy problem when enforce_eager=True (#4521 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3 }, enforce_eager=True, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:31:26 +08:00
Ronald	3480094d7c	support async mtp (#4511 ) ### What this PR does / why we need it? This PR adds async_scheduling support for MTP, following vLLM PR https://github.com/vllm-project/vllm/pull/24799 , and fixes some synchronization problems in vllm-ascend. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:15:57 +08:00
Zhu Yi Lin	f067623afd	[Bugfix] fix mtp and eagle aclgraph bug (#4710 ) ### What this PR does / why we need it? fix mtp and eagle aclgraph bug - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 11:22:57 +08:00
h1074112368	74033999ed	mlapo add qdown output (#4707 ) ### What this PR does / why we need it? This PR adds mlapo operation support qdown of output. ### Does this PR introduce _any_ user-facing change? mlapo operation add enable_inner_out of input ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: h1074112368 <h1074112368@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 11:18:53 +08:00
zzzzwwjj	8378f56f53	rm vanilla attn (#4558 ) ### What this PR does / why we need it? Remove unused vanilla attn code. ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 10:53:55 +08:00
Wang Yixuan	e0c5073956	[Bugfix]fix bmm_transpose ops for cann version (#4653 ) ### What this PR does / why we need it? After the CANN version upgrade, this custom op cannot be used on newer versions. On newer CANN versions, the op starts with redundant vector cores although it only uses cube cores, which causes misalignment when copying data from UB memory to global memory. This PR therefore restricts the op to cube cores only. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: hust17yixuan <303660421@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 10:52:46 +08:00
weijinqian0	a78f49ea57	[Refactor] 1/N Refactor attention_v1 & extract attention_cp (#4628 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The functions related to CP differ significantly from those of normal attention, but the two are tightly coupled. Steps: Isolate PCP and DCP (1) Forward class extraction (100%) (2) Metadata coupling processing (3) Builder processing - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-06 09:33:28 +08:00
mazhixin000	3740b3edfc	【main】[Doc]add 2P1D instruction for single node (#4716 ) ### What this PR does / why we need it? Add the description for 2P1D, keeping it consistent with the content in the dev branch. ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: mazhixin000 <mazhixinkorea@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 18:35:18 +08:00
Li Wang	4b016b98a2	[CI] Fix unit test fault `no space left` (#4728 ) ### What this PR does / why we need it? Using an ARM-based github_hosted node to temporarily resolve `no space left` issues when installing vllm in UT. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-05 17:21:30 +08:00
wangxiaoteng888	41fbc5ebc9	[P/D][main] Clean connector history information (#4650 ) ### What this PR does / why we need it? Clean connector history information when the node restarts. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 16:22:23 +08:00
欧派果奶我还要	a336543977	[Bugifx] fix quant_apply_mlp w1_scale type error & fix getting num_local_expert (#4632 ) ### What this PR does / why we need it? Fix bugs introduced by `bc67696a02`: 1. fix the error when getting num_local_expert in vllm_adaptor 2. fix the w1_scale type error in moe_mlp.quant_apply_mlp.npu_dequant_swiglu_quant in the w4a8 quantized scenario - vLLM version: v0.12.0 --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 16:04:24 +08:00
whx	a7f91079b8	[BugFix][Triton] Fix ub overflow bug of sample_recover_tokens_kernel (#4673 ) ### What this PR does / why we need it? The original `sample_recover_tokens_kernel` of the reject sampler didn't tile the vocab size dim, which causes a ub overflow problem for models with a large vocab size such as deepseek. This PR adds tiling to the vocab size dim to avoid this problem. Note that currently we just use an empirical `SUB_BLOCK_SIZE` of `4*1024` for functionality. If this kernel becomes a performance bottleneck in the future, we can use triton autotune to optimize it. In addition, we have to disable multibuffer for this kernel due to some accuracy issues. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: whx-sjtu <2952154980@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-05 15:16:19 +08:00
Chen Chen	7f33838e6e	Update comment doc (#4731 ) ### What this PR does / why we need it? Translate remaining Chinese comments in the `dispatch_ffn_combine` code to English and update the installation guide to remind users to initialize submodules when building from source. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: mojave2 <chenchen145@huawei.com> Signed-off-by: Chen Chen <0109chenchen@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 15:07:31 +08:00
LookAround0301	b32ef53b3b	[long_seq] remove long_seq env (#4660 ) ### What this PR does / why we need it? remove env VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL - vLLM version: v0.12.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 10:31:49 +08:00
wangxiyuan	ea54388e19	Drop ascend scheduler (#4623 ) It's safe to drop ascend scheduler now. The related test and doc has been removed already - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 09:03:45 +08:00
wangxiyuan	00b4fb80de	[Doc] Update vLLM version in doc (#4691 ) Correct vLLM version in doc - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-05 08:59:41 +08:00
Li Wang	cd8e5be7c7	[Bugfix] Quick hot fix for nightly CI (#4727 ) Quick fix for multi-node tests --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-04 23:51:16 +08:00
Chen Chen	ad0607f900	add `dispatch_gmm_combine` kernel (#3532 ) ### What this PR does / why we need it? This PR introduces the Ascend implementation of the `dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime, together with follow‑up fixes to ensure the kernel builds and runs correctly in CI. - Add full host and device implementation of the `dispatch_ffn_combine` kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE routing helpers, and kernel utilities for quantized FFN dispatch. - Integrate the new kernel with the PyTorch binding (csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend runtime (vllm_ascend/ascend_forward_context.py, vllm_ascend/worker/model_runner_v1.py). - Extend fused MoE communication and token dispatch support in `vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new dispatch path. - Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py to support the new FFN dispatch flow. - Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake configuration, and include/namespace usage in the new kernel files. - Add an end‑to‑end nightly test `tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper utilities in `vllm_ascend/utils.py` to validate the new kernel. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 --------- Signed-off-by: mojave2 <chenchen145@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-04 23:00:59 +08:00
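
A note on the memory figures quoted in commit 866347a621 (#2790) above: the message states that the embedding and lm_head weights are each 129280*7168 in fp16 (about 1.72G) and that sharing them with the main model saves about 3.45G. The minimal sketch below recomputes those numbers. It assumes that "G" means GiB (2**30 bytes) and that fp16 takes 2 bytes per element; neither is stated in the commit, and every name in the snippet is illustrative rather than taken from the vllm-ascend code.

```python
# Back-of-the-envelope check of the sizes quoted in commit 866347a621.
# Assumptions (not stated in the commit): "G" means GiB (2**30 bytes) and
# fp16 uses 2 bytes per element. All names below are illustrative only.

VOCAB_SIZE = 129280   # rows, as quoted in the commit message
HIDDEN_SIZE = 7168    # columns, as quoted in the commit message
BYTES_PER_FP16 = 2


def tensor_gib(rows: int, cols: int, bytes_per_elem: int) -> float:
    """Return the size of a dense rows x cols tensor in GiB."""
    return rows * cols * bytes_per_elem / 2**30


# One weight matrix (either the embedding or the lm_head).
per_tensor_gib = tensor_gib(VOCAB_SIZE, HIDDEN_SIZE, BYTES_PER_FP16)

# The MTP layer no longer keeps its own copy of either matrix.
saved_gib = 2 * per_tensor_gib

print(f"per tensor: {per_tensor_gib:.3f} GiB")
print(f"saved by sharing both: {saved_gib:.3f} GiB")
```

Under these assumptions the result agrees with the commit's 1.72G and 3.45G to within rounding of the last digit.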

1554 Commits