xc-llm-ascend

Author	SHA1	Message	Date
herizhen	8c87a3b053	Change the first letter to uppercase (#4375 ) ### What this PR does / why we need it? The first letter of the English title should be capitalized ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: herizhen <you@example.com> Co-authored-by: herizhen <you@example.com>	2025-11-24 12:18:24 +08:00
Li Wang	b5f7a83927	[Doc] Upgrade multi-node doc (#4365 ) ### What this PR does / why we need it? When using the `Ascend scheduler`, the parameter `max_num_batched_tokens` must not be smaller than `max_model_len`; otherwise, the following error is raised: ```shell Value error, Ascend scheduler is enabled without chunked prefill feature. Argument max_num_batched_tokens (4096) is smaller than max_model_len (32768). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'model_co...g': {'enabled': True}}}), input_type=ArgsKwargs] ``` ### Does this PR introduce _any_ user-facing change? Users and developers who run the model according to the [tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node.html) will now have the parameters specified correctly. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-24 10:57:50 +08:00
Li Wang	b34f195cc8	[CI] Fix nightly CI for A2 series (#3825 ) ### What this PR does / why we need it? For multi-node CI system, we need to ensure that cluster resources meet the expected specifications before conducting multi-node interoperability tests. Otherwise, unexpected errors may occur (for example, we might mistakenly assume all nodes are ready and perform a global cluster IP acquisition, which would cause an exception to be thrown in Python because some nodes might not actually be ready at that point). Therefore, we need to wait at the workflow level until all resources meet the expected specifications. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-23 23:05:33 +08:00
mazhixin000	ab51fcea4c	[Doc]Add single node PD disaggregation instructions (#4337 ) ### What this PR does / why we need it? add single node PD disaggregation instructions for Qwen 2.5VL model. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: mazhixin <mazhixin7@huawei.com> Signed-off-by: mazhixin000 <mazhixinkorea@163.com> Co-authored-by: mazhixin <mazhixin7@huawei.com>	2025-11-22 23:33:07 +08:00
pz1116	ea3372fb0c	[Bugfix][KV Pool]fix get_ip import in mooncake_store (#4355 ) ### What this PR does / why we need it? fix import error for get_ip() in vllm main branch ### Does this PR introduce _any_ user-facing change? N ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: pz1116 <zpbzpb123123@gmail.com>	2025-11-22 18:52:48 +08:00
Angazenn	9b3a484b46	[BugFix] Fix some issues caused by the ascending order of cudagraph_capture_sizes (#4338 ) ### What this PR does / why we need it? In [#26016](https://github.com/vllm-project/vllm/pull/26016), vLLM changed `cudagraph_capture_sizes` to be in ascending order. This PR fixes the issues caused by that change. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-11-22 17:33:12 +08:00
wangxiyuan	fff258bce1	[Doc] add release note for v0.11.0rc2 (#4348 ) add release note for v0.11.0rc2 - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-21 23:03:32 +08:00
LI SHENGYONG	3955bf2908	[EPLB] Eplb Verify Fix (#4333 ) ### What this PR does / why we need it? Eplb Verify Fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Signed-off-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-21 18:17:46 +08:00
wangxiaochao	3deeea14a0	[bugfix] bugfix for PD disaggregate (#4319 ) This PR is used to fix mooncake_connector in pcp/dcp case. When executing function update_done_task_count, it is necessary to ensure that both pcp/dcp and TP ranks have finished transferring KV cache. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2025-11-21 18:08:56 +08:00
CodeCat	e332e27ec3	[Test] Add ut test for torchair (#4287 ) ### What this PR does / why we need it? The current community lacks unit tests (UT) for files such as torchair_worker, mtp_proposer, and model_runner. Therefore, UT coverage for these files needs to be added. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>	2025-11-21 16:33:34 +08:00
whx	a5554b6661	[Feat][Doc] Add a load_balance_dp_proxy in examples and external dp doc. (#4265 ) ### What this PR does / why we need it? This PR adds a load-balance dp proxy server which can be used in external DP scenario without Disaggregated-Prefill enabled. What's more, add a doc of external dp and load-balance dp proxy server. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? See the new doc. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-11-21 16:33:23 +08:00
Ting FU	6c157cb75a	[CI] Defaultly compile vllm with multimodal audio feature in dockerfile (#4324 ) ### What this PR does / why we need it? For better usability, compile vLLM with the multimodal audio feature by default in the Dockerfile. The image size increases by only 2.xM. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: Ting FU <futing10@huawei.com>	2025-11-21 16:15:31 +08:00
Shanshan Shen	8e3b834bf7	[MM][Bugfix] Add error log for VL models when enabling FLASHCOMM (#4272 ) ### What this PR does / why we need it? Add error log for VL models when enabling `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` or `VLLM_ASCEND_ENABLE_FLASHCOMM=1` (for backward compatibility). This is a temporary fix for https://github.com/vllm-project/vllm-ascend/issues/4132. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-21 15:04:18 +08:00
LI SHENGYONG	4573c855b7	[Readme] EPLB Support Scenarios (#4314 ) ### What this PR does / why we need it? Add information on the scope of EPLB support. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-11-21 14:24:54 +08:00
LI SHENGYONG	019c7ded91	eplb redundant expert bugfix (#4291 ) ### What this PR does / why we need it? Redundant experts bugfix ### Does this PR introduce _any_ user-facing change? After configuring the path for experts_map, users do not need to configure iinit_redundancy_expert. ### How was this patch tested? The accuracy of EPLB was tested with and without the use of redundant experts. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-11-21 14:24:35 +08:00
InSec	5a4e8cdeba	[Feat][BugFix]Support the Qwen3-Next-80B-A3B-Instruct quantization model&Fix the NZ issue (#4245 ) ### What this PR does / why we need it? Support the Qwen3-Next-80B-A3B-Instruct quantization model and Fix the NZ issue. Triton kernel doesn't support data format nz, thus we skip converting weight to nz on layer `conv1d` - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: IncSec <1790766300@qq.com>	2025-11-21 10:42:56 +08:00
Yizhou	cbb27feaf2	[Test] Add ACL graph capture/replay DP test (#4259 ) ### What this PR does / why we need it? Add an ACL graph capture/replay DP test; this is an improved version of #3886. Restructures the multi-card ACL graph test for improved clarity, robustness, and accuracy. Key improvements include: - Replaces fragile `sys.settrace` and manual patching with a clean, reusable spy installer using `unittest.mock.patch`. - Introduces more precise metrics by tracking `NPUModelRunner.execute_model` and `_dummy_run` calls directly. - Rewrites assertions to be more accurate and provides clear explanations for the expected counts of graph captures, replays, model executions, and dummy runs. - Simplifies the overall test structure by separating the worker logic into a dedicated function. - Removes a long, unnecessary sleep at the end of the test. - Expands test coverage by adding a larger `max_tokens` parameter. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: lilinsiman <lilinsiman@gmail.com>	2025-11-21 08:50:46 +08:00
Zhu Yi Lin	d96d5fa971	[Test] quick fix mla ut (#4318 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-11-20 23:06:12 +08:00
anon189Ty	5c9f4a40c6	[Feat] Support MTP to running in full graph mode (#3892 ) ### What this PR does / why we need it? Currently, the MTP model still runs in eager mode even when full graph mode is enabled. This PR adapts MTP to full graph capture and execution. When the graph mode is set to "FULL_DECODE_ONLY", MTP runs in full graph mode to improve performance. The changes, which cover both the disable_padded_drafter_batch=True and disable_padded_drafter_batch=False cases, include: 1. Add _mtp_graph_params in acl_graph.py to isolate the data of the main model from the data of MTP. 2. Pad some metadata in mla_v1.py when in fullgraph mode. 3. Fixed the essential data address that will be used in model.forward. 4. Adapted according to the aclgraph capture framework: 1). Rebuild the MTP model with ACLGraphWrapper. 2). Add common attn metadata when capture starts in the MTP dummy_run. 3). Add a common attn metadata update in MTP. 4). Adapted the data update for num_speculative_tokens > 1. 5. Add an MTP patch to adapt to vllm v0.11.0. Existing Issues: 1. When disable_padded_drafter_batch=True and running in FullGraph mode, the data of the first-round requests in MTP is abnormal. We need to identify the cause subsequently. 2. When disable_padded_drafter_batch=False and running in FullGraph mode, the acceptance rate of the second and third tokens will decrease (For example, if we set the num_speculative_tokens=3, the acceptance rate of first token is 90%, the second is only 50% lower than 60%, the third is only 20% lower than 30%). The reason is that the data processed after the model runs does not match. This is a problem from another PR. It works fine in eager and PIECEWISE mode, but has a problem in FullGraph mode. Once we have a solution, we will submit a bugfix. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-11-20 20:34:54 +08:00
Zhu Yi Lin	15c1eb025c	[CI] Add mla ut (#4280 ) ### What this PR does / why we need it? add mla_v1.py and mla.py ut ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `pytest tests/ut/attention/test_mla_v1.py` `pytest tests/ut/models/test_mla.py` - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-11-20 20:29:09 +08:00
CodeCat	470fe05df6	[Test] Add tests for the multi-node DeepSeek-V2-Lite network in GE Graph (#4039 ) ### What this PR does / why we need it? Add tests for the multi-node DeepSeek-V2-Lite network in GE Graph mode, and supplement the end-to-end (e2e) tests for the MLA and NZ features of this network. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>	2025-11-20 17:28:32 +08:00
shaopeng-666	3653f33878	avoid mrope fusion op when running qwen2.5-vl on a+x machine (#4270 ) ### What this PR does / why we need it? avoid mrope fusion op when running qwen2.5-vl on a+x machine ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Test text VQA accuracy on G8600 with aisbench - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-11-19 22:31:14 +08:00
欧派果奶我还要	c848da0687	[Bugfix] fix nightly multi-node EPLB tests' "DYNAMIC_EPLB=true" environment not working (#4223 ) ### What this PR does / why we need it? fix nightly multi-node EPLB tests by adjusting vllm_ascend\eplb\core\eplb_utils.py dynamic_eplb gate checking ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-19 21:31:58 +08:00
Delphine-Nic	a3e9673137	[long seq feat]GQA support long-prefill-token-threshold and fixbug (#4209 ) ### What this PR does / why we need it? GQA chunk prefill with pcp and dcp support long-prefill-token-threshold The markdown format results is as below: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8kdataset \| - \| accuracy \| gen \| 96.13 \| - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Signed-off-by: Delphine-Nic <t00608739@china.huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <t00608739@china.huawei.com>	2025-11-19 18:10:27 +08:00
wangxiyuan	97daf7f78c	[misc] clean up get_metadata_cls (#4276 ) Follow up #4087 to remove get_metadata_cls - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-19 17:18:19 +08:00
Canlin Guo	d5fef22149	[Docs] Improve the AISBench multi-modal testing docs (#4255 ) ### What this PR does / why we need it? Add some of the pitfalls I ran into when using AISBench to test multi-modal models. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-11-19 16:00:39 +08:00
pz1116	d43022f3ed	[doc]fix readme for kv pool user guide (#4271 ) ### What this PR does / why we need it? Add the parameter "register_buffer" for PD Aggregated Scenario in the given example. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>	2025-11-19 15:57:50 +08:00
wangxiyuan	2938bd5ad2	remove get_metadata_cls (#4087 ) remove get_metadata_cls. It's only used for V0 engine and has been removed from vLLM already. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-19 14:58:17 +08:00
realliujiaxu	1cdf9ffa73	[Bugfix] fix hang in async scheduling (#4233 ) ### What this PR does / why we need it? After https://github.com/vllm-project/vllm-ascend/pull/4113, there is no synchronization between steps. However, in async scheduling with aclgraph, it is possible that the CPU's record event for the current iteration completes before the previous iteration's graph execution has finished. If the CPU is fast enough, the device will hang on event_wait in iteration i+1 (assuming that event_record is executed immediately on the update stream of the device): <img width="1812" height="489" alt="image" src="https://github.com/user-attachments/assets/373fe655-afe5-4d7d-807e-b0aacf24a543" /> After adding synchronization, the record is launched after graph replay: <img width="1803" height="466" alt="image" src="https://github.com/user-attachments/assets/a8a68053-bd7d-49f5-a79c-9a26ef1285cc" /> The bubble time caused by synchronization is about 85 us on G8600: <img width="1491" height="804" alt="image" src="https://github.com/user-attachments/assets/968611ee-f39a-4329-8150-1c4adba25dd1" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: hwhaokun <haokun0405@163.com>	2025-11-19 14:47:19 +08:00
Li Wang	91b6ba8ffe	[CI] Fix kubernetes failed to resolve ip by dns name (#4240 ) ### What this PR does / why we need it? When a pod has started but the corresponding DNS service is not yet ready, resolving the DNS domain name immediately results in an error. See https://github.com/vllm-project/vllm-ascend/actions/runs/19436639688/job/55609108796 - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-19 14:38:13 +08:00
zhangsicheng5	df777e9faa	[bugfix] pcp + mtp acl graph bugfix (#4221 ) Fix a pcp + mtp bug when using acl graph. When using pcp + mtp, we need to flatten block_table to avoid an irregular attn mask shape. This was done in the mla attn_metadata builder, but we found that it changes the block_table address and leads to incorrect results when acl graph is enabled. To fix this, we enlarge the block_table buffer size and flatten block_table in model_runner prepare_inputs, so that the block_table address is not affected. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-11-19 11:21:46 +08:00
1092626063	9328f377b4	[refactor]support gatingtopk operator generalization (#2958 ) ### What this PR does / why we need it? Past: npu_moe_gating_top_k only supported the 'group_count=256' pattern. Now: 1. npu_moe_gating_top_k supports all sizes of group_count. 2. The functionality of `torch_npu.npu_moe_gating_top_k_softmax` is included in `torch_npu.npu_moe_gating_top_k`. CANN: depends on 8.3.RC1. Performance: 1. GLM4.5-w8a8: TPS improves by 6%. 2. Qwen3: the same as before. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: 1092626063 <1092626063@qq.com>	2025-11-19 10:38:56 +08:00
Yizhou	63561d6763	[Fix] Sorts aclgraph batch sizes in ascending order (#4230 ) ### What this PR does / why we need it? Sorts aclgraph batch sizes in ascending order, corresponding to vLLM [#26016](https://github.com/vllm-project/vllm/pull/26016) Ensures batch sizes for aclgraph are sorted ascending when aclgraph mode is enabled, improving consistency and compatibility with later logic that may depend on order. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Waiting for #3886 - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-11-19 09:36:37 +08:00
liziyu	e98543267a	[bugfix] fix proxy hen host ip using domain name (#4243 ) ### What this PR does / why we need it? fix proxy when host ip using domain name - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-11-18 16:30:51 +08:00
liziyu	a30261f779	[P/D] pd proxy support ipv6 (#4161 ) ### What this PR does / why we need it? pd proxy support ipv6, mooncake connector check whether the IPv6 address is used and notify the user. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-11-18 11:01:13 +08:00
wangxiaochao	0d04ad8c8f	[feature] Mooncake_connector support pcp/dcp (#4183 ) add feature for Mooncake_connector supporting pcp/dcp - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2025-11-18 10:17:48 +08:00
Angazenn	10a046ddce	[main][misc]change default capture size for Qwen3-MoE when using full dp (#4199 ) ### What this PR does / why we need it? Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]`. However, this is not always the best choice in every situation. This PR changes the default setting when running Qwen3-MoE in the full dp (`dp_size > 1` && `tp_size == 1`) setting, which is usually applied in Large-Scale EP. old : `[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]` new: `[1, 2, 5 ,10 ,15, 16 ,24 ,... , max_capture_size]` This is mainly because the performance of the `_npu_paged_attention` op degrades dramatically with the old settings. We hope to provide better performance when users do not set a specific `cudagraph_capture_size`. ### Does this PR introduce _any_ user-facing change? The default `cudagraph_capture_size` is modified in the above cases. However, if `cudagraph_capture_size` has already been set by the user, this PR has no effect. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-11-18 08:41:45 +08:00
weiguihua2	da1cd9c7ca	[Bugfix]Fix moe error when sp chunked the hidden_states (#4212 ) ### What this PR does / why we need it? Fix moe error when sp chunked the hidden_states by disabling sp by a hacky way - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-11-17 22:55:17 +08:00
Ronald	3677202594	make vllm-ascend work well in developer mode (#4179 ) ### What this PR does / why we need it? We often install vllm-ascend in developer mode, which has no _build_info module. This raises an error in `utils.is_310p` and `utils.sleep_model_enabled`, so these two functions need to be modified. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? not involved - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-11-17 19:13:04 +08:00
jiangyunfan1	9a1cfb48d4	[TEST]Update prefixcache perf threshold for qwen3-32b-int8 (#4220 ) ### What this PR does / why we need it? This PR updates the prefixcache threshold for qwen3-32b-int from 0.4 to 0.8, as the baseline has improved. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-11-17 19:06:54 +08:00
XiaoxinWang	e38ef2c434	support FULL graph mode for GQA (#3970 ) ### What this PR does / why we need it? The current library only supports the FullDecodeOnly graph mode, which enables full graph execution during the decode. This PR extends support to allow full graph execution in both the prefill and decode, referred to as FULL graph mode. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-17 10:50:35 +08:00
zhangyiming	c334114f69	[CI] Fix no space left in build wheel CI. (#4215 ) ### What this PR does / why we need it? [CI] Fix no space left in build wheel CI. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: menogrey <1299267905@qq.com>	2025-11-17 10:45:58 +08:00
zhangxinyuehfad	67f2b3a031	[Test] Add deepseek v3.2 exp nightly test (#4191 ) ### What this PR does / why we need it? - skip the nightly image build when the github event is pull_request - set imagepullpolicy to always for the multi_node test - move the multi_node tests ahead so that some resources are cleaned up first - do not tie the nightly image build to the nightly tests, for fault tolerance - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2025-11-14 15:46:10 +08:00
Shanshan Shen	1d0f13c1a3	[Misc] Add benchmark results into `.gitignore` (#4200 ) ### What this PR does / why we need it? Add benchmark results into `.gitignore` - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-14 15:44:28 +08:00
Canlin Guo	f10251ede0	[Platform] Add import_kernels interface (#3694 ) ### What this PR does / why we need it? Add import_kernels interface to avoid import useless vLLM C library Closes #3488. Reopen #3498 for CI. ### How was this patch tested? CI tested. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-11-14 11:32:51 +08:00
Yizhou	094f32c8c9	[Feat] Adds a utility for printing from within ACL graphs (#4162 ) ### What this PR does / why we need it? Introduces the `acl_graph_print` function to enable printing debug information from code running inside an ACL graph, such as custom operators. This works by launching a host function on a dedicated stream, bypassing the limitations of standard `print` within compiled graph execution. The implementation handles the necessary stream subscriptions and ensures they are properly unregistered upon exit. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-11-14 09:41:14 +08:00
weiguihua2	01195e860c	[Bugfix] fix cannot import name get_mp_context (#4174 ) ### What this PR does / why we need it? fix bug: cannot import vllm package - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-11-14 09:09:14 +08:00
欧派果奶我还要	f90ed95578	[CI] Add multi-nodes EPLB configs of DeepSeek-R1-W8A8 & Qwen3-235B-W8A8 (#4144 ) ### What this PR does / why we need it? add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and EPLB scenario ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>	2025-11-14 08:50:29 +08:00
LookAround0301	5ec96fd46c	[long_seq_Feat] support chunk prefill (#4158 ) ### What this PR does / why we need it? 1. Qwen GQA attention_v1 optimization. 2. DeepSeek MLA refactor: all gather q -> all gather kv. 3. modelrunner refactor for chunk prefill; some unused code is removed. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>	2025-11-14 08:43:37 +08:00
Li Wang	7294f89e43	[CI] Add daily images build for nightly ci (#3989 ) ### What this PR does / why we need it? Given the current excessively long build time of our nightly-ci, I recommend installing the necessary, confirmed versions of packages in the Docker image to reduce the time required for integration testing, including Mooncake and vllm with fixed tags. This is expected to reduce nightly-ci duration by 2 hours. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-13 20:10:12 +08:00
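
Note on b5f7a83927 ([Doc] Upgrade multi-node doc): the constraint described in that commit can be sketched as a small pre-flight check. This is a minimal illustration inferred from the quoted error message, not code from vLLM or vllm-ascend; the function name `check_batched_tokens` and its parameters are hypothetical.

```python
def check_batched_tokens(
    max_num_batched_tokens: int,
    max_model_len: int,
    chunked_prefill_enabled: bool = False,
) -> None:
    """Hypothetical pre-flight check mirroring the quoted error message.

    Assumption: without chunked prefill, a whole prompt must fit in one
    scheduling step, so the token budget must cover max_model_len.
    """
    if chunked_prefill_enabled:
        # With chunked prefill, long prompts can be split across steps,
        # so the constraint does not apply in this sketch.
        return
    if max_num_batched_tokens < max_model_len:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) is smaller "
            f"than max_model_len ({max_model_len}); increase "
            "max_num_batched_tokens or decrease max_model_len."
        )


# The values from the quoted error (4096 vs 32768) would be rejected,
# while matching values pass silently.
check_batched_tokens(max_num_batched_tokens=32768, max_model_len=32768)
```

Running such a check before launching a multi-node deployment surfaces the misconfiguration early instead of at engine start-up.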


1394 Commits