xc-llm-ascend

Author	SHA1	Message	Date
Mengqing Cao	91c35d765a	[Bugfix] Fix mc2 operator error in aclgraph + ep<16 scenario (#2609 ) ### What this PR does / why we need it? 1. quickfix mc2 operator error in aclgraph + ep<16 scenario to recover CI, will be refactored in the future 2. disable aclgraph when testing w8a8 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `95089607fa` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-29 21:59:16 +08:00
wangxiaoteng666	ee6d141dd4	[MAIN][BUGFIX] BugFix: Resolve the issue of waiting queue accumulation when requests are canceled. (#2426 ) ### What this PR does / why we need it? Resolve the issue of waiting queue accumulation when requests are canceled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.10.1.1 - vLLM main: `006477e60b` --------- Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>	2025-08-29 17:19:23 +08:00
weichen	52aff9e229	[main] [bugfix] Fix misjudging quantized/unquantized scenarios (#2627 ) ### What this PR does / why we need it? In a mixed-precision scenario, quant_config is not None, but MoE needs to perform unquantized computation; however, quantized computation is currently being used. Therefore, we put the with_quant logic into forward, avoid misjudging in mix-precision scenarios. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `98ac0cb32d` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-08-29 16:20:22 +08:00
yiz-liu	aadc75c247	[Fix] Resolve data-parallel (DP) assertion errors in TorchAir (#2626 ) ### What this PR does / why we need it? It is confirmed that `num_input_tokens` must be assigned the value of `maybe_padded_num_tokens` under all circumstances. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Waiting for daily test for TorchAir. - vLLM version: v0.10.1.1 - vLLM main: `006477e60b` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-29 16:06:49 +08:00
lidenghui1110	600b08f754	[Feat]: Add custom lmhead tensor model parallel (#2309 ) ### What this PR does / why we need it? This PR introduces LMhead tensor model parallelism to reduce memory consumption and improve TPOT. It supports both eager mode and graph mode. In a deepseek r1 w8a8 PD-disaggregated Decode instance using pure DP with lmhead_tensor_parallel_size = 8, TPOT improves by 1 ms and 1.48 GB of NPU memory is saved per rank. performance data: <img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| lmhead_tensor_parallel_size \| Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces \| No \| int \| Default is None; once this value is set, the feature is enabled. vocab_size must be divisible by this value. \| example `--additional_config={"lmhead_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de533ab2a1` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zhangzihang <zzh_201018@outlook.com>	2025-08-29 11:41:21 +08:00
zhangxinyuehfad	e7ad4a64f4	[CI] Add e2e ci test for A3 (#2573 ) ### What this PR does / why we need it? Add e2e ci test for A3 ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `11a7fafaa8` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-29 09:33:42 +08:00
yiz-liu	dfc7eb39ad	[Fix] Fix DP-related padding logic (#2582 ) ### What this PR does / why we need it? The determination of attention state, padding, and other forward metadata has been moved to an earlier stage within the input preparation process. This change enables us to utilize a single all-reduce operation, maximizing synchronization efficiency as early as possible. The logic for synchronizing metadata (such as the number of tokens, prefill status, and DBO status) across data parallel (DP) ranks has now been unified and simplified. For performance improvements, the all-reduce operation has been switched from the `gloo` backend to the `npu` backend, which results in a reduction of several milliseconds per step (approximately 10% performance gain for TPOT!). Additionally, the multi-DP server hang issue has been resolved, ensuring no more hangs occur when `num_requests < dp_size`. Alas, a relief. Finally, the miscalculated memory usage issue has been addressed by removing the unnecessary `DummyCommImpl`, allowing the system to use the real communication method when determining available memory. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Maybe we should add a test case for multi-DP online server? @MengqingCao - vLLM version: v0.10.1.1 - vLLM main: `c5d004aaaf` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-28 19:39:58 +08:00
Yikun Jiang	175f6bc445	Support v0.10.1 (#2584 ) ### What this PR does / why we need it? This patch also supports v0.10.1 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - CI passed - test 0.10.1: https://github.com/vllm-project/vllm-ascend/pull/2583 - vLLM version: v0.10.1.1 - vLLM main: `321938e9ac` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-28 18:47:53 +08:00
Mengqing Cao	6c973361fc	[Bugfix] Fix aclgraph not enabled by default (#2590 ) ### What this PR does / why we need it? As vllm will set `cudagraph_mode` to `NONE` before `check_and_update_config` in post init of `VllmConfig` (`5da4f5d857/vllm/config/__init__.py (L3630)`), `cudagraph_mode` is never `None` at that point, so we must remove this check and re-add it once the related adaptation in vllm is done. part of https://github.com/vllm-project/vllm-ascend/pull/2577, will add the e2e test on applying reply after the CI refactor is done ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `f48a9af892` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-28 14:08:31 +08:00
yupeng	cf96366a39	[Bugfix][LoRA][Patch] Fix the LoRA inference bug after upstream vLLM codebase changed (#2560 ) ### What this PR does / why we need it? The merge of upstream https://github.com/vllm-project/vllm/pull/22592 caused a vllm-ascend LoRA inference bug. The details are as follows: According to [torch_npu/npu/_stream_check.py](`863b9071cb/torch_npu/npu/_stream_check.py (L74)`), NPU device type tensors have attributes is_cuda=True and is_npu=True. This causes vLLM's apply_repetition_penalties function to take the branch "if logits.is_cuda and logits.is_contiguous()" and call the custom op implemented in CUDA, which is not compatible with NPU. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_ilama_lora.py pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py - vLLM version: v0.10.1.1 - vLLM main: `fe8d7b6f03` --------- Signed-off-by: paulyu12 <paulyu0307@gmail.com> Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: paulyu12 <paulyu0307@gmail.com>	2025-08-28 10:40:51 +08:00
yeyifan	1191a64ae5	[Feat]attention add sliding windows size (#2528 ) ### What this PR does / why we need it? Add a sliding window size parameter to attention ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? For the `Gemma3` model, set additional_config={"ascend_scheduler_config": {"enabled":True}}; only AscendScheduler is supported. Test command: `python3 -m vllm.entrypoints.openai.api_server --model gemma3 --additional-config '{"ascend_scheduler_config":{"enabled":true}}'` - vLLM version: v0.10.1.1 - vLLM main: `6578e87365` --------- Signed-off-by: nsdie <yeyifan@huawei.com>	2025-08-28 10:37:19 +08:00
LeeWenquan	c8d1df3a3f	[Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465 ) ### What this PR does / why we need it? In order to support fused kernels, multi-stream, communication optimization, etc., it's better to aggregate all operations in the Attention layer together. This PR tries to refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl. Note that the new mla_v1 doesn't take torchair into consideration, so this PR can only be merged after torchair-related mla_v1 is isolated into a new file. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? ### Features Test <img width="506" height="141" alt="image" src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466" /> ### Performance After Refactor <img width="648" height="486" alt="image" src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb" /> ### Performance Before Refactor <img width="618" height="494" alt="image" src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1" /> - vLLM version: v0.10.1.1 - vLLM main: `e03940762b` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: lwq <liwenquan5@huawei.com> Co-authored-by: whx-sjtu <2952154980@qq.com>	2025-08-28 10:35:57 +08:00
weichen	320edde2df	[main] [refactor] refactor fused_moe.py to enable token_dispatchers (#2570 ) ### What this PR does / why we need it? Enable token_dispatcher to replace fused_experts_with_xxx in eager mode ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `704432af3c` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: sherie <963372609@qq.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com> Co-authored-by: shiyuan680 <72335504+shiyuan680@users.noreply.github.com>	2025-08-28 10:13:35 +08:00
Wang Yixuan	936c102105	[bugfix][refactor]fix torchair_w8a8 (#2569 ) ### What this PR does / why we need it? Separate torchair w8a8 and w4a8 from fused_moe, following the refactor of and changes to fused_moe ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `69244e67e6` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-28 09:10:31 +08:00
Wang Yixuan	a955e5d404	[4/N][refactor]delete torchair from quantization (#2535 ) ### What this PR does / why we need it? After moving the torchair-related quantization section into torchair_quantization, split torchair from the original quantization code ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `69244e67e6` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-28 09:10:03 +08:00
Icey	c578f817ca	[CustomOp] Register VocabParallelEmbedding instead of overwrite forward (#2515 ) ### What this PR does / why we need it? Register VocabParallelEmbedding instead of overwrite forward ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `644d57d531` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-28 08:57:34 +08:00
Li Wang	516e14ae6a	[Doc] Upgrade to multi-node tutorial model to deepseek-v3.1-w8a8 (#2553 ) ### What this PR does / why we need it? Upgrade to multi-node tutorial model to deepseek-v3.1-w8a8 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de02b07db4` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 14:16:44 +08:00
rjg-lyh	2bfbf9b9b3	[main][bugfix] Fix bugs and refactor cached mask generation logic (#2442 ) ### What this PR does / why we need it? This PR fixes bugs and refactors the cached mask generation logic. The cached mask is now pre-constructed and used on the CPU instead of on the NPU device. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `9b5f64238f` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-08-27 12:07:29 +08:00
huangxialu	6881c19458	[main] convert the format of gmm to nz (#2474 ) ### What this PR does / why we need it? convert the format of gmm to nz ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? ut: test_fused_ops.py and e2e: test_fused_moe.py performance: (qwen3 30B, 2k->20k) base: Total Token throughput (tok/s): 719.93 gmm nz: Total Token throughput (tok/s): 728.52 - vLLM version: v0.10.1.1 - vLLM main: `bfc1edc9f5` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-08-27 11:25:02 +08:00
wangxiyuan	c0e12143a3	[CI] Fix UT failure (#2563 ) UT is broken by vLLM commit https://github.com/vllm-project/vllm/pull/23664 This PR mocks the related config to recover the CI - vLLM version: v0.10.1.1 - vLLM main: `6dab89b8ec` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-27 11:24:35 +08:00
Wang Yixuan	20a7bc4b71	[3/N][refactor] refactoer quantization (#2504 ) ### What this PR does / why we need it? Move the torchair-related quantization section into the torchair dir to make the code clearer. As a next step, we'll remove all torchair-related code outside of torchair quantization. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `959783fb99` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-27 10:45:50 +08:00
weiguihua2	acdc53c2f6	[Bugfix] Fix the bug of cos invalid shape when dp (#2558 ) ### What this PR does / why we need it? Fix the bug of cos invalid shape when dp ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `1fdc732419` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-27 10:36:23 +08:00
Mengqing Cao	a9e78a3299	[Aclgraph] Update compilation config in `check_and_update_config` (#2540 ) ### What this PR does / why we need it? This PR updates the compilation config in `check_and_update_config`: we use `compilation_config.level` to update `compilation_config.cudagraph_mode` to ensure the config is correct. Add `compilation_config.cudagraph_num_of_warmups = 1` when V1 is enabled, because this is also used in torchair graph mode. This fixes https://github.com/vllm-project/vllm-ascend/issues/2523 and the bug where `aclgraphmode` was always `NONE` while running forward in aclgraph mode ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `f58675bfb3` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-27 09:30:25 +08:00
wangxiyuan	f22077daa6	[Embedding] Recover embedding function (#2483 ) Fix broken embedding function. It's broken by http://github.com/vllm-project/vllm/pull/23162 - vLLM version: v0.10.1.1 - vLLM main: `efc88cf64a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-27 09:22:01 +08:00
s30076806	6a4ec186e7	[Qwen-moe] Remove the minor operation arange (#2373 ) ### What this PR does / why we need it? Integrate the arange operator to reduce the time spent and improve performance ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `56dcf4e7e9` --------- Signed-off-by: s30076806 <songjiayang2@h-partners.com>	2025-08-27 09:13:31 +08:00
rjg-lyh	358ba68994	[main][bugfix] Fix MatmulNZ format bug on some machines (#2549 ) ### What this PR does / why we need it? This PR fixes the bug on some machines where quantmatmul failed to run with the NZ format. The change ensures proper execution under the expected data layout. ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `b5d34af328` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-08-27 09:08:17 +08:00
Li Wang	042605f4b2	[Doc] Add stable modelslim branch (#2545 ) ### What this PR does / why we need it? The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is commercial delivery version of modelslim in Q3, and has been verified available ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7d67a9d9f9` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 09:05:46 +08:00
zhanghw0354	8151a9d5a4	[Test]Add unit test for worker_v1.py (#2547 ) ### What this PR does / why we need it? According to issue https://github.com/vllm-project/vllm-ascend/issues/1298 ,this pull request adds unit test code for platform.py. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `b5d34af328` --------- Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>	2025-08-26 22:00:49 +08:00
yiz-liu	a6bb502e70	[2/N][Feat] Add MC2 communication method for MoE layers (#2469 ) ### What this PR does / why we need it? This method replaces the previous all-gather approach for small numbers of tokens. The key changes include: - A new `AscendFusedMoE` layer that handles token splitting, local computation, and final aggregation via all-gather. - Logic in the model runner to dynamically select between the new MC2 method and the existing all-gather method based on the number of input tokens. - Sharding the MoE communication mask across tensor-parallel ranks. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test case fixed. - vLLM version: v0.10.1.1 - vLLM main: `b00e69f8ca` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-26 19:05:23 +08:00
Wang Yixuan	5d8ec28009	[2/N][refactor] split torchair from fused_moe (#2503 ) ### What this PR does / why we need it? After moving the torchair-related fused_moe section into torchair_fused_moe, split torchair from the original fused_moe ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `2a97ffc33d` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-26 14:12:43 +08:00
lilinsiman	cfe77e83ae	[Bugfix]Support Qwen3-MOE on aclgraph mode in sizes capture and add new ut (#2511 ) [Bugfix]Support Qwen3-MOE on aclgraph mode in sizes capture and add new ut What this PR does / why we need it? This PR solves the problem of sizes capture and stream error caused by using ACLgraph on the Qwen3-30B MOE model. Add new ut. Does this PR introduce any user-facing change? no How was this patch tested? ut - vLLM version: v0.10.1.1 - vLLM main: `6fad29b11b` Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-08-26 12:39:21 +08:00
zhanghw0354	b3fdd78a6b	[Main][Refactor]Change ASCEND_QUATIZATION_METHOD to ASCEND_QUANTIZATION_METHOD (#2517 ) ### What this PR does / why we need it? The constant ASCEND_QUATIZATION_METHOD in vllm_ascend/utils.py is misspelled and should be corrected to ASCEND_QUANTIZATION_METHOD. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `c9abb10489` Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>	2025-08-26 09:06:16 +08:00
Mengqing Cao	21b5727f9a	[CI] Upgrade vllm in accuracy and performance CI (#2527 ) ### What this PR does / why we need it? Upgrade vllm in accuracy and performance CI ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `5c4b6e66fe` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-26 08:49:49 +08:00
wangxiyuan	7e494e94a9	[CI] Fix broken ci (#2530 ) vLLM commit https://github.com/vllm-project/vllm/pull/22711 changed the encode cache entries logic; this PR adapts the same change for vllm ascend to make CI happy. Co-Authored-By: zhoux77899 <zhouxiang100@huawei.com> - vLLM version: v0.10.1.1 - vLLM main: `0ff902f3b4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-26 07:42:24 +08:00
yiz-liu	99bf25af76	[Fix] Add operations in `_dummy_run` to maintain synchronization with `_process_reqs`, resolving a service hang (#2454 ) ### What this PR does / why we need it? Fixes hang when batch size < DP size. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? After this change, the function in DP case will work now. - vLLM version: v0.10.1.1 - vLLM main: `d9a55204ba` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-25 19:56:02 +08:00
wangxiyuan	de7649492d	[Refactor] cleanup converting_weight_acl_format_format (#2482 ) move maybe_converting_weight_acl_format_format to torchair module, it's only used with 310p+torchair - vLLM version: v0.10.1.1 - vLLM main: `49ab23b3cc` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-25 19:48:55 +08:00
Wang Yixuan	0f81e032f0	[1/N][refactor] torchair fused_moe refactor (#2438 ) ### What this PR does / why we need it? Move torchair related fused_moe section into torchair_fused_moe to make the code clear. Next step we'll remove all torchair related code outside of torchair_fused_moe . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.10.0 vLLM main: `08d5f7113a` - vLLM version: v0.10.1.1 - vLLM main: `170e8ea9ea` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-25 15:46:10 +08:00
Shanshan Shen	334c44613a	[Doc] Update release version info (#2518 ) ### What this PR does / why we need it? Update release version info. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `712d0f88d8` Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-08-25 15:39:10 +08:00
Shanshan Shen	98c68220c1	[Doc] Update `v0.9.1rc3` doc (#2512 ) ### What this PR does / why we need it? Update `v0.9.1rc3` doc, which are supplements to https://github.com/vllm-project/vllm-ascend/pull/2488. - vLLM version: v0.10.0 - vLLM main: `170e8ea9ea` Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-08-25 11:39:29 +08:00
Mengqing Cao	4c4ffeebe5	[Doc] update vllm version in ci (#2513 ) ### What this PR does / why we need it? update vllm version in ci - vLLM version: v0.10.0 - vLLM main: `170e8ea9ea` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-25 11:35:37 +08:00
Shanshan Shen	0767d51dd5	[Structured Output][CI] Add test for `outlines` backend for structured output in CI (#2283 ) ### What this PR does / why we need it? Add test for `outlines` backend for structured output in CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests have all passed with: ```bash pytest -sv tests/e2e/singlecard/test_guided_decoding.py ``` - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-25 09:59:13 +08:00
Icey	891b2bfe71	Accuracy report formatting (#2279 ) ### What this PR does / why we need it? Accuracy report formatting ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-25 09:39:30 +08:00
Icey	f796e6280b	[CustomOp] Register RotaryEmbedding instead of overwrite forward (#2385 ) ### What this PR does / why we need it? Register RotaryEmbedding instead of overwrite forward ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.0 - vLLM main: `808d2e9aa0` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com>	2025-08-25 09:32:35 +08:00
weichen	950c4b219a	[main] refactor alltoallv in fused_moe (#2487 ) ### What this PR does / why we need it? Refactor all2all-related fused_experts (both quantized/unquantized) into TokenDispatcherWithAll2AllV, including dispatch & combine calculation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E & UT - vLLM version: v0.10.0 - vLLM main: `65197a5fb3` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-08-23 20:38:17 +08:00
linfeng-yuan	4af5b80606	[Scheduler] validate max_num_batched_tokens and max_model_len in AscendSchedulerConfig (#2434 ) ### What this PR does / why we need it? Add configuration check logic for ascend scheduler: if chunked_prefill is disabled, max_num_batched_tokens couldn't be less than max_model_len, following vLLM; ### Does this PR introduce _any_ user-facing change? users cannot set max_num_batched_tokens smaller than max_model_len with ascend scheduler ### How was this patch tested? CI and vllm serving passed - vLLM version: v0.10.0 - vLLM main: `f77a0802b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-23 19:39:44 +08:00
ZhaoJiangJiang	3629bc4431	feat: add mtp ut and fix some bugs (#2453 ) ### What this PR does / why we need it? Fix mtp mode ut ### Does this PR introduce _any_ user-facing change? Nothing ### How was this patch tested? This can be tested in the same way as a unit test. - vLLM version: v0.10.0 - vLLM main: `53415653ff` Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com> Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>	2025-08-22 17:09:08 +08:00
weiguihua2	dd04a96ee3	[Bugfix] Fix the bug of incorrect precision (#2479 ) ### What this PR does / why we need it? Fix the bug of incorrect precision - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-22 17:08:56 +08:00
Shanshan Shen	f0be3eed84	[Doc] Add release note for `v0.9.1rc3` (#2488 ) ### What this PR does / why we need it? Add release note for `v0.9.1rc3`. - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-22 16:06:29 +08:00
Mengqing Cao	60ac4fb576	[QuickFix] Skip failed ut to recover CI quickly (#2484 ) ### What this PR does / why we need it? Skip failed ut to recover CI quickly related ut: - `test_embed_models_correctness`: revert me when pooler is adapted with the latest vllm main - `test_check_and_update_config_enforce_eager_mode`: revert me when the occasional failure is fixed - vLLM version: v0.10.0 - vLLM main: `8896eb72eb` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-22 14:14:51 +08:00
LookAround0301	e9fb895b10	[Doc] Add feature branch long_seq_optimization (#2477 ) ### What this PR does / why we need it? Add cp/sp feature branch - vLLM version: v0.10.0 - vLLM main: `0c6e40bbaa` Signed-off-by: LookAround <lixushi@huawei.com>	2025-08-22 08:53:12 +08:00
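
The `[Scheduler]` entry (#2434) in the log above describes a configuration check: when chunked prefill is disabled, `max_num_batched_tokens` may not be smaller than `max_model_len`. A minimal sketch of that rule follows. The class name `SchedulerLimits` and its field `enable_chunked_prefill` are hypothetical names chosen for illustration and are not taken from the vllm-ascend source.

```python
from dataclasses import dataclass


@dataclass
class SchedulerLimits:
    """Hypothetical stand-in for the scheduler config; not the real class."""

    max_num_batched_tokens: int
    max_model_len: int
    enable_chunked_prefill: bool = False  # assumed flag name

    def validate(self) -> None:
        # Without chunked prefill, a full-length prompt must fit in one batch,
        # so the batch token budget cannot be below the model length.
        if (not self.enable_chunked_prefill
                and self.max_num_batched_tokens < self.max_model_len):
            raise ValueError(
                f"max_num_batched_tokens ({self.max_num_batched_tokens}) must be "
                f">= max_model_len ({self.max_model_len}) when chunked prefill "
                "is disabled")


if __name__ == "__main__":
    SchedulerLimits(max_num_batched_tokens=8192, max_model_len=4096).validate()
    try:
        SchedulerLimits(max_num_batched_tokens=2048, max_model_len=4096).validate()
    except ValueError as err:
        print(f"rejected: {err}")
```

With chunked prefill enabled, the same small budget is accepted, since long prompts can then be split across steps.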


780 Commits
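
The `lmhead_tensor_parallel_size` option from the `[Feat]` entry (#2309) splits the lm_head matrix along the vocab dimension and requires `vocab_size` to be divisible by the configured value. The arithmetic behind that constraint can be sketched as follows. The helper `lmhead_shard_range` is a hypothetical illustration, not a function from the repository, and it assumes contiguous, equally sized shards per rank.

```python
from typing import Optional, Tuple


def lmhead_shard_range(vocab_size: int,
                       lmhead_tensor_parallel_size: Optional[int],
                       rank: int = 0) -> Tuple[int, int]:
    """Return the [start, end) vocab slice a rank would own (illustrative only)."""
    # None means the feature is off: every rank keeps the full lm_head.
    if lmhead_tensor_parallel_size is None:
        return (0, vocab_size)
    size = lmhead_tensor_parallel_size
    if size <= 0 or vocab_size % size != 0:
        raise ValueError(
            f"vocab_size ({vocab_size}) must be divisible by "
            f"lmhead_tensor_parallel_size ({size})")
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} is outside [0, {size})")
    shard = vocab_size // size  # columns owned by each rank
    return (rank * shard, (rank + 1) * shard)


if __name__ == "__main__":
    for rank in range(8):
        print(rank, lmhead_shard_range(1024, 8, rank))
```

Equal shards keep the per-rank memory saving uniform, which is why a non-divisible vocab size is rejected rather than padded in this sketch.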