### What this PR does / why we need it?
Provide sample guidance for running long-sequence DeepSeek across
multiple nodes. A practical example is provided to guide users in using
the context parallel feature (see the sketch below).
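As a minimal sketch (not the tutorial itself), the snippet below shows one way to switch on decode context parallel from Python; the model path and parallel sizes are placeholders, and `decode_context_parallel_size` is assumed to be the relevant engine argument.

```python
# Hedged sketch: model path and parallel sizes are placeholders; see the
# tutorial for the actual multi-node launch commands.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # placeholder model path
    tensor_parallel_size=16,          # spans the NPUs of all nodes
    decode_context_parallel_size=2,   # shard long sequences at decode time
    max_model_len=131072,             # long-sequence use case
)
out = llm.generate(["<a very long prompt> ..."], SamplingParams(max_tokens=64))
```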
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Update the vLLM pin to the 12.26 commit.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Roll back the causal_conv1d_fn op from the Triton version to the Torch
version to fix hanging issues, and update the Qwen3Next doc accordingly.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This PR updates the DeepSeek-R1/V3.1 doc to give a simple recipe for
reproducing our latest performance on Atlas A3/A2 servers.
### Does this PR introduce _any_ user-facing change?
No.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Add a developer guide for PCP & DCP (prefill context parallel and decode context parallel).
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
1. Refresh the additional config doc.
2. Move the KV config logic to the platform layer.
3. Improve the `dump_config` init logic and rename it to `dump_config_path`.
This change is user-facing: `dump_config` changes from a dict to a
string (see the sketch below).
4. Correct the `enable_async_exponential` type.
5. Remove the unused `chunked_prefill_for_mla`.
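A minimal sketch of the renamed option from point 3 above; the model path is a placeholder.

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    additional_config={
        # previously a dict under "dump_config"; now a string path
        "dump_config_path": "/tmp/vllm_ascend_config.json",
    },
)
```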
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Fix a vLLM breakage introduced by this PR:
1. [Drop v0.14 deprecations](https://github.com/vllm-project/vllm/pull/31285)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
### What this PR does / why we need it?
Update the configuration for optimal performance of DeepSeek V3.2 in the usage tutorial.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
`VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with
`VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which is entirely unused. This PR
removes it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add a `pa_shape_list` description to the Qwen dense tutorial.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?
Fix the following vLLM breakages:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4%
TTFT improvement](https://github.com/vllm-project/vllm/pull/29558)
Fix: add the now-required `all2all_backend` parameter. Its only effect
on the original `set_splitting_ops_for_v1` implementation is that graph
mode is disabled in vLLM when `deepep_high_throughput` is enabled; it
has no effect on the vllm-ascend logic.
2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention
interface](https://github.com/vllm-project/vllm/pull/30684)
Fix: the GPU does not need to convert qkv to 3D because its
flash_attention operator accepts both the 4D and 3D layouts (`b s h d`
and `s b (h d)`), but the NPU's flash_attention_unpad operator only
supports the 3D layout (`s b (h d)`). We therefore introduce a
`reshape_qkv_to_3d` operation (see the sketch after this list).
3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the
following issue with the upgraded vLLM code:
https://github.com/vllm-project/vllm-ascend/issues/5297
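A minimal sketch of the layout conversion from item 2; the helper name follows the PR text, and the exact shape convention is an assumption read off the `b s h d` / `s b (h d)` notation above.

```python
import torch

def reshape_qkv_to_3d(x: torch.Tensor) -> torch.Tensor:
    """Convert a 4D (b, s, h, d) tensor into the 3D (s, b, h*d) layout
    accepted by the NPU flash_attention_unpad operator."""
    b, s, h, d = x.shape
    return x.permute(1, 0, 2, 3).reshape(s, b, h * d)

q = torch.randn(2, 128, 16, 64)                    # (b, s, h, d)
assert reshape_qkv_to_3d(q).shape == (128, 2, 1024)
```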
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Co-authored-by: zxwang <1476209578@qq.com>
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
### What this PR does / why we need it?
[Kthena](https://github.com/volcano-sh/kthena) is a Kubernetes-native
LLM inference platform that transforms how organizations deploy and
manage Large Language Models in production. Built with declarative model
lifecycle management and intelligent request routing, it provides high
performance and enterprise-grade scalability for LLM inference
workloads.
The platform extends Kubernetes with purpose-built Custom Resource
Definitions (CRDs) for managing LLM workloads, supporting multiple
inference engines (vLLM, SGLang, Triton) and advanced serving patterns
like prefill-decode disaggregation.
This PR adds an example of deploying an LLM on an Ascend Kubernetes cluster.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
### What this PR does / why we need it?
[Doc] Add new contributors and related scripts.
Usage of scripts:
- `export GITHUB_TOKEN=<your github token>`
- `bash tools/collect_user_first_contribution.sh
vllm-project/vllm-ascend <base_sha> <head_sha>` and save the result to
a temporary file such as `contributors.txt`
- `python tools/format_contributors.py contributors.txt --start <start
index now>`
- Use the output to update the `contributors.md`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Update the weight download URL, because the model was renamed.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following link:
https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between Xlite and the default
aclgraph mode is as follows:
### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
### What this PR does / why we need it?
Fix DeepSeek-V3.2 tutorial.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Add a control to enable overlapping the exponential distribution
operator with model execution (default is OFF, since this feature may
not perform well on MoE models, e.g. Qwen3-30B).
Enabling async exponential overlapping provides a performance
improvement.
Also, overlapping the exponential operator with model execution can
hide the performance drop introduced by the AI-CPU version of the
exponential operator.
**UPDATE** (12/12):
The overlap now uses the same stream introduced in PR #4908.
We moved `do_async_exponential` from `model_runner_v1.py` to
`sampler.py`.
Async exponential is now enabled via `additional_config`: add
`"enable_async_exponential": 1` to `additional_config`.
We now **only** support the default exponential and the AI-CPU
exponential; the old `"enable_async_exponential": 2` option has been
dropped for consistency.
### Does this PR introduce _any_ user-facing change?
**YES**, this adds a new `additional_config` option:
`"enable_async_exponential": 1` (see the sketch below).
When `enable_async_exponential` is set to 1, async exponential is
enabled and overlapped with the model runner.
When `enable_async_exponential` is set to 0 (the default), async
exponential is disabled, but the exponential operator still runs on
the separate stream introduced in #4908.
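A minimal sketch of turning the knob on, following the PR text; the model path is a placeholder.

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    additional_config={
        # 1: overlap the exponential op with model execution; 0 (default): off
        "enable_async_exponential": 1,
    },
)
```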
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>
We plan to release v0.13.0 soon, so there is no need to support v0.12.0
anymore. Let's drop it.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR provides an introduction to the Qwen3-VL-235B-A22B-Instruct
model, details on the features supported by the model in the current
version, the model deployment process, as well as methods for
performance testing and accuracy testing.
With this document, the Qwen3-VL-235B-A22B-Instruct model can be
deployed and tested more easily.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: luluxiu520 <l2625793@outlook.com>
### What this PR does / why we need it?
This patch aims to:
1. Add an OS-focused section to the performance tuning doc.
2. Set some default environment variables in the image for performance.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
DeepSeek V3.1 / DeepSeek R1 doc enhancements.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: 1092626063 <1092626063@qq.com>
### What this PR does / why we need it?
Remove Pangu-related code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
The vLLM community has integrated its own MooncakeConnector, so the
original scripts now find that connector instead of the one from
vLLM-Ascend. All scripts that use the MooncakeConnector need to be
updated to the new name.
### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load the vLLM-Ascend MooncakeConnector (see the sketch below).
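A minimal sketch of selecting the renamed connector; `AscendMooncakeConnector` is a hypothetical stand-in for the new name, so check the updated scripts for the actual string.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        # hypothetical new name; the old "MooncakeConnector" now resolves
        # to the upstream vLLM connector
        kv_connector="AscendMooncakeConnector",
        kv_role="kv_both",
    ),
)
```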
### How was this patch tested?
By CI.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
Refactor some outdated docs.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR aims to implement the basic model runner v2 framework in
vllm-ascend; end-to-end functionality is not guaranteed by this PR.
### Does this PR introduce _any_ user-facing change?
Use `envs.VLLM_USE_V2_MODEL_RUNNER` to decide whether to choose `model_runner_v2` (see the sketch below).
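A minimal sketch of opting into the v2 runner; the switch value semantics are an assumption, and the model path is a placeholder.

```python
import os

# Set before importing vllm so the vllm-ascend envs module picks it up.
os.environ["VLLM_USE_V2_MODEL_RUNNER"] = "1"  # unset/0 keeps model_runner_v1

from vllm import LLM

llm = LLM(model="Qwen/Qwen3-8B")  # placeholder; e2e not yet guaranteed
```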
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Add Qwen3 reranker tutorials.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
---------
Signed-off-by: TingW09 <944713709@qq.com>
### What this PR does / why we need it?
Add instructions on using permissions in Docker.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Upstream vLLM PR
[#30212](https://github.com/vllm-project/vllm/pull/30212) refactored
the attention backend selection interface. This PR adapts vllm-ascend's
`get_attn_backend_cls` to align with the new upstream standard,
ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Co-authored-by: leo-pony <nengjunma@outlook.com>
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add a Mooncake installation tutorial for the KV pool and update the
existing Mooncake installation tutorial.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Add a user guide for speculative decoding that covers n-gram, EAGLE,
MTP, and suffix decoding (see the sketch below).
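A minimal n-gram sketch from the methods the guide covers; parameter values are illustrative, and EAGLE / MTP / suffix have their own configs described in the guide.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,  # tokens proposed per step
        "prompt_lookup_max": 4,       # longest n-gram to match in the prompt
    },
)
print(llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32)))
```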
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
### What this PR does / why we need it?
This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.
Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends
(DRAM / NFS / local disk), depending on the UCM configuration.
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.
**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.
---
### Does this PR introduce _any_ user-facing change?
Yes, but limited:
- A new `kv_connector=UCMConnector` option becomes available through
the configuration interface (see the sketch after this list).
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.
This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.
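A minimal sketch of explicitly enabling the connector; the extra-config key for choosing the UCM backend is hypothetical, and the model path is a placeholder.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="UCMConnector",
        kv_role="kv_both",
        # hypothetical key selecting DRAM / NFS / local disk
        kv_connector_extra_config={"ucm_backend": "localdisk"},
    ),
)
```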
---
### How was this patch tested?
---
### Prefix Caching Benchmark
We provide preliminary TTFT (ms) measurements under the vLLM benchmark.
Tests were run on 2 × Ascend 910B3 with vllm-ascend 0.11.0, tensor
parallel size 2, and UCM (local disk) enabled.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
### What this PR does / why we need it?
Upgrade vllm commit hash to `4429d934de3c5cc327b0d7aec8e473aeba38db90`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Modify the quantization tutorials Qwen3-32B-W4A4.md and
Qwen3-8B-W4A8.md to correct a few mistakes:
- Qwen3-8B-W4A8: one idle NPU card needs to be set aside.
- Qwen3-32B-W4A4: two idle NPU cards need to be set aside for the
FlatQuant training, and the `calib_file` path needs to be fixed, as it
does not match the ModelSlim version.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: IncSec <1790766300@qq.com>
### What this PR does / why we need it?
Add detailed descriptions for `ASCEND_CONNECT_TIMEOUT` and
`ASCEND_TRANSFER_TIMEOUT` in the KV pool docs (see the sketch below).
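A minimal sketch of setting the two timeouts before launch; the values and their unit (assumed seconds) are illustrative, so see the KV pool doc for the authoritative defaults.

```python
import os

# Assumed to be read at connector init; values are illustrative.
os.environ["ASCEND_CONNECT_TIMEOUT"] = "30"   # handshake timeout
os.environ["ASCEND_TRANSFER_TIMEOUT"] = "60"  # per-transfer timeout
```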
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: LCAIZJ <leichao139636@163.com>
### What this PR does / why we need it?
This PR provides an introduction to the Qwen3-Next model, details on the
features supported by the model in the current version, the model
deployment process, as well as methods for performance testing and
accuracy testing.
With this document, the Qwen3-Next model can be deployed and tested
more easily.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Since the `task` parameter has been deprecated, we should use the
latest unified standard parameters for pooling models; this should be
clearer (see the sketch below).
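A minimal before/after sketch, assuming the unified `runner`/`convert` parameters that replaced `task` upstream; the model path is a placeholder.

```python
from vllm import LLM

# Old (deprecated): LLM(model=..., task="embed")
llm = LLM(
    model="Qwen/Qwen3-Embedding-0.6B",  # placeholder pooling model
    runner="pooling",                   # unified runner selection
    convert="embed",                    # expose an embedding interface
)
outputs = llm.embed(["vLLM on Ascend"])
```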
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Correct more doc mistakes
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Correct mistakes in the docs.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>