xc-llm-ascend

Author	SHA1	Message	Date
UnifiedCacheManager	195eac665b	[Core][Worker] Add UCMConnector for KV Cache Offloading (#4411 ) ### What this PR does / why we need it? This PR introduces the initial integration of UCM (Unified Cache Management) into the vllm-ascend distributed KV-cache system. Specifically, it adds: - A new `UCMConnector` implementation under the distributed KV-transfer framework. - Support for offloading KV-cache blocks to external UCM backends (DRAM / NFS / Localdisk, depending on UCM configuration). - Integration with vLLM V1 KV connector interface, including metadata handling and role registration. Why it is needed: - UCM provides a unified, high-performance storage layer for KV-cache externalization. - This enables vllm-ascend to support out-of-core KV-cache workloads, improve memory efficiency, and leverage hardware-accelerated storage paths (RDMA / NFS / hybrid modes). - This connector is a required component to allow future work on multi-node inference + UCM-based scaling. --- ### Does this PR introduce _any_ user-facing change? Yes, but limited: - A new `kv_connector=UCMConnector` option becomes available through the configuration interface. - When selected, vllm-ascend workers may initialize UCM and offload KV-cache blocks externally. - No default behaviors are changed. Users must explicitly enable this connector. This PR does not modify: - existing APIs, - default execution paths, - model runner behavior, - user workflow unless `UCMConnector` is configured. --- ### How was this patch tested? --- ### Prefix Caching Benchmark We provide preliminary measurements for TTFT (ms) under the vLLM benchmark. Tests run on 2 * Ascend 910B3, vllm-ascend 0.11.0, Tensor Parallel size 2, with UCM (Localdisk) enabled. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2025-12-16 10:53:30 +08:00
SILONG ZENG	237fad635c	[Fix]Revert temporary skip on mtp1/mtp2 correctness tests (aclgraph fix) (#5039 ) ### What this PR does / why we need it? This Pull Request removes the @pytest.mark.skip decorators from test_mtp1_correctness_piecewise_graph and test_mtp2_correctness_piecewise_graph. These tests were temporarily skipped because of an issue with the MTP ACL Graph (as per the original TODO comment). Since the relevant bug/issue has been resolved, these tests are now re-enabled to ensure full correctness coverage for MTP functionality. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-16 10:40:00 +08:00
Li Wang	6063853ead	[Misc] Upgrade vllm commit hash to 1215 (#5029 ) ### What this PR does / why we need it? Upgrade vllm commit hash to `4429d934de3c5cc327b0d7aec8e473aeba38db90` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-16 09:23:02 +08:00
MengLong Chen	5e0ada5395	[Bugfix] Fix the attn_metadata is None (#5038 ) ### What this PR does / why we need it? Fix the bug "TypeError: 'NoneType' object is not iterable" in vllm_ascend/compilation/acl_graph.py. The cause is that attn_metadata is None in the dummy_run of MTP. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2025-12-16 09:14:05 +08:00
Clorist33	d43cabc2b1	[Bugfix] Fix precision issues in moe_mlp (vllm-ascend main) (#5025 ) ### What this PR does / why we need it? Use group_list[0] to replace group_diff[0] in function "cumsum_group_list" (moe_mlp.py). The purpose is to modify it to the correct logic of converting cumsum to count. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>	2025-12-16 08:39:54 +08:00
fems14	b662d914a4	[bugfix] [main] Fix KV cache query inconsistency across different TP ranks in the KV Pool (#5030 ) ### What this PR does / why we need it? In the current KV Pool scenario for models like MLA and GQA, where different TP ranks generate identical KV caches, the system is designed to store only a single copy. The previous approach allowed each card to query storage requirements dynamically, but inconsistent query results across cards led to incorrect storage. To fix this, the new solution pre-allocates storage responsibilities; each card now simply stores its pre-assigned blocks, bypassing the inconsistent query step and ensuring data correctness. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-12-15 21:56:05 +08:00
Jade Zheng	c064d11fd7	[Cleanup] Remove unused attn_metadata parameter from Proposer classes (#4862 ) The `attn_metadata` is not used by any draft proposer, so we can remove it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 21:21:38 +08:00
whx	a9625851ef	[Attention] Temporarily add back pa for small batch sizes. (#4765 ) ### What this PR does / why we need it? This PR adds back pa in scenarios of small batch sizes due to performance consideration. Will remove pa once fia performs better than pa in all scenarios. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: whx-sjtu <2952154980@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-15 20:35:50 +08:00
baxingpiaochong	95e6400128	[KVPool]Fix PP get bug (#5007 ) ### What this PR does / why we need it? When kv caches are evicted from the key-value pool, it's possible that the kv cache for pp0 is still active, but the kv cache for pp1 has already been evicted. Therefore, a unified check is needed during the get operation. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 20:27:57 +08:00
InSec	a5cb8e40f5	[doc]Modify quantization tutorials (#5026 ) ### What this PR does / why we need it? Correct a few mistakes in the quantization tutorials Qwen3-32B-W4A4.md and Qwen3-8B-W4A8.md. Qwen3-8B-W4A8: one idle NPU card needs to be set. Qwen3-32B-W4A4: two idle NPU cards need to be set for the flatquant training, and the calib_file path, which did not match the ModelSlim version, is modified. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: IncSec <1790766300@qq.com>	2025-12-15 20:12:06 +08:00
zhangyiming	e90e8afc94	[E2E] Collect test run time. (#5018 ) ### What this PR does / why we need it? [E2E] Collect test run time. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-15 20:06:48 +08:00
zhangxinyuehfad	019c8e03c2	[CI] Delete deepseek3.2-exp nightly test (#5028 ) ### What this PR does / why we need it? Delete the deepseek3.2-exp nightly test first; deepseek3.2-exp will be replaced with deepseek3.2 after the nightly tests pass. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-15 20:01:53 +08:00
Li Wang	8d2998d0e4	[Misc] Upgrade vllm hash to 12_14 (#5000 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? 1. fix https://github.com/vllm-project/vllm/pull/27938 2. fix https://github.com/vllm-project/vllm/pull/27145 pooling models now support chunked prefill and prefix caching 3. fix https://github.com/vllm-project/vllm/pull/30181 define the CPU fields in the field config where they really belong. 4. fix https://github.com/vllm-project/vllm/pull/28168 define the CPU fields in the field config where they really belong. 5. fix https://github.com/vllm-project/vllm/pull/30201 some module rename 6. fix https://github.com/vllm-project/vllm/pull/29067 fusedmoe module refactor 7. fix https://github.com/vllm-project/vllm/pull/29066 fusedmoe module refactor 8. fix https://github.com/vllm-project/vllm/pull/29624 ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-15 19:54:23 +08:00
wangx700	3b7eb5179f	[Bugfix] fix the incorrect use of python's sum on tensors. (#4655 ) ### What this PR does / why we need it? Fix the incorrect use of python's sum function on PyTorch tensors. 1. Using Python's sum() function on a tensor self.num_pcp_pads resulted in 6ms execution time Optimization: replacing with PyTorch's torch.sum() reduced execution time to 474µs 2. scheduler_output.scheduled_spec_decode_tokens undergoes repeated loop processing even when speculative decoding is not used Optimization: added conditional logic to skip processing loops when speculative decoding is disabled, eliminating unnecessary computational overhead. - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangx700 <wangxin700@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-15 19:22:40 +08:00
zengzengran	6029bea480	[UT]add pcp dcp ut (#4949 ) ### What this PR does / why we need it? Adding UT for DCP/PCP -vLLM version: v0.12.0 -vLLM main: `ad32e3e19c` Signed-off-by: zengran <zengran2@huawei.com>	2025-12-15 18:41:38 +08:00
Icey	5fae65f3a8	[Graph][Fusion] Add AddRMSNorm(with bias) and Quant Fusion Pattern (#5011 ) ### What this PR does / why we need it? AddRMSNorm(with bias) and Quant Fusion Pattern ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-15 18:37:56 +08:00
fluctlux	6de4bedd04	update release note for suffix decoding (#5009 ) update release note for suffix decoding - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>	2025-12-15 17:22:19 +08:00
Levi	df7e0fe916	[Bugfix] qwen3-vl-235b-w8a8 load weight ERROR when start service (#4292 ) ### What this PR does / why we need it? Fix the qwen3-vl-w8a8 weight-loading error when starting the service. With this PR, 0.12.0rc1 can start qwen3-vl-235b-w8a8. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-15 16:39:58 +08:00
knight0528	e25c57b346	[Bugfix] Add support for PP intermediate value types in graph mode (#4902 ) This PR adds support for handling intermediate value types in pipeline parallelism when running in graph mode. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhangshushun <3265779424@qq.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 16:27:17 +08:00
zzhxxx	e16444f21f	[Bugfix] Fix the bug in initializing the shared_weight communication domain in sfa-cp, and fix the mtp weight load in pp>1 situation (#4913 ) ### What this PR does / why we need it? In PR #4188, a small bug was introduced that caused sfa-cp to be unable to find the global_pp_size parameter during initialization, and this PR fixed the issue. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 16:21:49 +08:00
SILONG ZENG	70606e0bb9	[Test]update accuracy test of models (#4911 ) ### What this PR does / why we need it? Delete accuracy tests for models that are no longer retained: - Meta-Llama-3.1-8B-Instruct - llava-1.5-7b-hf - InternVL2-8B.yaml - InternVL2_5-8B.yaml - InternVL3-8B.yaml Add accuracy tests for the new models: - Llama-3.2-3B-Instruct - llava-onevision-qwen2-0.5b-ov-hf - Qwen3-VL-30B-A3B-Instruct - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-12-15 15:04:20 +08:00
Chao Lei	b75bfc58f6	[Doc ] Supplement kvpool user guide (#5013 ) ### What this PR does / why we need it? Supplement detailed descriptions for `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT` in kvpool. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: LCAIZJ <leichao139636@163.com>	2025-12-15 14:24:39 +08:00
Chen Chen	aa02a85e4d	[bugfix] Fix dummy-run and multi-node issues in MoE routing and MTP (#4947 ) ### What this PR does / why we need it? - Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the routing kernel completes correctly instead of exiting early in certain paths. - Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher and prepare/finalize routines, aligning MoE communication with the MC2 algorithm optimized for Ascend devices. - Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back to `ALLTOALL` until the MoE communication type selection logic is fully finalized, avoiding incorrect behavior in dummy-run flows. - Simplify the MoE communication selection for Ascend 910-93 in `NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`, which fixes failures in multi-node / larger-EP configurations while keeping MC2 routing under the configured token capacity. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-12-15 14:18:23 +08:00
dependabot[bot]	cc7b302020	Bump actions/upload-artifact from 5 to 6 (#5014 ) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 6. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-12-15 14:13:06 +08:00
drslark	8fb0ef5ffa	[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#4932 ) ### What this PR does / why we need it? Fixes an accuracy bug of Qwen3-next-MTP in batched inference. It is described in https://github.com/vllm-project/vllm-ascend/issues/4930. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com>	2025-12-15 13:22:30 +08:00
wujinyuan1	545e856971	[Refactor]3/N Refactor mla_v1.py & extract mla_cp (#4933 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The CP-related functions differ significantly from those of normal MLA-Attention, but the two are tightly coupled. Steps: Isolate PCP and DCP (1) create a new python file: mla_cp.py (2) add classes AscendMlaCPImpl and AscendMlaCPMetadataBuilder, which inherit from AscendMLAImpl and AscendMLAMetadataBuilder (3) move PCP- and DCP-related methods from mla_v1.py to mla_cp.py - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-15 12:59:18 +08:00
ming1212	98b9e2e18e	Add Qwen3-Next tutorials (#4607 ) ### What this PR does / why we need it? This PR provides an introduction to the Qwen3-Next model, details on the features supported by the model in the current version, the model deployment process, as well as methods for performance testing and accuracy testing. With this document, the deployment and testing of the Qwen3-Next model can be implemented more easily. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ming1212 <2717180080@qq.com> Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-15 11:48:22 +08:00
Mengqing Cao	6beb4434e1	[CI][Bugfix] Fix scheduleroutput has no attr get error in prompt logprobs (#4998 ) ### What this PR does / why we need it? Fix scheduleroutput has no attr get error in prompt logprobs Fix https://github.com/vllm-project/vllm-ascend/actions/runs/20194753373/job/57977131870 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-15 11:10:39 +08:00
Li Wang	2497bbbaf6	[Misc] Update pooling example (#5002 ) ### What this PR does / why we need it? Since the param `task` has been deprecated, we should use the latest unified standard parameters for pooling models, which is clearer. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-15 08:36:19 +08:00
LookAround0301	bb7b74c14f	add ut for model runner (#4991 ) ### What this PR does / why we need it? add ut for model runner - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2025-12-14 11:16:20 +08:00
wangxiyuan	8090914d69	[CI] CI refactor (#4928 ) 1. rename workflow to better name 2. fix lint error 3. remove accuracy report doc and test - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-14 11:09:56 +08:00
AlvisGong	ba28d54f35	[Perf]enable prefill flashcommon3 (#4065 ) ### What this PR does / why we need it? moe multistream overlap to improve the performance. ### How was this patch tested? --additional-config '{"multistream_overlap_gate": true}' - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: AlvisGong <gwly0401@163.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2025-12-14 09:34:13 +08:00
Yizhou	0686b32d82	[Fix] Fixes issues in MTP with async scheduling and ACL graph (#4963 ) ### What this PR does / why we need it? Corrects attention metadata size for MTP when both asynchronous scheduling and full ACL graph mode are enabled. This prevents potential size mismatches during execution. Additionally, improves the robustness of calculating token sample indices by explicitly aligning tensor shapes. Finally, prevents padding when the number of input tokens exceeds the maximum ACL graph batch size to avoid out-of-bounds errors. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Need to add corresponding test case ASAP. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Yizhou <136800916+yiz-liu@users.noreply.github.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-14 00:10:11 +08:00
wangxiyuan	42ceaf08a1	add release note for 0.12.0 (#4995 ) Add release note for v0.12.0rc1 Update deepseek3.2 tutorial doc - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-13 22:09:59 +08:00
Li Wang	0f92d34a70	[CI] Pull latest vllm-ascend src before tests (#4988 ) ### What this PR does / why we need it? Currently, our image build suffers from errors during cross-compilation, which sometimes cause the image build to fail (see https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186). This results in the nightly test code not being the latest version. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-13 19:04:14 +08:00
wangxiyuan	fd7c929145	[perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks (#4983 ) pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix the merge conflict ### What this PR does / why we need it? Currently, the all_reduce operation in _sync_metadata_across_dp is performed with the gloo backend, which is extremely time-consuming when DPEngineCores are on different nodes. This operation cannot be hidden by async scheduling in multi-node scenarios with speculative decoding (e.g., EAGLE, mtp). This PR eliminates the all_reduce operation for D Nodes and changes the input parameters of the MoEDispatch & MoeCombine operators to make MC2EP support different num_tokens across all ranks. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios while enabling async scheduling. This PR removes the cross-node all_reduce with the gloo backend and further reduces latency with correct accuracy. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>	2025-12-13 18:59:54 +08:00
wangxiyuan	5211e991ad	Revert "[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 )" (#4981 ) This reverts commit `332b547728`. It breaks deepseek3.2 in the PD case. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c`	2025-12-13 18:58:55 +08:00
lilinsiman	31c94b7e7b	[doc][main] Correct more doc mistakes (#4958 ) ### What this PR does / why we need it? Correct more doc mistakes - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-12-13 18:36:58 +08:00
zhenwenqi2024	4721e4f53f	[bugfix] asyncscheduler bug fix (#4968 ) ### What this PR does / why we need it? vllm-ascend now uses AsyncGPUModelRunnerOutput; the earlier AsyncNPUModelRunnerOutput is outdated, so it needs to be fixed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2025-12-13 17:04:54 +08:00
realliujiaxu	3581946256	[Bugfix] fix eagle proposer (#4971 ) ### What this PR does / why we need it? After https://github.com/vllm-project/vllm-ascend/pull/4764, many tensors created by `make_buffer` should be renamed, e.g. `input_ids` -> `input_ids.gpu`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-12-12 22:39:49 +08:00
Jade Zheng	45889a6185	[Bugfix] Pass vllm_config to kv_connector_no_forward in NPUModelRunner (#4970 ) ### What this PR does / why we need it? The newest version crashes in PD separation scenarios because the function is missing the `vllm_config` parameter. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-12 22:36:23 +08:00
MengLong Chen	fa367e3b1a	[CI] Add mtp_proposer ut (#4397 ) ### What this PR does / why we need it? Add mtp_proposer ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2025-12-12 20:41:31 +08:00
lilinsiman	fc818f1509	[doc][main] Correct mistakes in doc (#4945 ) ### What this PR does / why we need it? Correct mistakes in doc - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-12-12 19:17:10 +08:00
zhenwenqi2024	f708d919f8	[Feature] model_runner refactor (#4764 ) ### What this PR does / why we need it? Refactor npu_modelrunner so that it stays close to gpu_modelrunner. ### Does this PR introduce _any_ user-facing change? NO - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>	2025-12-12 17:27:09 +08:00
Li Wang	5b12c068f9	[Nightly] Remove gen_ranktable logic (#4941 ) ### What this PR does / why we need it? Since `llmdatadist` has been sunset, the gen_ranktable logic should also be removed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-12 17:20:18 +08:00
lty	0cdf98ac48	[usability]Modify the default value of the protocol to ascend (#4959 ) ### What this PR does / why we need it? The recommended configuration in the document kv_pool.md is ascend. Modify the default value of the protocol to ascend to improve usability. #### 1. Configure mooncake.json The environment variable MOONCAKE_CONFIG_PATH is configured to the full path where mooncake.json is located. ``` { "local_hostname": "xx.xx.xx.xx", "metadata_server": "P2PHANDSHAKE", "protocol": "ascend", "device_name": "", "alloc_in_same_node": true, "master_server_address": "xx.xx.xx.xx:50088", "global_segment_size": "1GB" (1024MB/1048576KB/1073741824B/1073741824) } ``` local_hostname: Configured as the IP address of the current master node. metadata_server: Configured as P2PHANDSHAKE. protocol: Configured for Ascend to use Mooncake's HCCL communication. device_name: "" alloc_in_same_node: Indicator for preferring local buffer allocation strategy. master_server_address: Configured with the IP and port of the master service. global_segment_size: Expands the kvcache size registered by the PD node to the master. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Launch the pooled vLLM service without setting a protocol in the Mooncake configuration, and test whether the pooling function works. Signed-off-by: lty <linhebiwen@gmail.com>	2025-12-12 16:56:18 +08:00
wangyao-i	0983c5510a	vllm-ascend support Ascend950 with Qwen dense model. (#4228 ) ### What this PR does / why we need it? vllm-ascend support Ascend950 with Qwen dense model ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangyao <iwangyao@outlook.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-12 15:50:57 +08:00
liziyu	716c4dacfe	update qwen2.5vl readme (#4938 ) ### What this PR does / why we need it? fix qwen2.5vl readme, del gen ranktable and add install mooncake - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: liziyu <liziyu16@huawei.com>	2025-12-12 15:40:07 +08:00
Li Wang	4ae7588c52	[Doc] Upgrade outdated doc (#4957 ) ### What this PR does / why we need it? Updated some issues that caused sleep mode document content to be unavailable due to changes/outdated environment variables. --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-12 15:38:29 +08:00
1092626063	62a9fea7af	【doc】Add model feature matrix (#4950 ) ### What this PR does / why we need it? doc tutorials add model feature matrix： DeepSeekR1 DeepSeekV3.1 Qwen3-Dense Qwen3-Moe Qwen3-Next Qwen2.5 Qwen2.5-VL Qwen3-VL ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-12 15:37:39 +08:00
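
Note on commit d43cabc2b1 (cumsum to count): the message says the first element must come from `group_list[0]` when a cumulative sum is converted back to per-group counts. A minimal plain-Python sketch of that conversion follows. It is not the code in moe_mlp.py; the function name `cumsum_to_count` and the sample values are hypothetical and used only for illustration.

```python
from typing import List


def cumsum_to_count(group_list: List[int]) -> List[int]:
    """Convert cumulative group sizes into per-group counts.

    Hypothetical helper for illustration; not taken from moe_mlp.py.
    """
    if not group_list:
        return []
    # The first group has no predecessor, so its count is the first
    # cumulative value itself (group_list[0]), not a difference.
    counts = [group_list[0]]
    # Every later count is the difference between adjacent cumulative values.
    for prev, cur in zip(group_list, group_list[1:]):
        counts.append(cur - prev)
    return counts


cumulative = [3, 5, 5, 9]  # assumed sample: running totals of tokens per expert
print(cumsum_to_count(cumulative))  # [3, 2, 0, 4]
```

Seeding the result with a difference instead of the first cumulative value would misreport the size of the first group, which is the kind of error the commit message describes.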


1670 Commits