xc-llm-ascend

Author	SHA1	Message	Date
wangxiaoteng888	fff5df3efe	[P/D] Fix node crash caused by a second release of a force-freed request (#5968 ) ### What this PR does / why we need it? Releasing a force-freed request a second time causes the node to crash. When requests are pulled too quickly, they should not be added to the delay-free queue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-17 18:49:27 +08:00
lidenghui1110	48e10de8c9	[Bugfix] fix cpu offload hang with tp=1 (#5963 ) ### What this PR does / why we need it? As reported in issue #5948, when using cpu_offload_connector with TP=1, the server hangs on startup. We found several bugs to fix: 1. Some crashes were caused by code changes that came with vLLM version updates; some of them can be fixed as described in #5948, and this PR fixes all of them. 2. The hang described in #5948: the direct cause is that in cpu_offload_connector the RPC client uses the same client id in the scheduler and the worker when tensor_parallel_size is 1. This PR forces the client ids to be different, which fixes the hang. - Why didn't we find this hang before? Because our tests used --distributed-executor-backend mp or tensor_parallel_size > 1. In those test cases the scheduler and workers are different processes, so the client ids built by `worker-{os.getpid()}` are not the same. But with tensor_parallel_size=1, vLLM uses uniproc as the distributed-executor-backend by default, so the scheduler and worker are in the same process; the client ids are then identical and the server hangs. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-17 11:50:13 +08:00
lty	3cb0af0bcf	[Refactor]Refactor of vllm_ascend/distributed module (#5910 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 16:26:53 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which were never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are broken by https://github.com/vllm-project/vllm/pull/32148, where vLLM itself also fails https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will re-enable those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed, moving all kv_transfer-related code into a dedicated folder, as has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
liziyu	e1bed43cff	[P/D] bugfix for p node force free request (#5431 ) ### What this PR does / why we need it? Fix the bug where the P-node's scheduler dies after it force-frees a request due to timeout and then receives the completion of the kv cache pull from the D-node for that request again. The fix adds a list that records all requests. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-14 08:51:31 +08:00
liziyu	eed9e366a7	[Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (#5846 ) ### What this PR does / why we need it? Fix layerwise connector for decoder tp size > num kv heads. In this case prefiller should push kv cache to all decoder npu. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-13 17:30:33 +08:00
DreamerLeader	db7cf9b0ca	[bugfix] A2 Environment Pooling for Memcache Compatibility (#5601 ) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>	2026-01-13 09:07:38 +08:00
zzhxxx	db12c1e2c8	[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 ) ### What this PR does / why we need it? > Extracted from PR #5513 Based on the Sharded-CP feature PR:#4702; RFC:https://github.com/vllm-project/vllm/issues/30055 ### All-gather KV Cache for Communication Overlap: - This PR adjusts the calculation order in the SFA. - split `index_select` into `indexer_select_pre_process` and `indexer_select_post_process`. - Combine `nope`, `rope` and `index-k` into a tensor to perform asynchronous all-gather. ### benchmark: input=40k && num_batch_token=20k - before: ``` Mean TTFT (ms): 2614.52 Median TTFT (ms): 3148.03 P50 TTFT (ms): 3148.03 P90 TTFT (ms): 3163.48 P99 TTFT (ms): 3170.20 ``` - after: ``` Mean TTFT (ms): 2529.92 Median TTFT (ms): 3051.69 P50 TTFT (ms): 3051.69 P90 TTFT (ms): 3067.31 P99 TTFT (ms): 3072.15 ``` ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2026-01-11 09:47:27 +08:00
zxr2333	78b554dda9	[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722 ) ### What this PR does / why we need it? Add new function to mooncake layerwise connector, including: 1. supports sparse attention, for DeepSeek-V3.2 2. Distribute transfer tasks to redundant kv_head cards This PR is related to [[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support](https://github.com/vllm-project/vllm-ascend/issues/4842) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-10 23:04:16 +08:00
wangxiaoteng888	aa987ffe87	[P/D][bugfix]Fix the PCP port mapping error issue (#5706 ) ### What this PR does / why we need it? Fix the PCP port mapping error issue.In a multi-node PD separation scenario, when the PCP feature is enabled, there is an issue with the ZMQ transmission port. Specifically, the IP and port received by Side D do not match. The cause of this issue is an error in the port mapping update strategy logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-10 22:43:52 +08:00
fems14	ff4c1a47b3	[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751 ) ### What this PR does / why we need it? 1.Fixed memory retention on certain GPUs caused by missing PUT operations. 2.Fixed performance degradation resulting from architectural incompatibilities in the underlying refactor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-01-09 17:46:23 +08:00
lidenghui1110	481138e1d2	[bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (#4311 ) ### What this PR does / why we need it? func `get_kv_cache_spec` in model_runner changed a lot and caused error in cpuoffloading connector which is copied from model_runner, this PR adapts to new implemented `get_kv_cache_spec` to fix it. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-08 09:15:09 +08:00
zzhxxx	f7db812ed7	[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181 ) ### What this PR does / why we need it? - Delete the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED` - Introduce layer_sharding as a configurable feature in additional_config - Revise the term "shared weight" to "shard weight." Configuration: The feature is opt-in via the additional_config argument: ``` --additional-config '{ "layer_sharding": ["o_proj", "q_b_proj"] }' ``` This is orthogonal to standard tensor parallelism and weight replication strategies. It is treated as a separate, explicit feature. It can be used in any scenario, combined with the flashcomm2 (https://github.com/vllm-project/vllm-ascend/pull/3232) feature or the ShardedCP (#4702) feature, to achieve significant performance gains. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-08 09:05:02 +08:00
zxr2333	20a8cf061b	[BugFix][P/D] Fix pre-create link parameter error (#5694 ) ### What this PR does / why we need it? Fix pre-create link parameter error, `batch_transfer_sync_write` requires list. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-08 08:41:10 +08:00
Mengqing Cao	3f4f2b4ae6	[Refactor] Import global var from vllm instead of overwriting it (#5469 ) ### What this PR does / why we need it? Import the global variable from vllm instead of overwriting it, so that we use the correct global variable value - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-01-07 18:41:45 +08:00
UnifiedCacheManager	d6bb17f10e	[Bugfix]Add register_kv_cache in ucm_connector (#5657 ) ### What this PR does / why we need it? To adapt different shapes of the KV cache, UCM optimized the initialization of store by moving it into `register_kv_caches`. Therefore, this update adds `register_kv_caches` interface to UCMConnectorV1. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-07 11:30:33 +08:00
liziyu	330e25ab1d	[P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios (#5540 ) ### What this PR does / why we need it? [P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios 1. Session fusion: For transmission tasks at each layer, aggregate transmission tasks with the same destination and merge them into a single task for assignment. 2. Alltoall aggregation: For TP asymmetric scenarios, perform all alltoall operations at once according to the block granularity for all requests. [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support https://github.com/vllm-project/vllm-ascend/issues/4842 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-06 20:25:36 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
Chao Lei	473431e7e2	[P/D]Remove mooncake kvpool unused parameter `local_hostname` (#5574 ) ### What this PR does / why we need it? In mooncake kvpool, `local_hostname` is not used. Instead, the local IP is obtained directly via `get_ip()`. Therefore, remove this parameter to avoid confusion. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: LCAIZJ <leichao139636@163.com>	2026-01-05 20:18:59 +08:00
baxingpiaochong	46c2fc6a3c	[KVPOOL]decode save kvcache (#5168 ) ### What this PR does / why we need it? kvpool decode save kvcache now only support mla ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: Chao Lei <leichao139636@163.com>	2026-01-04 22:22:01 +08:00
lidenghui1110	d462577504	[Recover] [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) (reverted in #4981 ) (#5511 ) PR #4892 was reverted in #4981; we recover it now. We will track down and fix the potential bug that breaks deepseek3.2 in the PD case. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-04 16:49:33 +08:00
Chao Lei	d193316ded	[P/D] Bugfix zmq send/receive failed (#5503 ) ### What this PR does / why we need it? Currently, when the MooncakeConnector interacts via ZeroMQ, it throws the following exception upon send/receive failure: Issue 1: The currently used `zmq.REQ` socket follows a strict request-reply pattern, requiring an alternating sequence of send → receive → send → receive... If either a send() or receive() operation fails, the ZeroMQ socket becomes unusable. Solution: When a send() or receive() exception occurs, close and delete the ZeroMQ socket, and recreate it upon next use. Issue 2: In `_handle_request`, if `_send_done_recv_signal` raises an exception, the exception is thrown immediately and subsequent code is not executed, causing the decode logic to fail to properly release the request. Solution: Move the call to `_send_done_recv_signal` to the end of the function. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: LCAIZJ <leichao139636@163.com>	2025-12-31 19:17:08 +08:00
zxr2333	46a1614387	[P/D] Improve the performance of Layerwise Connector (#5303 ) ### What this PR does / why we need it? Improve the performance of Layerwise Connector, mainly includes the following points: 1. Use event synchronize to replace stream synchronize. 2. Access metaserver when scheduling. 3. Transfer kvcache each Chunk prefill segmentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-12-31 15:09:01 +08:00
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
wangxiaochao6	a539ae753a	[feature] mooncake support pcp/dcp in common conditions (#5224 ) ### What this PR does / why we need it? 1. This PR is proposed to support complicated pcp/dcp parallelisms in Prefill and Decode nodes in Mooncake, such as Prefill: TP8/PCP2DCP8 and Decode: TP8/DCP4/DP2, which is not supported now. We establish the link mappings to transfer KVCache between prefill and decode nodes. The main function is realized in Function of `_get_kv_split_metadata` in Mooncake_connector.py 2. After a prefill rank is pulled KVCache by a decode rank, the decode rank will send `DONE_RECVING_MSG` to the prefill rank and the prefill rank will free its KVCache blocks. If a prefill rank is pulled KVCache more than one time by several decode ranks and it surely could happen in complicated pcp/dcp parallelisms, it will cause the prefill rank free its KVCache blocks for several times, which could cause memory issue. This PR solve this issue by counting the times of prefill rank would be pulled KVCache and in the last time, it will free the prefill rank KVCache blocks. The related code is in Function of `run_busy_loop` in Mooncake_connector.py 3. If a prefill rank is not pulled KVCache by any decode ranks, the first rank in decode node will send "DONE_RECVING_MSG" to free its blocks. The related code is in Function of `_send_done_signal_to_free_remote_port` in Mooncake_connector.py ### How was this patch tested? This PR is tested in many pcp/dcp parallelisms, and the accuracy are all correct. 
MLA model: Prefill node: TP8/DP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DCP4/DP2 Prefill node: TP8/PCP2/DCP4, Decode node: TP4/DCP2/DP4 Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DCP4/DP4 Prefill node: TP8/PCP2, Decode node: TP4/DCP2 GQA model: Prefill node: TP8/DP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DCP2/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DP4 Prefill node: TP16/DCP2/PCP1, Decode node: TP8/DCP2/DP2 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` - Co-author by: Daishixun dsxtsteven@sina.com --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 09:53:03 +08:00
fems14	2ef4d1979e	[bugfix][main]KV Pool for KV Transfer in PD Disaggregation Scenarios (#5398 ) ### What this PR does / why we need it? 1.KV Pool for KV Transfer in PD Disaggregation Scenarios Error Resolution 2.Update KV Pool Documentation ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-12-27 09:53:57 +08:00
ApsarasX	3d9954eff0	[Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205 ) ### What this PR does / why we need it? In code files such as`mooncake_connector.py`, `vllm_config.model_config.hf_config` is used to get the LLM configs. This approach works for LLMs, but not for multi-modal models. For multi-modal models, `vllm_config.model_config.hf_text_config` must be used instead to get the LLM configs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-12-22 20:21:45 +08:00
shaopeng-666	fd9a47c04d	fix vl pd smoke error (#5103 ) ### What this PR does / why we need it? Fix the VL model Mooncake PD smoke test error ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-12-18 22:20:45 +08:00
zzhxxx	a74a1196c5	[Feat] Support MLP_TP feature, exclude MOE layer (#4999 ) #4257 This PR implements the dense_ffn TP of the first three layers of the deepseek model, I have refactored this PR and used very little code to support the implementation of this feature. This PR adds a function `is_moe_layer` to mlp_tp, which supports MLP TP in models with both mlp and moe, such as deepseek or chat GLM. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: 子潜 <ziqian@U-DMKXH32D-2015.local> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-18 20:06:53 +08:00
dsxsteven	97537709ae	[BugFix] Fix mooncake bug in PCP scenario (#5055 ) ### What this PR does / why we need it? The mooncake_connector.py file was importing the wrong arguments, which could cause errors when PCP is used; this has been corrected. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: daishixun <dsxsteven@sina.com>	2025-12-17 16:32:16 +08:00
weiguihua2	bf97048bce	[feat]pd disaggregated support cross-machine (#5008 ) ### What this PR does / why we need it? pd disaggregated support cross-machine. We send the primary and secondary node information of node p to node d. When node d pulls the KV data, it retrieves the corresponding primary or secondary node information from the mapping. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-17 09:28:03 +08:00
Chao Lei	9c02fa9867	[bugfix] Fix mooncake kvpool accuracy issue (#4976 ) ### What this PR does / why we need it? The current KVPool has an accuracy issue (https://github.com/vllm-project/vllm-ascend/issues/4412). This PR aims to fix the accuracy problem without impacting prefill performance. Note: Due to a bug in ADXL, calling `current_event.synchronize()` may occasionally hang. This issue will be fixed in CANN version 8.5.RC1. You can manually build the master branch of the project at https://gitcode.com/cann/hixl to resolve this issue before the 8.5.RC1 release. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: LCAIZJ <leichao139636@163.com>	2025-12-16 11:33:16 +08:00
UnifiedCacheManager	195eac665b	[Core][Worker] Add UCMConnector for KV Cache Offloading (#4411 ) ### What this PR does / why we need it? This PR introduces the initial integration of UCM (Unified Cache Management) into the vllm-ascend distributed KV-cache system. Specifically, it adds: - A new `UCMConnector` implementation under the distributed KV-transfer framework. - Support for offloading KV-cache blocks to external UCM backends (DRAM / NFS / Localdisk), depending on UCM configuration). - Integration with vLLM V1 KV connector interface, including metadata handling and role registration. Why it is needed: - UCM provides a unified, high-performance storage layer for KV-cache externalization. - This enables vllm-ascend to support out-of-core KV-cache workloads, improve memory efficiency, and leverage hardware-accelerated storage paths (RDMA / NFS / hybrid modes). - This connector is a required component to allow future work on multi-node inference + UCM-based scaling. --- ### Does this PR introduce _any_ user-facing change? Yes, but limited: - A new `kv_connector=UCMConnector` option becomes available through the configuration interface. - When selected, vllm-ascend workers may initialize UCM and offload KV-cache blocks externally. - No default behaviors are changed. Users must explicitly enable this connector. This PR does not modify: - existing APIs, - default execution paths, - model runner behavior, - user workflow unless `UCMConnector` is configured. --- ### How was this patch tested? --- ### Prefix Caching Benchmark We provide preliminary measurements for TTFT (ms) under VLLM benchmark. Tests run on 2 * Ascend 910B3, vllm-ascend 0.11.0, Tensor Parallel size 2, with UCM (Localdisk) enabled. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2025-12-16 10:53:30 +08:00
fems14	b662d914a4	[bugfix] [main] Fix KV cache query inconsistency across different TP ranks in the KV Pool (#5030 ) ### What this PR does / why we need it? In the current KV Pool scenario for models like MLA and GQA, where different TP ranks generate identical KV caches, the system is designed to store only a single copy. The previous approach allowed each card to query storage requirements dynamically, but inconsistent query results across cards led to incorrect storage. To fix this, the new solution pre-allocates storage responsibilities; each card now simply stores its pre-assigned blocks, bypassing the inconsistent query step and ensuring data correctness. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-12-15 21:56:05 +08:00
baxingpiaochong	95e6400128	[KVPool]Fix PP get bug (#5007 ) ### What this PR does / why we need it? When kv caches are evicted from the key-value pool, it's possible that the kv cache for pp0 is still active, but the kv cache for pp1 has already been evicted. Therefore, a unified check is needed during the get operation. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 20:27:57 +08:00
zzhxxx	e16444f21f	[Bugfix] Fix the bug in initializing the shared_weight communication domain in sfa-cp, and fix the mtp weight load in pp>1 situation (#4913 ) ### What this PR does / why we need it? In PR #4188, a small bug was introduced that caused sfa-cp to be unable to find the global_pp_size parameter during initialization, and this PR fixed the issue. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 16:21:49 +08:00
AlvisGong	ba28d54f35	[Perf]enable prefill flashcommon3 (#4065 ) ### What this PR does / why we need it? moe multistream overlap to improve the performance. ### How was this patch tested? --additional-config '{"multistream_overlap_gate": true}' - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: AlvisGong <gwly0401@163.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2025-12-14 09:34:13 +08:00
wangxiyuan	5211e991ad	Revert "[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 )" (#4981 ) This reverts commit `332b547728`. This breaks deepseek3.2 in the PD case. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c`	2025-12-13 18:58:55 +08:00
lty	0cdf98ac48	[usability]Modify the default value of the protocol to ascend (#4959 ) ### What this PR does / why we need it? The recommended configuration in the document kv_pool.md is ascend. Modify the default value of the protocol to ascend to improve usability. #### 1.Configure mooncake.json The environment variable MOONCAKE_CONFIG_PATH is set to the full path where mooncake.json is located. ``` { "local_hostname": "xx.xx.xx.xx", "metadata_server": "P2PHANDSHAKE", "protocol": "ascend", "device_name": "", "alloc_in_same_node": true, "master_server_address": "xx.xx.xx.xx:50088", "global_segment_size": "1GB" (1024MB/1048576KB/1073741824B/1073741824) } ``` local_hostname: Configured as the IP address of the current master node. metadata_server: Configured as P2PHANDSHAKE. protocol: Configured as ascend to use Mooncake's HCCL communication. device_name: "" alloc_in_same_node: Indicator for preferring the local buffer allocation strategy. master_server_address: Configured with the IP and port of the master service. global_segment_size: Expands the kvcache size registered by the PD node to the master. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Launch the pooled vLLM service without setting a protocol in the Mooncake configuration, and test whether the pooling function works. Signed-off-by: lty <linhebiwen@gmail.com>	2025-12-12 16:56:18 +08:00
lidenghui1110	d65fb194d9	[Feat] Add custom Embedding tensor model parallel (#2616 ) Similar to #2309 , this PR introduces Embedding tensor model parallel to reduce memory consumption. It supports both eager mode and graph mode. This PR also refactors the module tensor parallel configurations supported in #2309, #2167, #2120, merging all of them into `finegrained_tp_config` in `additional_config`, including: `lmhead_tensor_parallel_size` `oproj_tensor_parallel_size` `embedding_tensor_parallel_size` `mlp_tensor_parallel_size` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Co-authored-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-12 14:41:20 +08:00
Slightwind	b8a317caac	[main][Bugfix] Remove the ZMQ communication setup on the D node (#4926 ) In the PD separation scenario, the D node does not need to perform get operations, and therefore does not need to create ZeroMQ (ZMQ) communication. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-12-12 14:37:26 +08:00
Jade Zheng	3fade30275	[Bugfix] Prevent engine hang during KVCacheSendingThread startup (#4754 ) Previously, if the KVCacheSendingThread couldn't create a socket because of port conflicts or other problems, the main thread would wait endlessly for the ready_event signal, causing the entire engine initialization to freeze. This update fixes the issue by adding timeouts for thread startup and handling unexpected thread exits, so the initialization process no longer gets stuck indefinitely. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-11 18:39:25 +08:00
lidenghui1110	332b547728	[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) ### What this PR does / why we need it? The current mooncake connector has the following problems with PP and MTP enabled: 1. MTP layer kv caches are not transferred, which may lower the accept ratio: This PR adds MTP layer indices for the last PP stage after calculating end_layer in transfer_kv_cache 2. With MTP enabled, the default division of PP layers may cause imbalance between stages, so we need to use the `VLLM_PP_LAYER_PARTITION` environment variable to balance it by hand, but in mooncake connector kv transfer, the decode node doesn't know the partition of the prefill node: This PR adds the config `pp_layer_partition` in `kv_connector_extra_config` so that the decode node can obtain the partition information of the prefill node. ### Does this PR introduce _any_ user-facing change? When prefill uses the `VLLM_PP_LAYER_PARTITION` environment variable, add `pp_layer_partition` in `kv_connector_extra_config` like below: ``` export VLLM_PP_LAYER_PARTITION=33,28 "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 1, "tp_size": 8, "pp_size": 2, "pp_layer_partition": "33,28" }, "decode": { "dp_size": 16, "tp_size": 1, "pp_size": 1 } } ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2025-12-11 17:23:21 +08:00
zzhxxx	eac72f5f23	[Feat] Flashcomm2 use o_shared linear (#4188 ) ### What this PR does / why we need it? It is mentioned in the [flashcomm2 technical report](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf) that FC2 will introduce full redundant storage of the o_proj matrix, which will put pressure on the memory. Therefore, the technical report proposed a compromise solution using otp2, but it will introduce additional reduce-scatter communication. We propose a shared linear feature (#2931 ) that supports distributing weights layer by layer to each card, avoiding the need for TP splitting, and can solve the memory issue. This PR depends on #3232 and #2931 ### Flashcomm2 flowchart <img width="1142" height="878" alt="PixPin_2025-11-14_13-37-39" src="https://github.com/user-attachments/assets/d45ea8db-d8ef-4d45-8e18-abd4d82ce3e0" /> ### Does this PR introduce _any_ user-facing change? Use environment variables ```bash export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 export VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED=1 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <2783294813@qq.com> Co-authored-by: zzh02232027 <zzh02232027@antgroup.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-11 12:43:04 +08:00
lidenghui1110	a82b0fa70e	mooncake connector support pipeline parallel & fix pp with flashcomm1 (#4054 ) ### What this PR does / why we need it? To support pipeline parallel with PD disaggregation, this PR supports PP in the mooncake connector and fixes other bugs that appear when PP is enabled together with other optimization params, including the following changes: - mooncake connector supports pp in prefill; decode pp is not currently supported - fix bugs when both pp and flashcomm1 are enabled - optimize ascend-scheduler to support a full batch in multiple pipeline stages: in the original implementation the batch sizes of all pipeline stages summed to max_num_seq, so the pipeline was not full; this optimization lets every stage run with full batch_size = max_num_seq, and the same changes will be contributed to the vllm scheduler too. ### Does this PR introduce _any_ user-facing change? add `pp_size` in mooncake connector kv_connector_extra_config ``` "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 1, "tp_size": 4, "pp_size": 4 }, "decode": { "dp_size": 16, "tp_size": 1 } } ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: zss <zss@qq.com> Co-authored-by: zss <3265779424@qq.com>	2025-12-10 16:01:43 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
wangxiaoteng888	a77045f355	[P/D][main]Offline the llmdatadist connector related parts of the code and files. (#4780 ) ### What this PR does / why we need it? As support for the mooncake connector is now available, the llmdatadist connector is no longer being maintained, so the llmdatadist-related files need to be retired. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-12-09 22:36:43 +08:00
lty	dee00d0de3	[Usability]local_buffer_size support for units: GB, MB, KB, B (#4829 ) What this PR does / why we need it? Improve usability: local_buffer_size now supports the units GB, MB, KB, and B, for example "2GB": { "local_hostname": "XXX.XXX.XXX.XXX", "metadata_server": "P2PHANDSHAKE", "protocol": "ascend", "device_name": "", "use_ascend_direct": true, "master_server_address": "XXX.XXX.XXX.XXX:50088", "global_segment_size": 60000000000, "local_buffer_size": "2GB" } Does this PR introduce any user-facing change? local_buffer_size supports the units GB, MB, KB, and B. How was this patch tested? Mooncake was configured with local_buffer_size given in GB, MB, KB, and B. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lty <linxianchong1@huawei.com>	2025-12-09 17:52:24 +08:00
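A unit-aware `local_buffer_size` such as `"2GB"` could be parsed along the following lines. This is only a sketch: `parse_size` is a hypothetical name, and the choice of binary multiples (1 KB = 1024 B) is an assumption, since the commit message does not say whether the PR uses 1024 or 1000.

```python
import re

# Assumption: binary multiples; the real implementation may differ.
_UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
_SIZE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(GB|MB|KB|B)?\s*", re.IGNORECASE)


def parse_size(value):
    """Convert an int or a string like "2GB" into a byte count (hypothetical helper)."""
    if isinstance(value, int):
        return value  # plain integers are already byte counts
    match = _SIZE_RE.fullmatch(str(value))
    if match is None:
        raise ValueError(f"unsupported size: {value!r}")
    number, unit = match.groups()
    multiplier = _UNIT_BYTES[(unit or "B").upper()]
    return int(float(number) * multiplier)
```

Accepting plain integers unchanged keeps existing configs, such as the numeric `global_segment_size` above, working alongside the new string form.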
baxingpiaochong	dda027e680	[KVPOOl]Support pp (#4761 ) ### What this PR does / why we need it? Support pp for kv pool - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-12-09 16:15:26 +08:00


147 Commits