xc-llm-ascend

Author	SHA1	Message	Date
liziyu	14bef9af6f	[P/D] Remove restrictions on mooncake for IPv6 (#5946 ) ### What this PR does / why we need it? Remove restrictions on mooncake for IPv6 Dependencies: cann8.5、mooncake v0.3.8.post1 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-24 11:30:22 +08:00
UnifiedCacheManager	a2f022f9b6	[UCMConnector]Add has_connector_metadata (#6172 ) ### What this PR does / why we need it? ucm_connector add has `has_connector_metadata` interface to adapt to the latest KV connector in vLLM. ### Does this PR introduce _any_ user-facing change? this PR doesn't introduce _any_ user-facing change. ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-23 21:16:48 +08:00
baxingpiaochong	8786412f5c	[Bugfix]KV pool rank 0 consumes more HBM (#6113 ) ### What this PR does / why we need it? before add_set_deivce <img width="2354" height="674" alt="image" src="https://github.com/user-attachments/assets/8b81ab5f-b9ba-4fd2-8546-8f36ac15d32b" /> after <img width="1044" height="156" alt="image" src="https://github.com/user-attachments/assets/996d845a-8abd-4aae-b894-4a9832b1f742" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2026-01-23 19:47:33 +08:00
weiguihua2	4173255c0c	[main][Bugix] fix kv pcp+pooling+pd separation bug (#6153 ) ### What this PR does / why we need it? Rectify the problem that the pcp and pd separation and kv pooling scenario. In the pooling scenario, multi_nodes_meta_mapping is empty. As a result, an error is reported when the remote_host information is obtained through the get_remote_port_send_num method. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-23 16:15:04 +08:00
wangxiaoteng888	82a2b3bcc7	[P/D]Add ssl cert for metaserver proxy (#5875 ) ### What this PR does / why we need it? When the P node accesses the proxy meteserver, add the SSL certificate and the CA certificate path to improve security. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-23 11:11:44 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
wangxiaoteng888	f2c0ced06d	[P/D][PCP]bugfix pcp force free twice caused logger error (#6124 ) ### What this PR does / why we need it? The issue of the D node mistakenly sending the pull-end signal twice, leading to the P node printing logger errors abnormally, has been resolved. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-22 16:24:33 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986 ) ### What this PR does / why we need it? This patch purpose to optimize the lint check term. The main idea is to reduce unnecessary installation time. 1. The installation of vllm is not must, only append the path of vllm src to the `PATHONPATH` is effective 2. This installation of `requirements-dev.txt` is not must, we have a pre-built image `quay.io/ascend-ci/vllm-ascend:lint` with all the requirements installed in advance. NOTE: the conditions for triggering image builds are: 1).Daily scheduled build; 2) Build when requirements are modified; 3) Manual build. This ensures that the dependencies in our image are up-to-date to the greatest extent possible. 3. The `mypy` was separated from the `pre-commit` hook for performance reasons; we found that integrating `mypy` into the `pre-commit` hook resulted in poor performance. 4. Reduce the CPU core consumption from 16 -> 8 ### Does this PR introduce _any_ user-facing change? The end-to-end lint time was optimized from 20min/per PR to 8min/per PR ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
JiangWeixiang	cef04b3555	[bugfix] adapt_remote_request_id (#6051 ) This PR addresses a request ID mismatch issue in the PD (Prefill-Decoding) separation deployment scenario for vllm-ascend. Upstream vLLM recently mitigated request ID collisions by appending a random suffix to each request_id (e.g., req-123 → req-123-abc), refer to [PR-27987](https://github.com/vllm-project/vllm/pull/27987 ) & [PR-29665](https://github.com/vllm-project/vllm/pull/29665). While this works in single-node deployments, it breaks compatibility in PD-separated setups: the Producer (Prefill node) and Consumer (Decoding node) end up with different request_id values, preventing the Consumer from correctly retrieving the KV cache generated by the Producer. To resolve this, this PR introduces a new field remote_request_id in the metadata passed via mooncake_connector. The Producer preserves and forwards the original (unmodified) request_id as remote_request_id. The Consumer then uses this remote_request_id—instead of its locally generated suffixed ID—to fetch the correct KV cache from the Prefill node. This ensures consistent request identification across PD nodes while maintaining compatibility with upstream vLLM’s request ID deduplication mechanism. <img width="1279" height="781" alt="image" src="https://github.com/user-attachments/assets/274238c1-dab6-4d3a-9ee4-6e578679b762" /> - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: ghphotoframe <854746559@qq.com> Co-authored-by: jiangweixiang <jwx02384838@antgroup.com>	2026-01-22 10:48:40 +08:00
zzhxxx	dd8571860d	[Feature] Support DSA-CP for Hybrid scenario (#5702 ) Signed-off-by: zzhx1 <zzh_201018@outlook.com> ### What this PR does / why we need it? > Extracted from PR #5513 Based on the Sharded-CP feature PR:#4702; RFC:https://github.com/vllm-project/vllm/issues/30055 ### Support FULL_DECODE_ONLY Mode under PD-Mixed Scenario: Extends DSA-CP to handle the FULL_DECODE_ONLY execution mode when running in a prefill-decode mixed (PD-mixed) serving environment, improving throughput and resource utilization for decode-intensive workloads. In pure prefill nodes: - Both q_proj and o_proj are sharded across world ranks, using broadcast for weights distribution. In PD-mixed nodes (supporting both prefill and decode): - q_proj is fully replicated (not sharded) to avoid communication overhead during decoding. - o_proj Using the original TP `RowParallelLinear` method to store weights During prefill execution: - o_proj forwards through all_gather to collect weights, reconstructing the complete o_proj weights on each card. During decode (graph replay phase): - Additional all_to_all (before o_proj) and reduce_scatter (after o_proj) are introduced to enable sequence-parallel output aggregation while maintaining correctness under SFA CP. ### benchmark: - TTFT increased by 527% - TPOT increased by 180% <img width="1550" height="938" alt="image" src="https://github.com/user-attachments/assets/9b7a03d8-a3db-4a99-8923-6e5bfcfecf72" /> ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Co-authored-by: clrs97 <524936896@qq.com>	2026-01-22 10:12:09 +08:00
DreamerLeader	b6d55fc48e	[Bugfix]Fixed precision issues caused by pooled request pooling (#6049 ) ### What this PR does / why we need it? Fixed precision issues caused by pooled request pooling ### Does this PR introduce _any_ user-facing change? pr6045 ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>	2026-01-20 23:51:31 +08:00
fems14	8b98d7a4e8	【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (#6045 ) ### What this PR does / why we need it? Resolved a double-free memory vulnerability in the pooling layer under re-computation scenarios. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: fems14 <1804143737@qq.com>	2026-01-20 22:56:04 +08:00
wangxiaochao6	bc486d9530	[main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (#5960 ) ### What this PR does / why we need it? In PD disaggregation case, when P has multi nodes, mooncake fails to send data. Fix the issue in this PR. The details: If a P rank does not need to transfer kv cache to any one D rank, D node should send a message to P node to release the kv cache in P node. If P has multi nodes, D node should know the corresponding IP in each P node, then D node can send message to the right P node. Otherwise, send data error will happen. This PR fix this issue by providing P nodes IP to D node through Parameter `remote_port_send_num`. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2026-01-19 16:35:13 +08:00
LICO67373	687df88151	[Refactor] Move AttentionSpec initialization to Attention module (#5834 ) ### What this PR does / why we need it? This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec creation to each attention module's own `get_kv_cache_spec()` method, aligning with the vllm source code structure. Changes: - Simplify `get_kv_cache_spec` in `model_runner_v1.py` and `cpu_offload_connector.py` - Remove manual `AttentionType` checks for `Attention` modules - Delegate spec creation to each attention module's `get_kv_cache_spec` method directly - Let `MambaBase` layers use their own `get_kv_cache_spec` method - Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as Ascend-specific handling This change follows RFC #5463 item 12: move AttentionSpec to Attention module. - Fixes #5463 (item 12) ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring that simplifies code structure without changing any external behavior. ### How was this patch tested? - Syntax validation passed via `python -m py_compile` - CI tests will verify the changes work correctly with existing test cases - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-19 14:22:18 +08:00
wangxiaoteng888	fff5df3efe	[P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (#5968 ) ### What this PR does / why we need it? The force-free secondary release request causes the node to crash. When requests are pulled too quickly, they should not be added to the delay-free queue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-17 18:49:27 +08:00
lidenghui1110	48e10de8c9	[Bugfix] fix cpu offload hang with tp=1 (#5963 ) ### What this PR does / why we need it? As issue #5948 reported，when using cpu_offload_connector with TP=1, the server will hang on starting, we found several bugs here to fix. 1. some crash error encountered because of code changed with vllm version updating, some of them can be fixed as #5948, and this PR fixed all of them. 2. hang problem described in #5948, the direct reason is that in cpu_offload_connector, RPC client using the same client id in scheduler and worker when tensor_parrallel_size is 1, this PR force the client id to be different, then it is fixed. - Why we didn't find this hang problem before? Because we using --distributed-executor-backend mp or tensor_parrallel_size > 1 in our test, in our old test case, the scheduler and workers are different procceses, then client ids build by `worker-{os.getpid()}` are not the same. But when using tensor_parrallel_size=1, vllm will use uniproc as distributed-executor-backend by default, the scheduler and worker will by in the same proccess, then client ids are the same and hang. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-17 11:50:13 +08:00
lty	3cb0af0bcf	[Refactor]Refactor of vllm_ascend/distributed module (#5910 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 16:26:53 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are caused by https://github.com/vllm-project/vllm/pull/32148 where vllm is also failed https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reachs https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed, moving all kv_transfer realtaed codes into a dedicated folder, which has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
liziyu	e1bed43cff	[P/D] bugfix for p node force free requset (#5431 ) ### What this PR does / why we need it? Fix the bug where the P-node's schedule dead after it force-frees a request due to timeout and then receives the completed kv cache pulled by the D-node again. By add list to recode all requests. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-14 08:51:31 +08:00
liziyu	eed9e366a7	[Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (#5846 ) ### What this PR does / why we need it? Fix layerwise connector for decoder tp size > num kv heads. In this case prefiller should push kv cache to all decoder npu. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-13 17:30:33 +08:00
DreamerLeader	db7cf9b0ca	[bugfix] A2 Environment Pooling for Memcache Compatibility (#5601 ) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>	2026-01-13 09:07:38 +08:00
zzhxxx	db12c1e2c8	[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 ) ### What this PR does / why we need it? > Extracted from PR #5513 Based on the Sharded-CP feature PR:#4702; RFC:https://github.com/vllm-project/vllm/issues/30055 ### All-gather KV Cache for Communication Overlap: - This PR adjusts the calculation order in the SFA. - split `index_select` into `indexer_select_pre_process` and `indexer_select_post_process`. - Combine `nope`, `rope` and `index-k` into a tensor to perform asynchronous all-gather. ### benchmark: input=40k && num_batch_token=20k - before: ``` Mean TTFT (ms): 2614.52 Median TTFT (ms): 3148.03 P50 TTFT (ms): 3148.03 P90 TTFT (ms): 3163.48 P99 TTFT (ms): 3170.20 ``` - after: ``` Mean TTFT (ms): 2529.92 Median TTFT (ms): 3051.69 P50 TTFT (ms): 3051.69 P90 TTFT (ms): 3067.31 P99 TTFT (ms): 3072.15 ``` ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2026-01-11 09:47:27 +08:00
zxr2333	78b554dda9	[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722 ) ### What this PR does / why we need it? Add new function to mooncake layerwise connector, including: 1. supports sparse attention, for DeepSeek-V3.2 2. Distribute transfer tasks to redundant kv_head cards This PR is related to [[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support](https://github.com/vllm-project/vllm-ascend/issues/4842) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-10 23:04:16 +08:00
wangxiaoteng888	aa987ffe87	[P/D][bugfix]Fix the PCP port mapping error issue (#5706 ) ### What this PR does / why we need it? Fix the PCP port mapping error issue.In a multi-node PD separation scenario, when the PCP feature is enabled, there is an issue with the ZMQ transmission port. Specifically, the IP and port received by Side D do not match. The cause of this issue is an error in the port mapping update strategy logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-10 22:43:52 +08:00
fems14	ff4c1a47b3	[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751 ) ### What this PR does / why we need it? 1.Fixed memory retention on certain GPUs caused by missing PUT operations. 2.Fixed performance degradation resulting from architectural incompatibilities in the underlying refactor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-01-09 17:46:23 +08:00
lidenghui1110	481138e1d2	[bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (#4311 ) ### What this PR does / why we need it? func `get_kv_cache_spec` in model_runner changed a lot and caused error in cpuoffloading connector which is copied from model_runner, this PR adapts to new implemented `get_kv_cache_spec` to fix it. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-08 09:15:09 +08:00
zzhxxx	f7db812ed7	[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181 ) ### What this PR does / why we need it? - Delete the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED` - Introduce layer_sharding as a configurable feature in additional_config - Revise the term "shared weight" to "shard weight." Configuration : The feature is opt-in via the additional_config argument: ``` --additional-config '{ "layer_sharding": ["o_proj", "q_b_proj"] }' ``` This is orthogonal to standard tensor parallelism and weight replication strategies. It is treated as a separate, explicit feature.It can be used in any scenario, combined with the flashcomm2https://github.com/vllm-project/vllm-ascend/pull/3232 feature or the ShardedCP #4702 feature, to achieve significant performance. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-08 09:05:02 +08:00
zxr2333	20a8cf061b	[BugFix][P/D] Fix pre-create link parameter error (#5694 ) ### What this PR does / why we need it? Fix pre-create link parameter error, `batch_transfer_sync_write` requires list. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-08 08:41:10 +08:00
Mengqing Cao	3f4f2b4ae6	[Refactor] Import global var form vllm instead of overwirte it (#5469 ) ### What this PR does / why we need it? Import global var form vllm instead of overwirte it, so that we could use the correct global variant value - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-01-07 18:41:45 +08:00
UnifiedCacheManager	d6bb17f10e	[Bugfix]Add register_kv_cache in ucm_connector (#5657 ) ### What this PR does / why we need it? To adapt different shapes of the KV cache, UCM optimized the initialization of store by moving it into `register_kv_caches`. Therefore, this update adds `register_kv_caches` interface to UCMConnectorV1. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-07 11:30:33 +08:00
liziyu	330e25ab1d	[P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios (#5540 ) ### What this PR does / why we need it? [P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios 1. Session fusion: For transmission tasks at each layer, aggregate transmission tasks with the same destination and merge them into a single task for assignment. 2. Alltoall aggregation: For TP asymmetric scenarios, perform all alltoall operations at once according to the block granularity for all requests. [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support https://github.com/vllm-project/vllm-ascend/issues/4842 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-06 20:25:36 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
Chao Lei	473431e7e2	[P/D]Remove mooncake kvpool unused parameter `local_hostname` (#5574 ) ### What this PR does / why we need it? In mooncake kvpool, `local_hostname` is not used. Instead, the local IP is obtained directly via `get_ip()`. Therefore, remove this parameter to avoid confusion. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: LCAIZJ <leichao139636@163.com>	2026-01-05 20:18:59 +08:00
baxingpiaochong	46c2fc6a3c	[KVPOOL]decode save kvcache (#5168 ) ### What this PR does / why we need it? kvpool decode save kvcache now only support mla ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: Chao Lei <leichao139636@163.com>	2026-01-04 22:22:01 +08:00
lidenghui1110	d462577504	[Recover] [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) (revert in #4981 ) (#5511 ) PR #4892 was revert in #4981, we recover it now. For the potential bug break deepseek3.2 in PD case, we will find it out and fix it. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-04 16:49:33 +08:00
Chao Lei	d193316ded	[P/D] Bugfix zmq send/receive failed (#5503 ) ### What this PR does / why we need it? Currently, when the MooncakeConnector interacts via ZeroMQ, it throws the following exception upon send/receive failure: Issue 1: The currently used `zmq.REQ` socket follows a strict request-reply pattern, requiring an alternating sequence of send → receive → send → receive... If either a send() or receive() operation fails, the ZeroMQ socket becomes unusable. Solution: When a send() or receive() exception occurs, close and delete the ZeroMQ socket, and recreate it upon next use. Issue 2: In `_handle_request`, if `_send_done_recv_signal` raises an exception, the exception is thrown immediately and subsequent code is not executed, causing the decode logic to fail to properly release the request. Solution: Move the call to `_send_done_recv_signal` to the end of the function. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: LCAIZJ <leichao139636@163.com>	2025-12-31 19:17:08 +08:00
zxr2333	46a1614387	[P/D] Improve the performance of Layerwise Connector (#5303 ) ### What this PR does / why we need it? Improve the performance of Layerwise Connector, mainly includes the following points: 1. Use event synchronize to replace stream synchronize. 2. Access metaserver when scheduling. 3. Transfer kvcache each Chunk prefill segmentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-12-31 15:09:01 +08:00
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
wangxiaochao6	a539ae753a	[feature] mooncake support pcp/dcp in common conditions (#5224 ) ### What this PR does / why we need it? 1. This PR is proposed to support complicated pcp/dcp parallelisms in Prefill and Decode nodes in Mooncake, such as Prefill: TP8/PCP2DCP8 and Decode: TP8/DCP4/DP2, which is not supported now. We establish the link mappings to transfer KVCache between prefill and decode nodes. The main function is realized in Function of `_get_kv_split_metadata` in Mooncake_connector.py 2. After a prefill rank is pulled KVCache by a decode rank, the decode rank will send `DONE_RECVING_MSG` to the prefill rank and the prefill rank will free its KVCache blocks. If a prefill rank is pulled KVCache more than one time by several decode ranks and it surely could happen in complicated pcp/dcp parallelisms, it will cause the prefill rank free its KVCache blocks for several times, which could cause memory issue. This PR solve this issue by counting the times of prefill rank would be pulled KVCache and in the last time, it will free the prefill rank KVCache blocks. The related code is in Function of `run_busy_loop` in Mooncake_connector.py 3. If a prefill rank is not pulled KVCache by any decode ranks, the first rank in decode node will send "DONE_RECVING_MSG" to free its blocks. The related code is in Function of `_send_done_signal_to_free_remote_port` in Mooncake_connector.py ### How was this patch tested? This PR is tested in many pcp/dcp parallelisms, and the accuracy are all correct. MLA model: Prefill node: TP8/DP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DCP4/DP2 Prefill node: TP8/PCP2/DCP4, Decode node: TP4/DCP2/DP4 Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DCP4/DP4 Prefill node: TP8/PCP2, Decode node: TP4/DCP2 GQA model: Prefill node: TP8/DP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DCP2/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DP4 Prefill node: TP16/DCP2/PCP1, Decode node: TP8/DCP2/DP2 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` - Co-author by: Daishixun dsxtsteven@sina.com --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 09:53:03 +08:00
fems14	2ef4d1979e	[bugfix][main]KV Pool for KV Transfer in PD Disaggregation Scenarios (#5398 ) ### What this PR does / why we need it? 1.KV Pool for KV Transfer in PD Disaggregation Scenarios Error Resolution 2.Update KV Pool Documentation ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-12-27 09:53:57 +08:00
ApsarasX	3d9954eff0	[Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205 ) ### What this PR does / why we need it? In code files such as`mooncake_connector.py`, `vllm_config.model_config.hf_config` is used to get the LLM configs. This approach works for LLMs, but not for multi-modal models. For multi-modal models, `vllm_config.model_config.hf_text_config` must be used instead to get the LLM configs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-12-22 20:21:45 +08:00
shaopeng-666	fd9a47c04d	fix vl pd smoke error (#5103 ) ### What this PR does / why we need it? Fix VL model mooncacke PD smoke test error ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-12-18 22:20:45 +08:00
zzhxxx	a74a1196c5	[Feat] Support MLP_TP feature, exclude MOE layer (#4999 ) #4257 This PR implements the dense_ffn TP of the first three layers of the deepseek model, I have refactored this PR and used very little code to support the implementation of this feature. This PR adds a function `is_moe_layer` to mlp_tp, which supports MLP TP in models with both mlp and moe, such as deepseek or chat GLM. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: 子潜 <ziqian@U-DMKXH32D-2015.local> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-18 20:06:53 +08:00
dsxsteven	97537709ae	[BugFix] Fix mooncake bug in PCP scenario (#5055 ) ### What this PR does / why we need it? The mooncake_connector.py file was importing the wrong arguments to the file, which could cause errors when use PCP; this issue has been corrected. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: daishixun <dsxsteven@sina.com>	2025-12-17 16:32:16 +08:00
weiguihua2	bf97048bce	[feat]pd disaggregated support cross-machine (#5008 ) ### What this PR does / why we need it? pd disaggregated support cross-machine. We send the primary and secondary node information of node p to node d. When node d pulls the KV data, it retrieves the corresponding primary or secondary node information from the mapping. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-17 09:28:03 +08:00
Chao Lei	9c02fa9867	[bugfix] Fix mooncake kvpool accuracy issue (#4976 ) ### What this PR does / why we need it? The current KVPool has a accuracy issue https://github.com/vllm-project/vllm-ascend/issues/4412. This PR aims to fix the precision problem without impacting prefill performance. Note：Due to a bug in ADXL, calling `current_event.synchronize()` may occasionally hang. This issue will be fixed in Cann version 8.5.rc1. You can manually build the master branch of the project at https://gitcode.com/cann/hixl to resolve this issue before the 8.5.RC1 release. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: LCAIZJ <leichao139636@163.com>	2025-12-16 11:33:16 +08:00
UnifiedCacheManager	195eac665b	[Core][Worker] Add UCMConnector for KV Cache Offloading (#4411 ) ### What this PR does / why we need it? This PR introduces the initial integration of UCM (Unified Cache Management) into the vllm-ascend distributed KV-cache system. Specifically, it adds: - A new `UCMConnector` implementation under the distributed KV-transfer framework. - Support for offloading KV-cache blocks to external UCM backends (DRAM / NFS / Localdisk), depending on UCM configuration). - Integration with vLLM V1 KV connector interface, including metadata handling and role registration. Why it is needed: - UCM provides a unified, high-performance storage layer for KV-cache externalization. - This enables vllm-ascend to support out-of-core KV-cache workloads, improve memory efficiency, and leverage hardware-accelerated storage paths (RDMA / NFS / hybrid modes). - This connector is a required component to allow future work on multi-node inference + UCM-based scaling. --- ### Does this PR introduce _any_ user-facing change? Yes, but limited: - A new `kv_connector=UCMConnector` option becomes available through the configuration interface. - When selected, vllm-ascend workers may initialize UCM and offload KV-cache blocks externally. - No default behaviors are changed. Users must explicitly enable this connector. This PR does not modify: - existing APIs, - default execution paths, - model runner behavior, - user workflow unless `UCMConnector` is configured. --- ### How was this patch tested? --- ### Prefix Caching Benchmark We provide preliminary measurements for TTFT (ms) under VLLM benchmark. Tests run on 2 * Ascend 910B3, vllm-ascend 0.11.0, Tensor Parallel size 2, with UCM (Localdisk) enabled. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2025-12-16 10:53:30 +08:00
fems14	b662d914a4	[bugfix] [main] Fix KV cache query inconsistency across different TP ranks in the KV Pool (#5030 ) ### What this PR does / why we need it? In the current KV Pool scenario for models like MLA and GQA, where different TP ranks generate identical KV caches, the system is designed to store only a single copy. The previous approach allowed each card to query storage requirements dynamically, but inconsistent query results across cards led to incorrect storage. To fix this, the new solution pre-allocates storage responsibilities; each card now simply stores its pre-assigned blocks, bypassing the inconsistent query step and ensuring data correctness. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-12-15 21:56:05 +08:00
baxingpiaochong	95e6400128	[KVPool]Fix PP get bug (#5007 ) ### What this PR does / why we need it? When kv caches are evicted from the key-value pool, it's possible that the kv cache for pp0 is still active, but the kv cache for pp1 has already been evicted. Therefore, a unified check is needed during the get operation. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-15 20:27:57 +08:00

1 2 3 4 5

211 Commits