xc-llm-ascend

Author	SHA1	Message	Date
wangxiaoteng888	b881fab416	[P/D][PCP] mooncake layerwise support pcp function (#6627 ) ### What this PR does / why we need it? mooncake layerwise support pcp function PCP (Prefill Context Parallelism) Support: Introduced explicit support for Prefill Context Parallelism (PCP) and Decode Context Parallelism (DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing for more granular control and awareness of parallel configurations during data transfer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-02-12 11:02:25 +08:00
liziyu	d252e4f5ec	[P/D] Using the cache load operator to replace the index select operator. (#6295 ) ### What this PR does / why we need it? Using the cache load operator to replace the index select operator. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-30 14:27:53 +08:00
zxr2333	14bd55f30c	[P/D][BugFix] Fix layerwise P/D request_id error (#6360 ) ### What this PR does / why we need it? Fix layerwise Connector P/D request_id error, due to vllm pr: https://github.com/vllm-project/vllm/pull/27987, which will add a random suffix to request_id in EngineCore. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-01-29 20:19:05 +08:00
yuxinshan	0bb1f91c2c	[Feature] Mooncake connector get remote ptp size (#5822 ) ### What this PR does / why we need it? To support elastic scaling when using the mooncake connector, different nodes must be configurable with different tp sizes. We therefore transfer prefill node information, such as the tp size, through the request's kv_transfer_params. The decode nodes get the prefill tp size from the request's kv_transfer_params instead of from the mooncake connector configuration. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2026-01-26 14:28:33 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604 This PR refactors vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, as has already been done in vLLM. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
zxr2333	78b554dda9	[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722 ) ### What this PR does / why we need it? Add new function to mooncake layerwise connector, including: 1. supports sparse attention, for DeepSeek-V3.2 2. Distribute transfer tasks to redundant kv_head cards This PR is related to [[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support](https://github.com/vllm-project/vllm-ascend/issues/4842) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-10 23:04:16 +08:00
wangxiaoteng888	aa987ffe87	[P/D][bugfix]Fix the PCP port mapping error issue (#5706 ) ### What this PR does / why we need it? Fix the PCP port mapping error. In a multi-node PD separation scenario with the PCP feature enabled, the ZMQ transmission port is wrong: the IP and port received by the D side do not match. The cause is an error in the port mapping update logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-10 22:43:52 +08:00
liziyu	330e25ab1d	[P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios (#5540 ) ### What this PR does / why we need it? [P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios 1. Session fusion: For transmission tasks at each layer, aggregate transmission tasks with the same destination and merge them into a single task for assignment. 2. Alltoall aggregation: For TP asymmetric scenarios, perform all alltoall operations at once according to the block granularity for all requests. [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support https://github.com/vllm-project/vllm-ascend/issues/4842 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-06 20:25:36 +08:00
Qiu	96775a27a8	[refactor](UT,PCP,DCP) refactor pcp&dcp patches in UTs (#5505 ) ### What this PR does / why we need it? Refactor PCP & DCP patches in UTs: Merge and reuse communication groups and communication function patches to reduce code duplication. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-05 09:05:45 +08:00
lidenghui1110	d462577504	[Recover] [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) (revert in #4981 ) (#5511 ) PR #4892 was reverted in #4981; this PR restores it. We will track down and fix the potential bug that breaks deepseek3.2 in the PD case. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-04 16:49:33 +08:00
zxr2333	46a1614387	[P/D] Improve the performance of Layerwise Connector (#5303 ) ### What this PR does / why we need it? Improve the performance of the Layerwise Connector, mainly through the following points: 1. Use event synchronization instead of stream synchronization. 2. Access the metaserver when scheduling. 3. Transfer the kvcache for each chunked prefill segment. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-12-31 15:09:01 +08:00
wangxiaochao6	a539ae753a	[feature] mooncake support pcp/dcp in common conditions (#5224 ) ### What this PR does / why we need it? 1. This PR supports complex pcp/dcp parallelism combinations across Prefill and Decode nodes in Mooncake, such as Prefill: TP8/PCP2/DCP8 and Decode: TP8/DCP4/DP2, which were not supported before. We establish the link mappings used to transfer KVCache between prefill and decode nodes. The main logic is in the function `_get_kv_split_metadata` in Mooncake_connector.py 2. After a decode rank pulls KVCache from a prefill rank, the decode rank sends `DONE_RECVING_MSG` to the prefill rank, and the prefill rank frees its KVCache blocks. If several decode ranks pull KVCache from the same prefill rank, which can happen with complex pcp/dcp parallelism, the prefill rank would free its KVCache blocks several times, which could cause memory issues. This PR solves the issue by counting how many times the prefill rank will be pulled and freeing its KVCache blocks only on the last pull. The related code is in the function `run_busy_loop` in Mooncake_connector.py 3. If no decode rank pulls KVCache from a prefill rank, the first rank in the decode node sends `DONE_RECVING_MSG` so that the prefill rank frees its blocks. The related code is in the function `_send_done_signal_to_free_remote_port` in Mooncake_connector.py ### How was this patch tested? This PR was tested with many pcp/dcp parallelism configurations, and accuracy was correct in all of them. MLA model: Prefill node: TP8/DP2, Decode node: TP8/DP2; Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DP2; Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DCP4/DP2; Prefill node: TP8/PCP2/DCP4, Decode node: TP4/DCP2/DP4; Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DCP4/DP4; Prefill node: TP8/PCP2, Decode node: TP4/DCP2. GQA model: Prefill node: TP8/DP2, Decode node: TP8/DP2; Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DP2; Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DCP2/DP2; Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DP4; Prefill node: TP16/DCP2/PCP1, Decode node: TP8/DCP2/DP2. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` - Co-author by: Daishixun dsxtsteven@sina.com --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 09:53:03 +08:00
ApsarasX	3d9954eff0	[Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205 ) ### What this PR does / why we need it? In code files such as`mooncake_connector.py`, `vllm_config.model_config.hf_config` is used to get the LLM configs. This approach works for LLMs, but not for multi-modal models. For multi-modal models, `vllm_config.model_config.hf_text_config` must be used instead to get the LLM configs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-12-22 20:21:45 +08:00
zxr2333	073a3a6e6c	[Doc][P/D] Fix MooncakeConnector's name (#5172 ) ### What this PR does / why we need it? vLLM community has integrated their MooncakeConnector. The original scripts will now find this MooncakeConnector instead of the one from vLLM-Ascend. All scripts that involve using the MooncakeConnector need to be modified to another name. ### Does this PR introduce _any_ user-facing change? Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector. ### How was this patch tested? By CI. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-12-18 22:29:19 +08:00
Yuzhou Tong	78602eab4f	[UT] Add mooncake ut test (#5080 ) ### What this PR does / why we need it? Add UT for mooncake - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com> Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-18 15:07:14 +08:00
dsxsteven	97537709ae	[BugFix] Fix mooncake bug in PCP scenario (#5055 ) ### What this PR does / why we need it? The mooncake_connector.py file was importing the wrong arguments, which could cause errors when using PCP; this has been corrected. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: daishixun <dsxsteven@sina.com>	2025-12-17 16:32:16 +08:00
wangxiyuan	5211e991ad	Revert "[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 )" (#4981 ) This reverts commit `332b547728`. It breaks deepseek3.2 in the PD case. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c`	2025-12-13 18:58:55 +08:00
lidenghui1110	332b547728	[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) ### What this PR does / why we need it? The current mooncake connector has the following problems when PP and MTP are enabled: 1. MTP layer kv caches are not transferred, which may decrease the accept ratio: this PR adds the MTP layer indices for the last PP stage after calculating end_layer in transfer_kv_cache 2. With MTP enabled, the default division of PP layers may cause imbalance between stages, so the `VLLM_PP_LAYER_PARTITION` environment variable is needed to balance them by hand, but in mooncake connector kv transfer the decode node doesn't know the partition of the prefill node: this PR adds the config `pp_layer_partition` in `kv_connector_extra_config` so that the decode node can acquire the partition information of the prefill node. ### Does this PR introduce _any_ user-facing change? When prefill uses the `VLLM_PP_LAYER_PARTITION` environment variable, add `pp_layer_partition` to `kv_connector_extra_config` as below: ``` export VLLM_PP_LAYER_PARTITION=33,28 "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 1, "tp_size": 8, "pp_size": 2, "pp_layer_partition": "33,28" }, "decode": { "dp_size": 16, "tp_size": 1, "pp_size": 1 } } ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2025-12-11 17:23:21 +08:00
wangxiyuan	f917d5edcf	Remove useless env (#4858 ) cleanup useless env. These envs are not used anymore `VLLM_ASCEND_TRACE_RECOMPILES`, `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE`, `VLLM_ASCEND_MLA_PA`, `PHYSICAL_DEVICES` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-11 06:51:07 +08:00
lidenghui1110	a82b0fa70e	mooncake connector support pipeline parallel & fix pp with flashcomm1 (#4054 ) ### What this PR does / why we need it? To support pipeline parallelism with PD disaggregation, this PR adds PP support to the mooncake connector and fixes bugs that appear when PP is enabled together with other optimization parameters. It includes the following changes: - the mooncake connector supports PP in prefill; decode PP is not currently supported - fix bugs when both PP and flashcomm1 are enabled - optimize ascend-scheduler to support a full batch in multiple pipeline stages: in the original implementation the batch sizes of all pipeline stages summed to max_num_seq, so the pipeline was not full; with this optimization every stage runs with a full batch_size = max_num_seq. The same changes will be contributed to the vllm scheduler too. ### Does this PR introduce _any_ user-facing change? Add `pp_size` to the mooncake connector kv_connector_extra_config: ``` "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 1, "tp_size": 4, "pp_size": 4 }, "decode": { "dp_size": 16, "tp_size": 1 } } ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: zss <zss@qq.com> Co-authored-by: zss <3265779424@qq.com>	2025-12-10 16:01:43 +08:00
wangxiaoteng888	a77045f355	[P/D][main]Offline the llmdatadist connector related parts of the code and files. (#4780 ) ### What this PR does / why we need it? As support for the mooncake connector is now available, the llmdatadist connector is no longer being maintained, so the llmdatadist-related files need to be retired. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-12-09 22:36:43 +08:00
liziyu	688b1332da	[P/D] check kv extra config and del hccl backend (#4547 ) ### What this PR does / why we need it? check kv extra config & del hccl backend - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-07 15:19:42 +08:00
LookAround0301	b32ef53b3b	[long_seq] remove long_seq env (#4660 ) ### What this PR does / why we need it? remove env VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL - vLLM version: v0.12.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 10:31:49 +08:00
wangxiyuan	7f2673ea2d	upgrade vLLM to main (#4608 ) 1. fix https://github.com/vllm-project/vllm/pull/28542 The model structure modifications that affect us are: - Qwen2.5-VL (some patches still exist) - Qwen2-VL - Qwen2 - DeepSeek series - Qwen-moe series 2. fix https://github.com/vllm-project/vllm/pull/29121 the output token type has changed from np to `list[list[int]]` 3. fix https://github.com/vllm-project/vllm/pull/29262 the `xformers` backend for multimodal has now been deprecated 4. fix https://github.com/vllm-project/vllm/pull/29342 5. fix https://github.com/vllm-project/vllm/pull/28579 6. fix https://github.com/vllm-project/vllm/pull/28718 7. fix https://github.com/vllm-project/vllm/issues/28665 8. fix https://github.com/vllm-project/vllm/pull/26847 vllm introduced the `optimization-level`, some default configs have been changed, and the param `--enforce-eager` has been deprecated 9. fix http://github.com/vllm-project/vllm/pull/29223 it returns a tuple for the sampler. 10. fix https://github.com/vllm-project/vllm/pull/29471 we'll remove the related patch to avoid this kind of error. Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-12-02 22:10:52 +08:00
fems14	5447a039b9	[Feature][main]reconstruction kvpool connector to ascend connector (#4438 ) ### What this PR does / why we need it? 1. In short, we renamed the existing MooncakeStoreConnector to AscendStoreConnector and extracted the storage engine interaction logic into a new Backend class. Associated RFC: https://github.com/vllm-project/vllm-ascend/issues/4329 2. Fixed the issue, introduced in vllm 0.11.2, where the number of input parameters for the connector was incorrect. ### Does this PR introduce _any_ user-facing change? MooncakeStoreConnector is renamed to AscendStoreConnector. ### How was this patch tested? - vLLM version: v0.11.2 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-11-28 18:08:37 +08:00
wangxiyuan	bc69d7cfe1	upgrade to vllm 0.11.2 (#4400 ) Bump vLLM version to v0.11.2 What's broken and changed by vLLM: 1. structured_output is broken by https://github.com/vllm-project/vllm/pull/26866 2. get_mrope_input_positions is broken by https://github.com/vllm-project/vllm/pull/28399 3. graph mode is broken by https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to 2.8 to fix the problem later 4. embedding is broken by https://github.com/vllm-project/vllm/pull/27583 5. `get_attn_backend_cls` and the attention backend are broken by https://github.com/vllm-project/vllm/pull/28534 6. spec decode is broken by https://github.com/vllm-project/vllm/pull/28771 7. sp feature is broken by https://github.com/vllm-project/vllm/pull/27126 8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922 9. lora is broken by https://github.com/vllm-project/vllm/pull/21068 10. execute_model is broken by https://github.com/vllm-project/vllm/pull/26866 11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by https://github.com/vllm-project/vllm/pull/28159 12. kv cache is broken by https://github.com/vllm-project/vllm/pull/27753 13. dp is broken by https://github.com/vllm-project/vllm/pull/25110 What's broken and changed by ourselves: 1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455 We'll remove model files in the future to avoid this kind of error 2. Engine core is broken by https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch file in the future. 3. Ascend scheduler is broken by https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend scheduler later. 4. qwen3-next is broken by https://github.com/vllm-project/vllm/pull/28083 We'll remove model files in the future to avoid this kind of error 5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764. We'll remove model files in the future Known issues: 1. ray doesn't work 2. the accuracy of qwen3-next is not correct 3. qwen3-vl is broken 4. prefix cache + ascend scheduler + deepseek v2 lite is broken. Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: shen-shanshan <467638484@qq.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-11-26 11:48:58 +08:00
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377 ) There is a lot of hack code for v0.11.0, which makes it hard to upgrade to newer vLLM versions. Since v0.11.0 will release soon, let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
wangxiaochao	3deeea14a0	[bugfix] bugfix for PD disaggregate (#4319 ) This PR is used to fix mooncake_connector in pcp/dcp case. When executing function update_done_task_count, it is necessary to ensure that both pcp/dcp and TP ranks have finished transferring KV cache. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2025-11-21 18:08:56 +08:00
wangxiaochao	0d04ad8c8f	[feature] Mooncake_connector support pcp/dcp (#4183 ) add feature for Mooncake_connector supporting pcp/dcp - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2025-11-18 10:17:48 +08:00
zhangsicheng5	a123f355e9	[feature] support pcp + mtp (in pd co-locate scenario) (#4098 ) 1. support pcp + mtp in pd co-locate scenario 2. llmdatadist connector pcp related bugfix and cleancode - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-11-12 17:22:21 +08:00
wangxiyuan	f811a24bf0	Remove VLLM_USE_V1 (#4086 ) Drop VLLM_USE_V1 usage. This env has been removed from vLLM already. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-11 15:43:39 +08:00
zxr2333	1d81a289d0	[P/D][BugFix]Fix proxy format processing errors & Layerwise connector performance optimization (#4043 ) ### What this PR does / why we need it? 1. Fix proxy format processing errors. 2. Layer-wise connector performance optimization. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-11-08 18:44:06 +08:00
zxr2333	b206e831e9	[P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3981 ) ### What this PR does / why we need it? Make kv-transfer env variable take effect and Fix load-balance proxy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-11-06 12:02:47 +08:00
wangxiaoteng888	2c291bc63f	[bugfix] layerwise D first plan (#3866 ) ### What this PR does / why we need it? Refactored the layerwise code to send to the D node first, preventing P-node hangs due to communication timeouts when DP > 1. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-10-30 22:20:34 +08:00
baxingpiaochong	d6ef3df3b3	[Bugfix]fix_mulit_connector_bug (#3332 ) ### What this PR does / why we need it? When using multi connector, the multi connector does not define get_finished_count, which will cause the kv cache to be released ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-10-29 23:23:06 +08:00
Icey	d9cdc65854	Upgrade to new vllm commit (#3719 ) ### What this PR does / why we need it? Upgrade to new vllm commit: `c9461e05a4` - Fix many imports, caused by https://github.com/vllm-project/vllm/pull/26908 - Fix import ```sha256```, caused by https://github.com/vllm-project/vllm/pull/27169 - Remove ```SchedulerConfig.send_delta_data```, caused by https://github.com/vllm-project/vllm/pull/27142 - Fix ```FusedMoE``` because of dual stream execution, caused by https://github.com/vllm-project/vllm/pull/26440 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-10-25 15:36:32 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is step 1 of refactoring the code to adapt to vllm main; this PR is aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. a batch of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import, caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` import, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` import, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured output break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
LookAround0301	b54d44e664	support cp&dcp (#3260 ) ### What this PR does / why we need it? This PR adds the Prefill Context Parallelism (PCP) feature, which corresponds to DCP. For specific implementation details, please refer to the RFC https://github.com/vllm-project/vllm/issues/25749. TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage. ### Does this PR introduce _any_ user-facing change? The current implementation primarily includes the following changes: Modified ModelRunner.py for CP partitioning logic for tokens; Modified attention_v1.py and mla_v1.py to adapt the GQA/MLA backend to PCP. Modified block_tables.py to extend the KV cache storage based on DCP&PCP; Added necessary command-line arguments to control parallelism for PCP; ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: chenjie <chenjie137@huawei.com> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com> Signed-off-by: Feng Liu <liufeng248@huawei.com> Signed-off-by: gaojc <1055866782@qq.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Signed-off-by: z50049692 <zhangmingwei11@huawei.com> Co-authored-by: chenjie <chenjie137@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: zhangsicheng5 <zhangsicheng5@huawei.com> Co-authored-by: Feng Liu <liufeng248@huawei.com> Co-authored-by: gaojc <1055866782@qq.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: z50049692 <zhangmingwei11@huawei.com> Co-authored-by: w00896881 <wangzixuan40@huawei.com>	2025-10-24 10:32:01 +08:00
liziyu	aeddf4261a	[Bugfix] fix delay free prefill req & D node support prefix cache (#3607 ) ### What this PR does / why we need it? Fix mooncake connector. In scenarios where TP is not equal, when the prefill TP size is less than the number of key-value heads, _get_remote_tp_ranks_for_req will return a list of np.arrays. Performing an operation like int in list of np.arrays will cause an error. Converting the list of np.arrays into a single np.array resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen235B P tp16, D tp1 P tp8, D tp1 P tp4, D tp1 P tp8, D tp2 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-10-23 20:39:14 +08:00
wangxiyuan	6ef62cb427	fix ut (#3608 ) Fix `test_torchair_deepseek_v2_decoder_layer` ut failure - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-22 11:30:12 +08:00
offline893	e916265b2b	[CI]Add EPLB CI. (#3568 ) ### What this PR does / why we need it? 1.Add eplb ci to check the change of eplb feature. 2.Add param checking of eplb params. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Qwen in A3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-21 22:58:02 +08:00
liziyu	3164cb663c	[Bugfix] mooncake connector support external dp & update readme (#3579 ) ### What this PR does / why we need it? mooncake connector support external dp & update readme ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-21 20:15:24 +08:00
zhangxinyuehfad	fdac146f71	[UT] fix skip ut test and enable ut test run normally (#3410 ) ### What this PR does / why we need it? fix skip ut test and enable ut test run normally ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-10-20 16:30:57 +08:00
zxr2333	c2c1db78a7	[Bugfix] fix ZeroDivisionError when prefill_tp_size > num_kv_head and fix tp_resharding README (#3437 ) ### What this PR does / why we need it? Fix the ZeroDivisionError raised when prefill_tp_size > num_kv_head. In this situation, num_head_replica can be 0 and is then used as a divisor; this PR restricts the minimum value of a to be 1. This PR also fixes the tp_resharding README. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-10-15 08:45:44 +08:00
wangxiaoteng888	ca05f7d632	[Bugfix] TP size larger than KV cache head causes accuracy issues (#3366 ) ### What this PR does / why we need it? Resolve the issue where, in the case of unequal TP (Tensor Parallelism), the TP size is larger than the number of model attention kvcache heads, causing the KV cache to generate duplicates, which leads to transmission errors in the original code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-10-11 11:22:23 +08:00
Chao Lei	a486ff8c11	KVCache Transfer via Layer-wise Strategy in Disaggregation (#2602 ) ### What this PR does / why we need it? See RFC: https://github.com/vllm-project/vllm-ascend/issues/2470 This PR adds a new KV connector for layer-wise KV transfer. ### Does this PR introduce _any_ user-facing change? Yes, a new KV connector is added. Users can now use the layer-wise feature. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: leichao.lc <leichao139636@163.com> Signed-off-by: CaveNightingale <2859066733@qq.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: hanxinlong <50882499@qq.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: CaveNightingale <2859066733@qq.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: hanxinlong <50882499@qq.com>	2025-09-30 15:10:29 +08:00
baxingpiaochong	eb205d9f35	[P/D][BugFix]Mooncake timeout release bug fix (#2899 ) ### What this PR does / why we need it? In the P node timeout release mechanism during PD separation, the req_id that requires timeout release is transmitted from the scheduler to the worker. If the KV cache between PDs is transferred too quickly, the P node's req_id may be released twice. The first release is when the D node notifies the P node that the KV cache has been pulled, and the second release is when the scheduler transmits the timeout release to the worker. To address this bug, an intermediate component is introduced to manage the release of req_ids. Pull kv and forward2 may occur one after the other in timing. The previous timeout defaulted to forward2 being before pull_kv. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-09-24 11:22:46 +08:00
zxr2333	0a27705917	fix mooncake connector adxl hostname usage (#2824 ) ### What this PR does / why we need it? This PR is used to adapt the hostname format for Mooncake when using adxl. When Mooncake uses adxl, it is necessary to set ```USE_ASCEND_DIRECT``` to True in the file ```/Mooncake/mooncake-common/common.cmake``` during compilation. The mooncake_connector obtains this config by calling ```vllm_config.kv_transfer_config.get_from_extra_config```, determines whether Mooncake is using adxl, and selects the corresponding hostname format. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: main - vLLM main: `d21a36f5f9` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-09-13 14:38:48 +08:00
Li Wang	22b425765a	[Bugfix] Fix broken CI (#2825 ) ### What this PR does / why we need it? 1. Add initial support for disabling TP, to integrate with [vllm-commit](https://github.com/vllm-project/vllm/pull/23024). 2. [vllm@commit](https://github.com/vllm-project/vllm/pull/23673) now uses `bytes` to store the `BlockHash` to reduce GC overhead; this PR adds the integration. - vLLM version: main - vLLM main: `e40827280b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-10 13:29:29 +08:00
Mengqing Cao	edf1f600ad	[CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840 ) ### What this PR does / why we need it? Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 ### Does this PR introduce _any_ user-facing change? branch main of vllm-ascend will not be compatible with vllm v0.10.1 and v0.10.1.1 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-10 08:43:10 +08:00

62 Commits