xc-llm-ascend

Author	SHA1	Message	Date
lidenghui1110	a82b0fa70e	mooncake connector support pipeline parallel & fix pp with flashcomm1 (#4054) ### What this PR does / why we need it? To support pipeline parallelism (PP) with PD disaggregation, this PR adds PP support to the mooncake connector and fixes other bugs that occur when PP is enabled together with other optimization parameters. It includes the following changes: - the mooncake connector supports PP in prefill; decode PP is not currently supported - fix bugs when both PP and flashcomm1 are enabled - optimize ascend-scheduler to support a full batch in multiple pipeline stages. In the original implementation, the batch sizes of all pipeline stages summed to max_num_seq in total, so the pipeline was not full; with this optimization, every stage runs with a full batch_size = max_num_seq. The same changes will be contributed to the vllm scheduler too. ### Does this PR introduce _any_ user-facing change? Adds `pp_size` to the mooncake connector kv_connector_extra_config: ``` "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 1, "tp_size": 4, "pp_size": 4 }, "decode": { "dp_size": 16, "tp_size": 1 } } ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: zss <zss@qq.com> Co-authored-by: zss <3265779424@qq.com>	2025-12-10 16:01:43 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814) aclgraph is stable and fast now, so let's drop torchair graph mode. TODO: some logic that adapts to torchair should be cleaned up as well; we'll do it in a following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
liziyu	688b1332da	[P/D] check kv extra config and del hccl backend (#4547) ### What this PR does / why we need it? Check the kv extra config and delete the hccl backend. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-07 15:19:42 +08:00
wangxiaoteng888	41fbc5ebc9	[P/D][main] Clean connector history information (#4650) ### What this PR does / why we need it? Clean connector history information when the node restarts. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 16:22:23 +08:00
LookAround0301	b32ef53b3b	[long_seq] remove long_seq env (#4660) ### What this PR does / why we need it? Remove the env variable VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL. - vLLM version: v0.12.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 10:31:49 +08:00
wangxiyuan	7f2673ea2d	upgrade vLLM to main (#4608) 1. fix https://github.com/vllm-project/vllm/pull/28542 The model structures we modified are: - Qwen2.5-VL (some patches still exist) - Qwen2-VL - Qwen2 - DeepSeek series - Qwen-moe series 2. fix https://github.com/vllm-project/vllm/pull/29121 the output token type changed from np to `list[list[int]]` 3. fix https://github.com/vllm-project/vllm/pull/29262 the `xformers` backend for multimodal has been deprecated 4. fix https://github.com/vllm-project/vllm/pull/29342 5. fix https://github.com/vllm-project/vllm/pull/28579 6. fix https://github.com/vllm-project/vllm/pull/28718 7. fix https://github.com/vllm-project/vllm/issues/28665 8. fix https://github.com/vllm-project/vllm/pull/26847 vllm introduced `optimization-level`, some default configs have changed, and the param `--enforce-eager` has been deprecated 9. fix http://github.com/vllm-project/vllm/pull/29223 it returns a tuple for the sampler. 10. fix https://github.com/vllm-project/vllm/pull/29471 we'll remove the related patch to avoid this kind of error. Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-12-02 22:10:52 +08:00
fems14	5447a039b9	[Feature][main]reconstruction kvpool connector to ascend connector (#4438) ### What this PR does / why we need it? 1. In short, we renamed the existing MooncakeStoreConnector to AscendStoreConnector and extracted the storage engine interaction logic into a new Backend class. Associated RFC: https://github.com/vllm-project/vllm-ascend/issues/4329 2. Fixed the incorrect number of input parameters for the connector, an issue introduced in vllm 0.11.2. ### Does this PR introduce _any_ user-facing change? MooncakeStoreConnector is renamed to AscendStoreConnector. ### How was this patch tested? - vLLM version: v0.11.2 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-11-28 18:08:37 +08:00
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377) There is a lot of hack code for v0.11.0, which makes the code hard to upgrade to a newer vLLM version. Since v0.11.0 will be released soon, let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
wangxiaochao	3deeea14a0	[bugfix] bugfix for PD disaggregate (#4319) This PR fixes mooncake_connector in the pcp/dcp case. When executing the function update_done_task_count, it is necessary to ensure that both the pcp/dcp and TP ranks have finished transferring the KV cache. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2025-11-21 18:08:56 +08:00
wangxiaochao	0d04ad8c8f	[feature] Mooncake_connector support pcp/dcp (#4183) Add pcp/dcp support to Mooncake_connector. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2025-11-18 10:17:48 +08:00
zxr2333	b206e831e9	[P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3981) ### What this PR does / why we need it? Make the kv-transfer env variable take effect and fix the load-balance proxy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-11-06 12:02:47 +08:00
zxr2333	15bb5098ad	[PD Disaggregation]Set adxl engine as default backend and update README (#3761) ### What this PR does / why we need it? Set the adxl engine as the default Mooncake backend, because Ascend Transport is no longer maintained. Update the README to include instructions for installing Mooncake with the adxl backend. ### Does this PR introduce _any_ user-facing change? Users need to compile and install Mooncake with the adxl backend according to the revised README instructions. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-11-04 16:06:39 +08:00
baxingpiaochong	d6ef3df3b3	[Bugfix]fix_mulit_connector_bug (#3332) ### What this PR does / why we need it? The multi connector does not define get_finished_count, so when the multi connector is used, the KV cache is released. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-10-29 23:23:06 +08:00
Icey	d9cdc65854	Upgrade to new vllm commit (#3719) ### What this PR does / why we need it? Upgrade to new vllm commit: `c9461e05a4` - Fix many imports, caused by https://github.com/vllm-project/vllm/pull/26908 - Fix import `sha256`, caused by https://github.com/vllm-project/vllm/pull/27169 - Remove `SchedulerConfig.send_delta_data`, caused by https://github.com/vllm-project/vllm/pull/27142 - Fix `FusedMoE` because of dual stream execution, caused by https://github.com/vllm-project/vllm/pull/26440 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-10-25 15:36:32 +08:00
liziyu	aeddf4261a	[Bugfix] fix delay free prefill req & D node support prefix cache (#3607) ### What this PR does / why we need it? Fix the mooncake connector. In scenarios where the TP sizes are not equal and the prefill TP size is less than the number of key-value heads, _get_remote_tp_ranks_for_req returns a list of np.arrays. Performing a membership test such as `int in <list of np.arrays>` raises an error. Converting the list of np.arrays into a single np.array resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen235B with: P tp16, D tp1; P tp8, D tp1; P tp4, D tp1; P tp8, D tp2 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-10-23 20:39:14 +08:00
Chao Lei	11f9bccf6b	Mooncake store use adxl interface (#3350) Use the adxl interface in mooncake store; see mooncake PR https://github.com/kvcache-ai/Mooncake/pull/929 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: LCAIZJ <leichao139636@163.com>	2025-10-21 20:18:17 +08:00
liziyu	3164cb663c	[Bugfix] mooncake connector support external dp & update readme (#3579) ### What this PR does / why we need it? The mooncake connector now supports external DP; the README is updated accordingly. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-21 20:15:24 +08:00
zouyida2052	58a37ce189	bugfix for mooncake (#3535) ### What this PR does / why we need it? Bugfix for mooncake: remove a useless condition check. ### How was this patch tested? By CI. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-19 17:06:05 +08:00
Mengqing Cao	8abe517870	[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432) ### What this PR does / why we need it? Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch. The final goal is to remove all the patches and align the code architecture with vllm, so we need to do the following work in upcoming PRs. TODO: - [x] remove patch on attention spec - [ ] refactor the kvcache creation logic ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? 1. CI passed with existing tests. 2. Tests pass with deepseek-v3.2-exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-15 17:48:58 +08:00
lidenghui1110	0563106477	[Feature] mooncake connector support GQA transport (#2947) ### What this PR does / why we need it? The previous implementation of the Mooncake connector only supported scenarios where the Tensor Parallel sizes for the Prefill and Decode phases were the same for MLA and GQA/MHA. For heterogeneous TP scenarios, a single rank on a decode node needs to pull the KV cache from multiple ranks on the prefill nodes and then merge them (only prefill TP >= decode TP is supported for now). During this merge, a transpose operation is required because the layouts of the KV caches are different. To minimize transpose overhead, we use the npu_paged_cache_load operation to extract the blocks corresponding to the request from the KV cache. After performing the transpose, we use _npu_reshape_and_cache to write the blocks back to their original positions. This process is illustrated in the diagram below, where b means block_size; the diagram illustrates the KV cache layout transpose for one block. In the implementation, we transpose the KV cache layer by layer for one request. <img width="1464" height="916" alt="image" src="https://github.com/user-attachments/assets/09d96a98-e41c-4733-9535-05544163081a" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested - vLLM version: v0.11.0 --------- Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: zzy-ContiLearn <1831242919@qq.com> Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Co-authored-by: zzy-ContiLearn <1831242919@qq.com> Co-authored-by: chenxiao <cx02308786@antgroup.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: zzhx1 <zzh_201018@outlook.com>	2025-10-13 15:48:37 +08:00
Chao Lei	a486ff8c11	KVCache Transfer via Layer-wise Strategy in Disaggregation (#2602) ### What this PR does / why we need it? See RFC: https://github.com/vllm-project/vllm-ascend/issues/2470 This PR adds a new KV connector for layer-wise KV transfer. ### Does this PR introduce _any_ user-facing change? Yes, a new KV connector is added. Users can now use the layer-wise feature. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: leichao.lc <leichao139636@163.com> Signed-off-by: CaveNightingale <2859066733@qq.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: hanxinlong <50882499@qq.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: CaveNightingale <2859066733@qq.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: hanxinlong <50882499@qq.com>	2025-09-30 15:10:29 +08:00
wangxiyuan	81bd6e4c99	Add DeepSeek V3.2 support (#3270) ### What this PR does / why we need it? This PR adds initial DeepSeek V3.2 support with [vLLM v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0) (not released yet). We will complete the vLLM adaptation as soon as possible. This feature will be ready in the next 1-2 days. Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 . ### Does this PR introduce _any_ user-facing change? Yes! ### How was this patch tested? CI passed; we will run through the deepseek doc soon. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wxsIcey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-30 03:25:58 +08:00
baxingpiaochong	eb205d9f35	[P/D][BugFix]Mooncake timeout release bug fix (#2899) ### What this PR does / why we need it? In the P node timeout release mechanism used in PD separation, the req_id that requires timeout release is transmitted from the scheduler to the worker. If the KV cache is transferred between P and D too quickly, the P node's req_id may be released twice: first when the D node notifies the P node that the KV cache has been pulled, and again when the scheduler transmits the timeout release to the worker. To address this bug, an intermediate component is introduced to manage the release of req_ids. pull_kv and forward2 may occur in either order, whereas the previous timeout logic assumed that forward2 always happens before pull_kv. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-09-24 11:22:46 +08:00
hucong	8dd53c8860	[Bugfix][PD] Auto-clear producer KV cache if no pull notification (#2174) ### What this PR does / why we need it? This PR addresses a critical issue where failures of Node D (Decode) cause Node P (Prefill) to hang because it cannot release its KV cache. Trigger Scenarios: 1. Node D fails mid-inference (e.g., network disconnection) 2. Node D rejects requests at a certain stage (e.g., via API server) 3. Load-test script termination causes Node P or D to abort queued requests Root Cause Analysis: 1. Currently, Node D sends a "KV cache pull complete, release approved" message to Node P 2. This message is transmitted via the worker connector. If the PD connection breaks or requests are rejected upstream, Node D cannot send the message 3. Node P will never release the KV cache without receiving this message Solution: Following the vLLM community's approach (NIXL connector timeout mechanism), we're implementing: - A timeout mechanism with comprehensive warnings - Updated README documentation - Reference: vLLM's optimization PR [#20139](https://github.com/vllm-project/vllm/pull/20139) ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? None - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: underfituu <hzhucong@163.com>	2025-09-23 09:53:34 +08:00
zxr2333	0a27705917	fix mooncake connector adxl hostname usage (#2824) ### What this PR does / why we need it? This PR adapts the hostname format for Mooncake when using adxl. When Mooncake uses adxl, it is necessary to set `USE_ASCEND_DIRECT` to True in the file `/Mooncake/mooncake-common/common.cmake` during compilation. The mooncake_connector obtains this config by calling `vllm_config.kv_transfer_config.get_from_extra_config`, determines whether Mooncake is using adxl, and selects the corresponding hostname format. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: main - vLLM main: `d21a36f5f9` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-09-13 14:38:48 +08:00
baxingpiaochong	df88a2ecc8	[P/D]mooncake_connector adapted to 0.10.1 (#2664) ### What this PR does / why we need it? In vllm version 0.10.1, a new KVOutputAggregator was added to the executor, moving aggregation to the executor (https://github.com/vllm-project/vllm/pull/19555). This caused mooncake_connector to break. This change fixes that bug and also adds a policy to forcibly release the KV cache when the prefill node times out. This PR is currently linked to a PR in vllm (https://github.com/vllm-project/vllm/pull/23917), which aims to modify the finish and send count confirmation in heterogeneous TP situations. Many UTs are deleted because a lot of communication code has been deleted, so the UTs as a whole are more concise. - vLLM version: v0.10.1.1 - vLLM main: `fa4311d85f` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-09-04 08:22:10 +08:00
zhiyuanzhang	07d44ade19	bugfix: fix initialization error for mooncake in k8s (#2541) ### What this PR does / why we need it? The details are explained in this issue: https://github.com/vllm-project/vllm-ascend/issues/2557 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Easy to test because we just need to echo the variable. - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: zzy-ContiLearn <1831242919@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LCAIZJ <leichao139636@163.com>	2025-09-03 22:25:08 +08:00
Chao Lei	03ca2b26ca	[P/D] Mooncake Connector for v1 distributed (#1568) ### What this PR does / why we need it? This PR adopts Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation. ### Does this PR introduce any user-facing change? No ### Dependencies 1. CANN dependencies: using Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see details in Mooncake [#502](https://github.com/kvcache-ai/Mooncake/pull/502)). 2. vllm-ascend: this PR depends on changes introduced by #950 (modifications to `model_runner_v1`) and #1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly. ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `1c859a1387` --------- Signed-off-by: leichao.lc <leichao139636@163.com> Co-authored-by: jianzs <zheng.shoujian@outlook.com> Co-authored-by: zzy-ContiLearn <1831242919@qq.com> Co-authored-by: fems14 <1804143737@qq.com> Co-authored-by: Dreamerleader <2270923832@qq.com> Co-authored-by: chris668899 <15105191595@126.com> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>	2025-08-18 14:30:07 +08:00

28 Commits
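
The `kv_connector_extra_config` fragment quoted in commit a82b0fa70e (#4054) can be sanity-checked with a few lines of Python. This is only a sketch under stated assumptions, not the connector's actual validation code: `check_extra_config` is a hypothetical helper, treating a missing `pp_size` as 1 is an assumption, and the one rule taken from the commit message is that decode PP is not currently supported.

```python
import json

# Config fragment copied from the #4054 commit message, wrapped in braces
# so that it parses as a standalone JSON object.
RAW = """
{
  "kv_connector_extra_config": {
    "use_ascend_direct": true,
    "prefill": {"dp_size": 1, "tp_size": 4, "pp_size": 4},
    "decode": {"dp_size": 16, "tp_size": 1}
  }
}
"""


def check_extra_config(extra: dict) -> dict:
    """Hypothetical validator: return each role's sizes with pp_size filled in."""
    resolved = {}
    for role in ("prefill", "decode"):
        sizes = dict(extra[role])
        # Assumption: a missing pp_size means pipeline parallelism is off.
        sizes.setdefault("pp_size", 1)
        for key in ("dp_size", "tp_size", "pp_size"):
            if not isinstance(sizes[key], int) or sizes[key] < 1:
                raise ValueError(f"{role}.{key} must be a positive integer")
        resolved[role] = sizes
    # Per the commit message, PP is supported in prefill only.
    if resolved["decode"]["pp_size"] != 1:
        raise ValueError("decode pp_size > 1 is not supported")
    return resolved


extra_config = json.loads(RAW)["kv_connector_extra_config"]
resolved = check_extra_config(extra_config)
print(resolved["prefill"]["pp_size"], resolved["decode"]["pp_size"])
```

Rejecting a decode `pp_size` above 1 up front, rather than ignoring it, is a design choice of this sketch; the real connector may handle that case differently.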