xc-llm-ascend

Author	SHA1	Message	Date
G.O.D	27d038dc66	fix doc typo (#2407) fix doc typo - vLLM version: v0.10.0 - vLLM main: `5f5664b3e4` --------- Signed-off-by: felix01.yu <felix01.yu@vipshop.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-08-19 09:10:01 +08:00
Chao Lei	03ca2b26ca	[P/D] Mooncake Connector for v1 distributed (#1568) ### What this PR does / why we need it? This PR adopts the Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation. ### Does this PR introduce any user-facing change? No ### Dependencies 1. CANN dependencies: using the Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (for details, see Mooncake [#502](https://github.com/kvcache-ai/Mooncake/pull/502)). 2. vllm-ascend: this PR depends on changes introduced by #950 (modifications to `model_runner_v1`) and #1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly. ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `1c859a1387` --------- Signed-off-by: leichao.lc <leichao139636@163.com> Co-authored-by: jianzs <zheng.shoujian@outlook.com> Co-authored-by: zzy-ContiLearn <1831242919@qq.com> Co-authored-by: fems14 <1804143737@qq.com> Co-authored-by: Dreamerleader <2270923832@qq.com> Co-authored-by: chris668899 <15105191595@126.com> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>	2025-08-18 14:30:07 +08:00
wangxiyuan	36e450eb0f	[Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097) We recently added the disaggregated_prefill and ascend_forward_context features in `ba3dfbd59e` and `df0ec55162`. This PR fixes some nits they introduced to make the code clearer. 1. Drop `current_platform` usage, which can lead to unexpected circular import errors in some cases. 2. Update the `set_ascend_forward_context` function to make the logic clearer, for example by removing V0 support from this function. 3. Remove the unused `self.local_rank_across_dp` in the worker. 4. Remove `soc_info.py` and use `get_ascend_soc_version` instead. - vLLM version: v0.10.0 - vLLM main: `02f82fe438` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 08:39:02 +08:00
hucong	e38fab011d	[Doc][PD] Restore the default configuration items in examples/disaggregate_prefill_v1/README.md (#2165) ### What this PR does / why we need it? - On the D node, the max-num-batched-tokens parameter can be set to a smaller value, since the D node processes at most max-num-seqs batches concurrently. Because profile_run only needs to handle max-num-seqs sequences at a time, we can safely set max-num-batched-tokens equal to max-num-seqs. This reduces activation memory consumption. - Restore the default configuration items for PD separation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `61dcc280fa` Signed-off-by: underfituu <hzhucong@163.com>	2025-08-04 20:30:53 +08:00
Pleaplusone	4b3a210c33	Implementation of simple load balance routing proxy server (#1953) (#2124) ### What this PR does / why we need it? This PR is a cherry-pick from v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1953 It introduces a new example implementation of a load-balancing proxy server for disaggregated PD, which supports a simple token- and kv_cache-aware load-balancing routing strategy for the disaggregated PD system, in contrast to the original round-robin toy_proxy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on real workloads and with unit tests - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-04 10:35:53 +08:00
Li Wang	f60bb474f9	[CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065) ### What this PR does / why we need it? Currently our workflow takes about 3 hours in total to run, which seriously affects the developer experience, so optimization is urgently needed. After this PR, the running time of the full CI is expected to drop to 1h40min. - Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB) - Change TP4 ---> TP2 * 2 max-parallel - Move DeepSeek-V2-Lite-W8A8 to the single-card test ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `a2480251ec` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-29 18:59:05 +08:00
Pleaplusone	df0ec55162	Disaggregate prefill for kv cache register style (#950) ### What this PR does / why we need it? This PR adopts `LLMDataDist` for KV cache registration and a `pull_blocks`-style disaggregated prefill implementation. The interface implementation mainly follows the design of the NIXL PR https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953 . This PR can be tested with the following steps: - Generate the rank table for all machines. - Execute `toy_proxy.py` to launch the disaggregated prefill proxy server, specifying the prefill IP and port and the decode IP and port. - Run the prefill server and the decode server. - Send the request to the disaggregated prefill proxy. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Signed-off-by: liziyu179 <3475441767@qq.com> Signed-off-by: underfitc <hucong24@huawei.com> Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Co-authored-by: liziyu179 <3475441767@qq.com> Co-authored-by: underfitc <hucong24@huawei.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-07-26 17:15:47 +08:00

7 Commits