### What this PR does / why we need it?
Check the wildcard address for the layerwise connector.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Set proper 'text/event-stream; charset=utf-8' media type for streaming
requests instead of hardcoded 'application/json'
### What this PR does / why we need it?
This PR fixes an issue in the disaggregated prefill proxy server where
streaming requests (`"stream": true`) were always returned with a
hardcoded `Content-Type: application/json`, even when the backend vLLM
servers correctly returned Server-Sent Events (SSE) with `Content-Type:
text/event-stream; charset=utf-8`.
Specifically, the proxy used `StreamingResponse` with a fixed
`media_type` of `application/json`, which caused FastAPI to override the
response headers and break proper SSE semantics. As a result, clients
(e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not
reliably receive token-by-token streaming output.
In addition, this incorrect response type causes compatibility issues
with benchmarking and load-testing tools such as **EvalScope**. When
streaming is enabled, these tools expect SSE-formatted responses to
correctly parse token usage information. With the incorrect
`application/json` content type, EvalScope fails to parse the response
and reports errors similar to: `2025-12-15 09:27:56 - evalscope - ERROR: Failed to parse usage from response: list index out of range. Response: []`
This PR updates the proxy to:
- Detect whether the incoming request is a streaming request
(`stream=true`)
- Use `text/event-stream; charset=utf-8` for streaming responses
- Preserve `application/json` for non-streaming responses
This aligns the proxy behavior with native vLLM prefill/decoder servers
and the OpenAI-compatible streaming API contract.
Fixes incorrect streaming response headers that prevented proper
real-time token delivery.
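As a rough illustration of the approach (a minimal sketch, not the exact proxy code; the function and parameter names below are illustrative):

```python
from fastapi.responses import StreamingResponse

async def proxy_response(request_json: dict, upstream_body_iterator):
    # Pick the media type from the request: SSE for streaming, JSON otherwise.
    is_stream = bool(request_json.get("stream", False))
    media_type = (
        "text/event-stream; charset=utf-8" if is_stream else "application/json"
    )
    # FastAPI sets the Content-Type header from media_type, so streaming
    # clients now see proper SSE semantics instead of application/json.
    return StreamingResponse(upstream_body_iterator, media_type=media_type)
```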
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
This change was tested manually using a disaggregated prefill + decode setup with the proxy server.
### Test Steps
1. Start prefiller and decoder vLLM servers:
```bash
vllm serve --host 0.0.0.0 --port 8001 ...
vllm serve --host 0.0.0.0 --port 8002 ...
```
2. Start the proxy server:
```bash
python load_balance_proxy_server_example.py \
--host 127.0.0.1 --port 8000 \
--prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \
--decoder-hosts 127.0.0.1 --decoder-ports 8002
```
3. Send a streaming completion request through the proxy:
```bash
curl -i -X POST http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"prompt": "hello",
"max_tokens": 3,
"stream": true
}'
```
4. Verify the following:
- The response header is `Content-Type: text/event-stream; charset=utf-8`
- Tokens are streamed incrementally as SSE `data:` events
- Non-streaming requests still return `application/json`
No automated tests were added because this change affects an example
proxy
server and is limited to HTTP response headers. The behavior is directly
verifiable using standard SSE-compatible clients.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` uses a character class, so it actually ignores every identifier containing any one of the characters `U`, `E`, `4`, `M`, `3`, `|`, `u`, `e`, `m`:
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
This should be fixed to use grouping with alternation:
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
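The difference can be checked quickly with Python's `re` module (a small illustration; the typos tool uses its own regex engine, but the character-class vs. alternation semantics are the same):

```python
import re

# Character class: matches ANY single listed character, so unrelated
# identifiers such as "tensor" are ignored because they contain 'e'.
assert re.search(r".*[UE4M3|ue4m3].*", "tensor")

# Grouping with alternation: matches only the intended substrings.
assert re.search(r".*(UE4M3|ue4m3).*", "quant_UE4M3_scale")
assert not re.search(r".*(UE4M3|ue4m3).*", "tensor")
```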
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
The layerwise connector now supports the recompute scheduler.
NOTE:
Triggering recompute will invoke the tokenizer again, which may lead to
precision fluctuations.
[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache
Layerwise Push Support
https://github.com/vllm-project/vllm-ascend/issues/4842
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
The variables `self.prefiller_heap` and `self.decoder_heap` are used as `List[tuple[float, int, ServerState]]` but annotated as `List[tuple[int, int, ServerState]]`, which causes mypy to fail; see
https://github.com/vllm-project/vllm-ascend/actions/runs/21351411010/job/61448739554?pr=6265
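A minimal sketch of the corrected annotation (the `ServerState` stub and values here are illustrative only, not the proxy's real code):

```python
import heapq
from typing import List

class ServerState:  # placeholder for the proxy's real server-state class
    ...

# The heap entries are (load_score, index, server); the load score is a
# float, so the first tuple element must be annotated as float.
prefiller_heap: List[tuple[float, int, ServerState]] = []
heapq.heappush(prefiller_heap, (0.5, 0, ServerState()))
```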
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
d68209402d
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy is not processing requests.
But sometimes we need to **isolate** some faulty nodes while a large number of **requests** are coming in.
So we now support **isolating** faulty nodes by **lowering their priority** and **deleting** them once the proxy is no longer processing requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using the `/instances/remove` API to delete a node from the proxy server:
```bash
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* **(New)** When the proxy is processing requests, isolate the nodes and remove them once the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: yuxinshan <syx_ctyg@126.com>
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604
This PR refactors vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, as has already been done in vLLM.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lty <linhebiwen@gmail.com>
**[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool:**
https://github.com/vllm-project/vllm-ascend/issues/3380
### What this PR does / why we need it?
Support elastic scaling for P/D instances based on the Mooncake connector deployment.
**Support API routes**
* `/instances/add`: add prefill nodes or decode nodes to the list.
* `/instances/remove`: remove prefill nodes or decode nodes from the
list.
**Support functions**
* Support **adding** prefill nodes or decode nodes.
- If a prefill or decode server is deployed **after the proxy**, it can use the `/instances/add` API to join the proxy server. The prefill or decode server sends a signal to the proxy server, and the proxy server will check the status of the node until it is available.
* Support **removing** prefill nodes or decode nodes:
- Support using the `/instances/remove` API to **delete the node** from the proxy server.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`:
**Add 2 params**
When adding nodes to the proxy, the proxy will wait for the nodes to start, retrying up to a configured number of times.

| name | type | default | help |
| ----- | ---- | ---- | ---- |
| max-waiting-retries | int | 3 | Maximum number of retries when waiting for nodes to start |
| waiting-retry-interval | float | 10 | Check interval (seconds) when waiting for nodes to start |
For example:
```shell
python load_balance_proxy_server_example.py \
--host 0.0.0.0 --port 9000 \
--prefiller-hosts 127.0.0.1 127.0.0.1 \
--prefiller-ports 8100 8101 \
--decoder-hosts 127.0.0.1 127.0.0.1 \
--decoder-ports 8200 8201 \
--max-waiting-retries 3 \
--waiting-retry-interval 10
```
**Add 2 API routes**
* Add instances: `instances/add`
For example, add 2 prefiller instances:
```shell
curl -X POST http://localhost:9000/instances/add \
-H "Content-Type: application/json" \
-d '{
"type": "prefill",
"instances": ["127.0.0.1:8102", "127.0.0.1:8103"]
}'
```
Response:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
If the node '127.0.0.1:8103' has not been started:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* Remove instances: `instances/remove`
For example, remove 1 decoder instance:
```shell
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
Response:
```shell
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
Run the proxy and use the `/instances/add` API to add nodes and the `/instances/remove` API to remove nodes.
* vLLM version: v0.11.0.rc3
* vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0.rc3
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
### What this PR does / why we need it?
Pin the fastapi version to == 0.123.10 (< 0.124.0).
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
As support for the mooncake connector is now available, the llmdatadist
connector is no longer being maintained, so the llmdatadist-related
files need to be retired.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix a configuration error in our documentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
NA.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
1. Optimize the multi-node waiting logic
2. Remove the `tee` pipeline for logs, which can lead to a hang issue
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Currently, there are two code paths for determining the chip type: `get_ascend_soc_version` uses the `get_soc_version` API in torch_npu, while `is_310p` uses `_build_info.__soc_version__`, which is generated at install time. We need to unify these two paths.
We need to unify them based on the following points:
1. We need to ensure consistency of chip type judgment between the compiling and running states;
2. In the compiling state, we need the chip type to complete op compilation, but in the running state we only need the device type (910B/910_93/310P/910_95/etc.) to make code-branch decisions;
3. In the compiling state, torch_npu may not have been installed yet, so we can't use torch_npu's API.
Based on the above points, we have made the following changes:
1. When the user sets the env `SOC_VERSION`, use it; when not set, query the soc_version via `npu-smi`;
2. Generate the device_type from the soc_version when compiling, and write `__device_type__` instead of `__soc_version__` into `_build_info.py`;
3. In the running state, use `__device_type__` to select the code branch.
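A rough sketch of the compile-time step under these assumptions (the `soc_to_device` entries and the helper below are illustrative, not the actual `setup.py` contents):

```python
# Illustrative mapping; the real table lives in setup.py as `soc_to_device`.
soc_to_device = {"ASCEND910B1": "910B", "ASCEND310P3": "310P"}

def write_build_info(soc_version: str, path: str = "_build_info.py") -> None:
    # Map the raw soc_version to a coarse device type and persist it, so that
    # at runtime only __device_type__ is needed for branch decisions.
    if soc_version not in soc_to_device:
        raise ValueError(f"SOC_VERSION {soc_version!r} is not in soc_to_device")
    with open(path, "w") as f:
        f.write(f"__device_type__ = {soc_to_device[soc_version]!r}\n")
```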
### Does this PR introduce _any_ user-facing change?
When the env `SOC_VERSION` is not set, it will no longer default to `ASCEND910B1`; instead, the soc_version is queried via `npu-smi`. The env `SOC_VERSION` must be in the `soc_to_device` list in `setup.py`.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR adds a load-balance DP proxy server which can be used in an external DP scenario without Disaggregated Prefill enabled. In addition, it adds a doc for external DP and the load-balance DP proxy server.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
See the new doc.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Fix the proxy when the host IP is specified as a domain name.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
The PD proxy now supports IPv6; the Mooncake connector checks whether an IPv6 address is used and notifies the user.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Drop VLLM_USE_V1 usage. This env var has already been removed from vLLM.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Closes #3728, #3657.
The main branch is now aligned with the vllm `releases/v0.11.1` branch,
which no longer supports `Python 3.9`. Check it
[here](https://github.com/vllm-project/vllm/blob/releases/v0.11.1/pyproject.toml).
### Does this PR introduce _any_ user-facing change?
The newest version of vllm-ascend doesn't support Python 3.9.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Make the kv-transfer env variable take effect and fix the load-balance proxy.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version check logic.
TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base image for UT is still 8.2. We'll fix it later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Upgrade torch-npu to the official release version 2.7.1
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Cancel tokenization for layerwise_proxy.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
We noticed that users sometimes build vllm-ascend with an incorrect torch version. In this case, the build passes, but when running the code, the error `AttributeError: '_OpNamespace' '_C_ascend' object has no attribute 'weak_ref_tensor'` is raised. Let's pin the torch version to 2.7.1 and check the torch version when building from source to fix the issue.
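A minimal sketch of such a build-time guard (assumed placement in setup.py; not the exact implementation):

```python
import importlib.metadata

REQUIRED_TORCH = "2.7.1"

def check_torch_version() -> None:
    # Fail the source build early instead of surfacing an obscure
    # AttributeError on the custom ops at runtime.
    installed = importlib.metadata.version("torch")
    if not installed.startswith(REQUIRED_TORCH):
        raise RuntimeError(
            f"vllm-ascend requires torch=={REQUIRED_TORCH}, but found {installed}"
        )
```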
closes: #3342
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix a proxy decode bug when parsing non-UTF-8 characters.
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
### What this PR does / why we need it?
It's a tiny bugfix in the `gen_ranktable.py` script. The script is a utility that helps set up an example case; it is used to prepare a ranktable before a disaggregated prefill deployment.
Elements of the `local_device_ids` list should be cast to `int` before being used in a modulo (MOD) operation.
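A minimal sketch of the fix (values are illustrative; the surrounding `gen_ranktable.py` logic is omitted):

```python
# Device IDs typically arrive as strings (e.g. parsed from CLI or env),
# so cast them to int before the modulo operation.
local_device_ids = ["4", "5", "6", "7"]
local_ranks = [int(device_id) % 8 for device_id in local_device_ids]
```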
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
Modify the recomputation logic to prevent waiting requests from filling up the D-node KV cache.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
In certain scenarios, synchronously loading data from the pool performs better than asynchronously loading it. Therefore, a control switch for asynchronous loading from the pool has been added.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
This PR aims to fix the recompute out-of-memory bug in the decode instance. When recompute happens in decode, KV cache usage may exceed the pre-allocated memory and cause OOM.
So we propose a new scheduling strategy: when the decode instance cannot allocate a new block for running requests, we stop the requests that would be preempted. These stopped requests are recognized by the proxy, sent to the prefill instance again to recompute the KV cache, and then directed back to the decode instance.
This is a temporary plan to fix the bug. The long-term strategy is to use CPU offload in the decode instance.
### Does this PR introduce _any_ user-facing change?
An extra Ascend configuration option **recompute_scheduler_enable = True** is added to enable this strategy. The default value is `False`.
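For illustration, assuming the option is passed through vLLM's `--additional-config` (the exact wiring may differ):

```bash
# Hypothetical decode-instance launch enabling the recompute scheduler.
vllm serve <model> \
  --additional-config '{"recompute_scheduler_enable": true}'
```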
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
### What this PR does / why we need it?
Refactor the multi-machine CI use cases. The purpose of this PR is to make it easier to add multi-machine CI use cases, allowing developers to add multi-machine cluster model testing cases (including PD separation) by simply adding a new YAML configuration file.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
A new kv_role "kv_both" is added to run mixed deployment scenarios. The
mixed deployment will involve a decode phase, where with_prefill should
be false.
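For illustration, a mixed-deployment instance might be launched with `kv_role` set to `kv_both` (the connector name below is a placeholder; the other fields depend on the actual connector):

```bash
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "<YourConnector>", "kv_role": "kv_both"}'
```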
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
c60e6137f0
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Remove the chunked prefill branch in MLA, and change the dtype of `prefill_mask` to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable KV cache reuse in scenarios with system prompts or multi-turn dialogues.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5963b98b46
---------
Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
The LLMDataDist connector adapts the distributed KV aggregation for the main branch. The P node is changed from returning "finish sending" only when TP0 responds to returning "finish sending" as soon as each NPU receives it. The D node will send a "finish receive" signal to the corresponding TP rank of the P node.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
GSM8K test on 2*A3 with 1P 1D:
- P: dp2 tp8, D: dp4 tp4
- P: dp2 tp8, D: dp2 tp8
- vLLM version: main
- vLLM main:
cc99baf14d
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR adds the prefiller & decoder disaggregation deployment guide.
The scenario of the guide is:
- 3 nodes in total, with 2 NPUs on each node
- Qwen3-30B-A3B
- 1P2D
- Expert Parallel
The deployment can be used to verify PD Disaggregation / Expert Parallel features with relatively few resources.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
The details have been clarified in this issue:
https://github.com/vllm-project/vllm-ascend/issues/2557
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Easy to test because we just need to echo the variable.
- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6
---------
Signed-off-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: LCAIZJ <leichao139636@163.com>
### What this PR does / why we need it?
Resolve the issue of waiting queue accumulation when requests are
canceled.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.10.1.1
- vLLM main:
006477e60b
---------
Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR adopts the Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation.
### Does this PR introduce any user-facing change?
No
### Dependencies
1. CANN Dependencies
Using the Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see details in Mooncake [#502](https://github.com/kvcache-ai/Mooncake/pull/502)).
2. vllm-ascend
This PR depends on changes introduced by #950 (modifications to
`model_runner_v1`) and #1361 (updates to `schedule`), both of which have
been merged into the `v0.9.1-dev` branch and are expected to land in
`main` shortly.
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
1c859a1387
---------
Signed-off-by: leichao.lc <leichao139636@163.com>
Co-authored-by: jianzs <zheng.shoujian@outlook.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: chris668899 <15105191595@126.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
We recently added the disaggregated_prefill and ascend_forward_context features in
ba3dfbd59e
and
df0ec55162.
This PR fixes some nits introduced by them to make the code clearer.
1. Drop `current_platform` usage; it can lead to obscure circular import errors in some cases.
2. Update the `set_ascend_forward_context` function to make the logic clearer, for example by removing V0 support from this function.
3. Remove the unused `self.local_rank_across_dp` in the worker.
4. Remove `soc_info.py` and use `get_ascend_soc_version` instead.
- vLLM version: v0.10.0
- vLLM main:
02f82fe438
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- On the D node, the `max-num-batched-tokens` parameter can be set to a smaller value since the D node processes at most `max-num-seqs` batches concurrently. As the profile_run only needs to handle `max-num-seqs` sequences at a time, we can safely set `max-num-batched-tokens` equal to `max-num-seqs`. This optimization helps reduce activation memory consumption (see the example below).
- Restore the default configuration items for PD separation.
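For illustration only, a decode-node launch following this guidance might look like the following (model and values are placeholders, not recommended settings):

```bash
vllm serve <model> \
  --max-num-seqs 256 \
  --max-num-batched-tokens 256
```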
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
61dcc280fa
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
This PR is a cherry-pick from v0.9.1:
https://github.com/vllm-project/vllm-ascend/pull/1953
It introduces a new load-balance proxy server example implementation for disaggregated PD, which supports a simple token- and kv_cache-aware load-balancing routing strategy for the disaggregated PD system, compared with the original round-robin toy_proxy.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested on real workload and unittest
- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Currently our workflow takes about 3 hours in total, which seriously affects the developer experience, so optimization is urgent. After this PR, the full CI run time is expected to be shortened to 1h40min.
- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
a2480251ec
---------
Signed-off-by: wangli <wangli858794774@gmail.com>