9 Commits

Author SHA1 Message Date
zhangxinyuehfad
87c0cfafa3 [0.11.0][Bugfix] fix fastapi version (#5048)
### What this PR does / why we need it?
fix fastapi version <0.124.0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-15 23:51:38 +08:00
liziyu
ddf3e75800 [Cherry-pick] [0.11.0] pd proxy support ipv6 and fix proxy (#4242)
### What this PR does / why we need it?
pd proxy support ipv6, mooncake connector check whether the IPv6 address
is used and notify the user.

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-18 16:33:00 +08:00
Shirley125
e48ca0b6ec [bugfix][0.11]fix proxy decode bug (#3751)
### What this PR does / why we need it?
fix proxy decode bug while parsing non-UTF-8 characters.

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
2025-10-27 16:56:50 +08:00
hucong
12bc78d252 [v0.11.0][BugFix][P/D] Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache (#3686)
### What this PR does / why we need it?
Modify the recalculation logic to prevent waiting requests from filling
up the D node KVCache

Signed-off-by: underfituu <hzhucong@163.com>
2025-10-25 09:15:42 +08:00
Shirley125
b4233a2ec3 [Bugfix] Route requests requiring KVC recomputation from the decode instance to the P instance (#3448)
### What this PR does / why we need it?
This PR is aimed to fix the recomputing out of memory bug in decode
instance. When recomputing happens in decode, kv cache usage may exceed
the pre-allocated memory, and it will cause OOM.

So we propose a new scheduling strategy, when decode instance cannot
allocate new block for running requests, we will stop the request that
will be preempted. These stopped request will be recognied by proxy, and
they will be send to prefill instance again to calculate kvc and then
direct to decode instance.

This is a temporary plan to fix the bug. The long-term stratege is to
use CPU offload in decode instance.

### Does this PR introduce _any_ user-facing change?
An extra ascend configuration option **-- recompute_scheduler_enable =
True** is added to enable this strategy. The default value is False
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
2025-10-18 15:56:44 +08:00
Chao Lei
a486ff8c11 KVCache Transfer via Layer-wise Strategy in Disaggregation (#2602)
### What this PR does / why we need it?
See RFC: https://github.com/vllm-project/vllm-ascend/issues/2470 This PR
add a new kv connector for layer-wised kv transfer

### Does this PR introduce _any_ user-facing change?
yes, a new kv connector is added. User can use layer wised feature now.
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: leichao.lc <leichao139636@163.com>
Signed-off-by: CaveNightingale <2859066733@qq.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: hanxinlong <50882499@qq.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: CaveNightingale <2859066733@qq.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: hanxinlong <50882499@qq.com>
2025-09-30 15:10:29 +08:00
liziyu
aa3c4563ce fix all cards super_pod_id same on A3 & proxy support min_tokens (#2939)
### What this PR does / why we need it?
fix all cards super_pod_id same on A3 & proxy support min_tokens
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
2*A3 gen ranktable
before:
"prefill_device_list": [
        {
            "server_id": "xxx",
            "device_id": "0",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "106758159",
            "cluster_id": "1"
        },
        {
            "server_id": "xxx",
            "device_id": "1",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "106758159",
            "cluster_id": "2"
        }...
after:
"prefill_device_list": [
        {
            "server_id": "xxx",
            "device_id": "0",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "104857600",
            "cluster_id": "1"
        },
        {
            "server_id": "xxx",
            "device_id": "1",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "104923137",
            "cluster_id": "2"
        }...

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-16 01:09:18 +08:00
wangxiaoteng666
ee6d141dd4 [MAIN][BUGFIX] BugFix: Resolve the issue of waiting queue accumulation when requests are canceled. (#2426)
### What this PR does / why we need it?
Resolve the issue of waiting queue accumulation when requests are
canceled.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci


- vLLM version: v0.10.1.1
- vLLM main:
006477e60b

---------

Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>
2025-08-29 17:19:23 +08:00
Pleaplusone
4b3a210c33 Implementation of simple load balance routing proxy server (#1953) (#2124)
### What this PR does / why we need it?
The PR is the cherry-pick from v0.9.1
https://github.com/vllm-project/vllm-ascend/pull/1953

This PR introduce a new load balance proxy server example implementation
for disaggregated pd, which support simple token&kv_cache aware load
balance routing strategy for the disaggregated pd system compared with
origin round robin toy_proxy.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested on real workload and unittest

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-04 10:35:53 +08:00