[Bugfix] Route requests requiring KVC recomputation from the decode instance to the P instance (#3448)

### What this PR does / why we need it?
This PR fixes an out-of-memory (OOM) bug triggered by recomputation in the
decode instance. When recomputation happens in decode, KV cache usage may
exceed the pre-allocated memory and cause an OOM.

So we propose a new scheduling strategy: when the decode instance cannot
allocate a new block for running requests, it stops the request that
would otherwise be preempted. These stopped requests are recognized by the
proxy and sent to the prefill instance again to recompute the KV cache
(KVC), and are then directed back to the decode instance.
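
A minimal sketch of the strategy described above, for illustration only. It is not the code in `vllm_ascend/core/recompute_scheduler.py`; the names `DecodeScheduler`, `Status.STOPPED_FOR_RECOMPUTE`, and `proxy_route` are hypothetical, and the block accounting is reduced to simple counters. It assumes the victim is the last request in the running queue.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    RUNNING = auto()
    STOPPED_FOR_RECOMPUTE = auto()  # hypothetical marker the proxy looks for


@dataclass
class Request:
    req_id: str
    blocks: int = 0
    status: Status = Status.RUNNING


@dataclass
class DecodeScheduler:
    free_blocks: int
    running: list = field(default_factory=list)

    def add(self, req, blocks):
        # Admit a request whose transferred KV cache occupies `blocks` blocks.
        assert blocks <= self.free_blocks
        self.free_blocks -= blocks
        req.blocks = blocks
        self.running.append(req)

    def step(self):
        """Try to give each running request one more block.

        When no block is free, the request that would have been preempted
        (assumed here: the last one in the queue) is stopped instead of
        being recomputed locally. Returns the stopped requests.
        """
        stopped = []
        i = 0
        while i < len(self.running):
            req = self.running[i]
            while self.free_blocks == 0:
                victim = self.running.pop()
                self.free_blocks += victim.blocks
                victim.blocks = 0
                victim.status = Status.STOPPED_FOR_RECOMPUTE
                stopped.append(victim)
                if victim is req:
                    break  # the current request itself was stopped
            else:
                # A block is available: allocate it and move on.
                self.free_blocks -= 1
                req.blocks += 1
                i += 1
        return stopped


def proxy_route(stopped):
    # Hypothetical proxy: stopped requests go through prefill again, then decode.
    return {
        r.req_id: ["prefill", "decode"]
        for r in stopped
        if r.status is Status.STOPPED_FOR_RECOMPUTE
    }
```

With a pool of 4 blocks and two requests holding 2 blocks each, one `step()` stops the second request, frees its blocks for the first, and the proxy sketch routes the stopped request back through prefill.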

This is a temporary fix for the bug. The long-term strategy is to
use CPU offload in the decode instance.

### Does this PR introduce _any_ user-facing change?
An extra Ascend configuration option **-- recompute_scheduler_enable =
True** is added to enable this strategy. The default value is False.

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

Author: Shirley125, 2025-10-18 15:56:44 +08:00 (committed by GitHub)

parent 4750d45d86

commit b4233a2ec3

6 changed files with 1761 additions and 114 deletions

1392 vllm_ascend/core/recompute_scheduler.py (normal file; diff suppressed because it is too large)
