xc-llm-ascend

Files

yuxinshan b0376abd4c [feat] proxy support elastic scaling (#5063)

**[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool:**
https://github.com/vllm-project/vllm-ascend/issues/3380

### What this PR does / why we need it?
Support elastic scaling for P/D instances based on the Mooncake connector
deployment.

**Supported API routes**
* `/instances/add`: add prefill nodes or decode nodes to the list.
* `/instances/remove`: remove prefill nodes or decode nodes from the
list.
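As a rough illustration of what "adding to" and "removing from" the list could mean, here is a minimal in-memory sketch. The class name `InstanceRegistry` and its methods are hypothetical and are not taken from the proxy script; the sketch only assumes that the proxy keeps one list of `host:port` strings per node type.

```python
from typing import Dict, List


class InstanceRegistry:
    """Hypothetical in-memory registry of prefill/decode instances."""

    def __init__(self, prefill: List[str], decode: List[str]) -> None:
        # One ordered list of "host:port" strings per node type.
        self._instances: Dict[str, List[str]] = {
            "prefill": list(prefill),
            "decode": list(decode),
        }

    def add(self, kind: str, instances: List[str]) -> List[str]:
        """Append instances that are not listed yet; return those added."""
        current = self._instances[kind]
        added = []
        for inst in instances:
            if inst not in current:
                current.append(inst)
                added.append(inst)
        return added

    def remove(self, kind: str, instances: List[str]) -> List[str]:
        """Drop instances that are listed; return those removed."""
        current = self._instances[kind]
        removed = []
        for inst in instances:
            if inst in current:
                current.remove(inst)
                removed.append(inst)
        return removed

    def snapshot(self, kind: str) -> List[str]:
        """Return a copy of the current list for one node type."""
        return list(self._instances[kind])
```

Skipping already-listed or unknown instances, instead of raising an error, is a design choice of this sketch; the real proxy may behave differently.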

**Supported functions**
* Support **adding** prefill nodes or decode nodes:
  - If a prefill or decode server is deployed **after the proxy is deployed**,
    the server can use the `/instances/add` API to join the proxy server. The
    prefill or decode server sends a signal to the proxy server, and the proxy
    server checks the status of the node until the node is available.
* Support **removing** prefill nodes or decode nodes:
  - Use the `/instances/remove` API to **delete the node** from the proxy
    server.
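The "check the status of the node until it is available" behaviour can be pictured as a bounded polling loop. The sketch below is an assumption about the general shape, not the proxy's actual code: `wait_until_available` and the `is_available` probe are hypothetical names, and it assumes that the retry limit counts every status check, including the first one.

```python
import time
from typing import Callable


def wait_until_available(
    instance: str,
    is_available: Callable[[str], bool],
    max_retries: int = 3,
    retry_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `instance` until the probe succeeds or the retries run out."""
    for attempt in range(max_retries):
        if is_available(instance):
            return True
        # Wait before the next check, but not after the last failed one.
        if attempt < max_retries - 1:
            sleep(retry_interval)
    return False
```

Passing the probe and the `sleep` function as parameters keeps the loop independent of any particular health-check endpoint and makes it easy to exercise without real delays.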

### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`:

**Add 2 params**

When nodes are added to the proxy, the proxy waits for them to start,
checking their status up to a configured number of times.

| name | type | default | help |
| ----- | ---- | ---- | ---- |
| max-waiting-retries | int | 3 | Maximum number of retries for waiting nodes to be started |
| waiting-retry-interval | float | 10 | Check interval (seconds) for waiting nodes to be started |

For example:
```shell
python load_balance_proxy_server_example.py \
  --host 0.0.0.0 --port 9000 \
  --prefiller-hosts 127.0.0.1 127.0.0.1 \
  --prefiller-ports 8100 8101 \
  --decoder-hosts 127.0.0.1 127.0.0.1 \
  --decoder-ports 8200 8201 \
  --max-waiting-retries 3 \
  --waiting-retry-interval 10
```
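For readers who want to see how two such flags are typically declared, here is a minimal `argparse` sketch. The names, types, defaults, and help strings come from the table above; the function name `build_parser` is hypothetical, and the real script defines more options (hosts, ports, and so on) that are omitted here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare only the two waiting-related flags described above."""
    parser = argparse.ArgumentParser(
        description="Sketch of the two new proxy options"
    )
    parser.add_argument(
        "--max-waiting-retries",
        type=int,
        default=3,
        help="Maximum number of retries for waiting nodes to be started",
    )
    parser.add_argument(
        "--waiting-retry-interval",
        type=float,
        default=10.0,
        help="Check interval (seconds) for waiting nodes to be started",
    )
    return parser
```

`argparse` exposes the parsed values as `args.max_waiting_retries` and `args.waiting_retry_interval`, with dashes converted to underscores.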
**Add 2 API routes**

* Add instances: `/instances/add`

For example, add 2 prefiller instances:
```shell
curl -X POST http://localhost:9000/instances/add \
  -H "Content-Type: application/json" \
  -d '{
        "type": "prefill",
        "instances": ["127.0.0.1:8102", "127.0.0.1:8103"]
      }'
```
Response:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
If the node `127.0.0.1:8103` has not been started yet:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* Remove instances: `/instances/remove`

For example, remove 1 decoder instance:
```shell
curl -X POST http://localhost:9000/instances/remove \
  -H "Content-Type: application/json" \
  -d '{
        "type": "decode",
        "instances": "127.0.0.1:8201"
      }'
```
Response:
```shell
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
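The same two calls can be issued from Python with the standard library alone. This is a sketch rather than an official client: the helper names `build_instances_request` and `send` are hypothetical, the payload shape is copied from the `curl` examples above, and the sketch always sends `instances` as a list, as the add example does. The response is returned as plain text because the example responses shown above use single-quoted lists and may not parse as strict JSON.

```python
import json
import urllib.request
from typing import List


def build_instances_request(
    proxy_url: str, action: str, kind: str, instances: List[str]
) -> urllib.request.Request:
    """Build a POST request for /instances/add or /instances/remove."""
    if action not in ("add", "remove"):
        raise ValueError(f"unsupported action: {action!r}")
    payload = json.dumps({"type": kind, "instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{proxy_url.rstrip('/')}/instances/{action}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example (requires a running proxy on localhost:9000):
# req = build_instances_request(
#     "http://localhost:9000", "remove", "decode", ["127.0.0.1:8201"]
# )
# print(send(req))
```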
### How was this patch tested?
Run the proxy, then use the `/instances/add` API to add nodes and the
`/instances/remove` API to remove nodes.

* vLLM version: v0.11.0.rc3
* vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0.rc3
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>

2025-12-18 14:29:53 +08:00

load_balance_proxy_layerwise_server_example.py

[Bugfix] fix fastapi version (#5047)

2025-12-16 15:58:27 +08:00

load_balance_proxy_server_example.py

[feat] proxy support elastic scaling (#5063)

2025-12-18 14:29:53 +08:00

mooncake_connector_deployment_guide.md

add release note for 0.12.0 (#4995)

2025-12-13 22:09:59 +08:00