[feat] proxy support elastic scaling (#5063)

**[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool:** https://github.com/vllm-project/vllm-ascend/issues/3380

### What this PR does / why we need it?

Support elastic scaling for P/D instances based on a Mooncake connector deployment.

**Supported API routes**

* `/instances/add`: add prefill nodes or decode nodes to the list.
* `/instances/remove`: remove prefill nodes or decode nodes from the list.

**Supported functions**

* Support **adding** prefill nodes or decode nodes.
  - If a prefill or decode server is deployed **after the proxy is deployed**, the server can use the `/instances/add` API to join the proxy server. The prefill or decode server sends a signal to the proxy server, and the proxy server checks the status of the node until the node is available or the retry limit is reached.
* Support **removing** prefill nodes or decode nodes.
  - Use the `/instances/remove` API to **delete the node** from the proxy server.

### Does this PR introduce _any_ user-facing change?

For `examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`:

**Add 2 params**

When adding nodes to the proxy, the proxy waits for the nodes to start, retrying up to a configured number of times.

| name | type | default | help |
| ----- | ---- | ---- | ---- |
| max-waiting-retries | int | 3 | Maximum number of retries for waiting nodes to be started |
| waiting-retry-interval | float | 10 | Check interval (seconds) for waiting nodes to be started |

For example:

```shell
python load_balance_proxy_server_example.py \
  --host 0.0.0.0 --port 9000 \
  --prefiller-hosts 127.0.0.1 127.0.0.1 \
  --prefiller-ports 8100 8101 \
  --decoder-hosts 127.0.0.1 127.0.0.1 \
  --decoder-ports 8200 8201 \
  --max-waiting-retries 3 \
  --waiting-retry-interval 10
```

**Add 2 API routes**

* Add instances: `/instances/add`

For example, add 2 prefiller instances:

```shell
curl -X POST http://localhost:9000/instances/add \
  -H "Content-Type: application/json" \
  -d '{
        "type": "prefill",
        "instances": ["127.0.0.1:8102", "127.0.0.1:8103"]
      }'
```

Response:

```shell
{"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```

If the node '127.0.0.1:8103' has not been started:

```shell
{"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```

* Remove instances: `/instances/remove`

For example, remove 1 decoder instance:

```shell
curl -X POST http://localhost:9000/instances/remove \
  -H "Content-Type: application/json" \
  -d '{
        "type": "decode",
        "instances": "127.0.0.1:8201"
      }'
```

Response:

```shell
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```

### How was this patch tested?

Ran the proxy, then used the `/instances/add` API to add nodes and the `/instances/remove` API to remove nodes.

* vLLM version: v0.11.0.rc3
* vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0.rc3

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
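
For scripted scaling, the `curl` calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper name `build_adjust_request` and the base URL are assumptions for illustration and are not part of this PR. It only builds the request, so it can be inspected without a running proxy.

```python
import json
import urllib.request
from typing import List, Union


def build_adjust_request(base_url: str, mode: str, instance_type: str,
                         instances: Union[str, List[str]]) -> urllib.request.Request:
    """Build a POST request for /instances/add or /instances/remove.

    Hypothetical helper: mirrors the JSON body used in the curl examples.
    """
    if mode not in ("add", "remove"):
        raise ValueError(f"unsupported mode: {mode}")
    if instance_type not in ("prefill", "decode"):
        raise ValueError(f"unsupported instance type: {instance_type}")
    body = json.dumps({"type": instance_type, "instances": instances})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/instances/{mode}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed proxy address, taken from the examples in this commit message.
req = build_adjust_request("http://localhost:9000", "add", "prefill",
                           ["127.0.0.1:8102", "127.0.0.1:8103"])

# Sending requires a running proxy, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Keeping request construction separate from sending makes the payload easy to check in a unit test before pointing the script at a live deployment.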
2025-12-18 14:29:53 +08:00
parent 71e544e259
commit b0376abd4c
1 changed files with 232 additions and 1 deletions
--- a/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
+++ b/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
@@ -77,6 +77,34 @@
 # This will return a JSON object with the status and the number of prefiller
 # and decoder instances.
 #
 # Step 5: Add or Remove Prefiller or Decoder Instances (Optional)
 # ---------------------------------------------------------------
 # You can add or remove prefiller or decoder instances after the proxy is started.
 # For example, add 2 prefiller instances:
 #
 #   curl -X POST http://localhost:9000/instances/add \
 #     -H "Content-Type: application/json" \
 #     -d '{
 #           "type": "prefill",
 #           "instances": ["127.0.0.1:8102", "127.0.0.1:8103"]
 #         }'
 #
 # or remove 1 decoder instance:
 #
 #   curl -X POST http://localhost:9000/instances/remove \
 #     -H "Content-Type: application/json" \
 #     -d '{
 #           "type": "decode",
 #           "instances": "127.0.0.1:8201"
 #         }'
 #
 # This will return a JSON object with the adding or removing info
 # and the current prefiller and decoder instances.
 #
 # When adding instances, if the instances have not started yet,
 # the proxy waits and retries until the instances start
 # or the maximum number of attempts is exceeded.
 #
 # Notes:
 # - You can scale the number of prefiller and decoder servers as needed.
 # - The proxy will round-robin requests to balance load.
@@ -92,10 +120,12 @@ import ipaddress
 import json
 import os
 import sys
 import threading
 import time
 import uuid
 from contextlib import asynccontextmanager
 from dataclasses import dataclass
-from typing import Any, List
+from typing import Any, List, Tuple, Dict
 import httpx
 from fastapi import FastAPI, Request
@@ -113,6 +143,12 @@ except ImportError:
    pass
@dataclass
 class InstanceType:
    PREFILL: str = "prefill"
    DECODE: str = "decode"
 class ServerState:
    def __init__(self, host, port):
@@ -136,10 +172,24 @@ class ServerState:
        self.aborted_requests = set()  # Track aborted requests
        # Removed individual server lock - will use global locks instead
    def __eq__(self, other):
        self_host = self.host.replace("localhost", "0.0.0.0").replace("127.0.0.1", "0.0.0.0")
        other_host = other.host.replace("localhost", "0.0.0.0").replace("127.0.0.1", "0.0.0.0")
        return self_host == other_host and str(self.port) == str(other.port)
    def __hash__(self):
        self_host = self.host.replace("localhost", "0.0.0.0").replace("127.0.0.1", "0.0.0.0")
        return hash((self_host, str(self.port)))
    def __repr__(self):
        return f"{self.host}:{self.port}"
 class ProxyState:
    def __init__(self, prefiller_instances, decoder_instances):
        self.node_listener = NodeListener(self)
        self.prefillers: List[ServerState] = [
            ServerState(h, p) for h, p in prefiller_instances
        ]
@@ -264,10 +314,127 @@ class ProxyState:
    def calculate_decode_scores(self, request_length: int) -> float:
        return request_length
    async def add_instances(
            self, instance_type: str, instances: List[ServerState]
    ) -> Tuple[List[str], List[str]]:
        added_nodes, waiting_nodes = [], []
        for server in instances:
            is_valid = await self.node_listener.check_instance_status(server.client)
            if is_valid and instance_type == InstanceType.PREFILL:
                self.add_prefillers([server])
                added_nodes.append(str(server))
            elif is_valid and instance_type == InstanceType.DECODE:
                self.add_decoders([server])
                added_nodes.append(str(server))
            else:
                node = str(server)
                self.node_listener.waiting_nodes[node] = (instance_type, server, 0)
                waiting_nodes.append(node)
        return added_nodes, waiting_nodes
    def add_prefillers(self, instances: List[ServerState]) -> None:
        num_prefillers = len(self.prefillers)
        for idx, server in enumerate(instances):
            if server not in self.prefillers:
                self.prefillers.append(server)
                # prefiller_heap: [(priority_0, 0, server_0)] -> [(priority_0, 0, server_0), (0, 1, server_1)]
                heapq.heappush(self.prefiller_heap, (0, num_prefillers + idx, server))
        self.print_status(f"Add prefiller instances: {instances}.")
    def add_decoders(self, instances: List[ServerState]) -> None:
        num_decoders = len(self.decoders)
        for idx, server in enumerate(instances):
            if server not in self.decoders:
                self.decoders.append(server)
                # decoder_heap: [(priority_0, 0, server_0)] -> [(priority_0, 0, server_0), (0, 1, server_1)]
                heapq.heappush(self.decoder_heap, (0, num_decoders + idx, server))
        self.print_status(f"Add decoder instances: {instances}.")
    def remove_prefillers(self, instances: List[ServerState]) -> None:
        instances_to_remove = set(instances)
        self.prefillers = [server for server in self.prefillers if server not in instances_to_remove]
        prefiller_heap_copy = self.prefiller_heap.copy()
        prefiller_heap_copy.sort(key=lambda x: x[1])  # sorted by key: prefiller_idx
        prefiller_heap = []
        idx = 0
        for priority, _, server in prefiller_heap_copy:
            if server not in instances_to_remove:
                prefiller_heap.append((priority, idx, server))
                idx += 1
        # prefiller_heap: [(priority_0, 0, server_0), (priority_1, 1, server_1)] -> [(priority_1, 0, server_1)]
        self.prefiller_heap = prefiller_heap
        heapq.heapify(self.prefiller_heap)
        self.print_status(f"Remove prefiller instances: {instances}.")
    def remove_decoders(self, instances: List[ServerState]) -> None:
        instances_to_remove = set(instances)
        self.decoders = [server for server in self.decoders if server not in instances_to_remove]
        decoder_heap_copy = self.decoder_heap.copy()
        decoder_heap_copy.sort(key=lambda x: x[1])  # sorted by key: decoder_idx
        decoder_heap = []
        idx = 0
        for priority, _, server in decoder_heap_copy:
            if server not in instances_to_remove:
                decoder_heap.append((priority, idx, server))
                idx += 1
        # decoder_heap: [(priority_0, 0, server_0), (priority_1, 1, server_1)] -> [(priority_1, 0, server_1)]
        self.decoder_heap = decoder_heap
        heapq.heapify(self.decoder_heap)
        self.print_status(f"Remove decoder instances: {instances}.")
    def print_status(self, msg: str) -> None:
        status = {
            "prefill_instances": [str(server) for server in self.prefillers],
            "decode_instances": [str(server) for server in self.decoders]
        }
        print(f"{msg} Status: {status}")
 proxy_state = None
 class NodeListener:
    def __init__(self, proxy):
        self.proxy_state = proxy
        self.waiting_nodes: Dict[str, Tuple[str, Any, int]] = {}
        self.listening_thread = threading.Thread(target=self._node_listener, daemon=True)
        self.listening_thread.start()
    def _node_listener(self) -> None:
        while True:
            for node, (instance_type, server, check_times) in list(self.waiting_nodes.items()):
                is_valid = asyncio.run(self.check_instance_status(server.client))
                print(f"Checking instance {node}...")
                check_times += 1
                if is_valid:
                    if instance_type == InstanceType.PREFILL:
                        self.proxy_state.add_prefillers([server])
                    else:
                        self.proxy_state.add_decoders([server])
                    self.waiting_nodes.pop(node)
                elif check_times >= global_args.max_waiting_retries:
                    print(f"Instance {node} was not added to the proxy.")
                    self.waiting_nodes.pop(node)
                else:
                    self.waiting_nodes[node] = (instance_type, server, check_times)
            time.sleep(global_args.waiting_retry_interval)
    @staticmethod
    async def check_instance_status(client: httpx.AsyncClient) -> bool:
        endpoint = "/models"
        headers = {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
        }
        try:
            response = await client.get(endpoint, headers=headers)
            response.raise_for_status()
            return True
        except (httpx.RequestError, httpx.HTTPStatusError):
            return False
 def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)
@@ -294,6 +461,15 @@ def parse_args():
        type=float,
        default=0.001,
        help="Base delay (seconds) for exponential backoff retries")
    parser.add_argument("--max-waiting-retries",
                        type=int,
                        default=3,
                        help="Maximum number of retries for waiting nodes to be started")
    parser.add_argument(
        "--waiting-retry-interval",
        type=float,
        default=10,
        help="Check interval (seconds) for waiting nodes to be started")
    args = parser.parse_args()
    if len(args.prefiller_hosts) != len(args.prefiller_ports):
        raise ValueError(
@@ -637,6 +813,51 @@ async def _handle_completions(api: str, request: Request):
        raise
 async def _handle_adjust_instances(adjust_mode: str, request: Request):
    try:
        req_data = await request.json()
        instance_type = req_data.get("type", "")
        instances = req_data.get("instances", [])
        if isinstance(instances, str):
            instances = [instances]
        instances = trans_instances(instances)
        all_msg = f"{adjust_mode} {instance_type} instances: " \
                  f"{[str(server) for server in instances]}."
        if instance_type not in [InstanceType.PREFILL, InstanceType.DECODE]:
            return {"error": f"Instance type {instance_type} is not supported. "
                             f"Only support '{InstanceType.PREFILL}' and '{InstanceType.DECODE}'."}
        if adjust_mode == "add":
            added_nodes, waiting_nodes = await proxy_state.add_instances(
                instance_type, instances
            )
            if waiting_nodes:
                all_msg = f"{adjust_mode} {instance_type} instances: {added_nodes}. " \
                          f"Instances {waiting_nodes} are waiting to be added."
        elif adjust_mode == "remove":
            if instance_type == InstanceType.PREFILL:
                proxy_state.remove_prefillers(instances)
            else:
                proxy_state.remove_decoders(instances)
        return {
            "message": all_msg,
            "current_prefill_instances": [str(prefiller) for prefiller in proxy_state.prefillers],
            "current_decode_instances": [str(decoder) for decoder in proxy_state.decoders]
        }
    except Exception as e:
        logger.error(f"Failed to {adjust_mode} instances: {e}")
        raise
 def trans_instances(instances: List[str]) -> List[ServerState]:
    server_list = []
    for instance in instances:
        h, p = instance.split(":")
        server_list.append(ServerState(h, int(p)))
    return server_list
@app.post("/v1/completions")
@with_cancellation
 async def handle_completions(request: Request):
@@ -658,6 +879,16 @@ async def healthcheck():
    }
@app.post("/instances/add")
 async def handle_add_instances(request: Request):
    return await _handle_adjust_instances("add", request)
@app.post("/instances/remove")
 async def handle_remove_instances(request: Request):
    return await _handle_adjust_instances("remove", request)
 if __name__ == '__main__':
    global global_args
    global_args = parse_args()
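
As a standalone illustration of the removal path in `remove_prefillers` and `remove_decoders`, the sketch below drops entries from a `(priority, index, server)` heap and renumbers the surviving indices from 0 before restoring the heap invariant. The names `HeapEntry` and `remove_from_heap` are hypothetical, and plain strings stand in for `ServerState` objects; this is an approximation of the logic in the diff, not the code that was merged.

```python
import heapq
from typing import List, Set, Tuple

# Hypothetical stand-in for the proxy's heap entries: (priority, index, server).
HeapEntry = Tuple[float, int, str]


def remove_from_heap(heap: List[HeapEntry], to_remove: Set[str]) -> List[HeapEntry]:
    """Return a new heap without the removed servers, with indices renumbered."""
    # Keep survivors in their original index order so relative order is stable.
    survivors = sorted(
        (entry for entry in heap if entry[2] not in to_remove),
        key=lambda entry: entry[1],
    )
    # Preserve each survivor's priority, but assign contiguous indices 0..n-1.
    rebuilt = [(priority, idx, server)
               for idx, (priority, _, server) in enumerate(survivors)]
    heapq.heapify(rebuilt)  # restore the min-heap property on priority
    return rebuilt


heap: List[HeapEntry] = [
    (5, 0, "127.0.0.1:8200"),
    (2, 1, "127.0.0.1:8201"),
    (7, 2, "127.0.0.1:8202"),
]
heapq.heapify(heap)
rebuilt = remove_from_heap(heap, {"127.0.0.1:8200"})
```

Renumbering keeps the index field contiguous, so a later `heappush` that derives the next index from the list length does not collide with an existing entry.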