[router] Update doc for dynamic scaling and fault tolerance (#2454)
@@ -7,14 +7,14 @@ The router is an independent Python package, and it can be used as a drop-in repl
|
|||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pip install sglang-router
|
$ pip install sglang-router
|
||||||
```
|
```
|
||||||
|
|
||||||
Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang/launch_server.py). Also, you can directly run the following command to see the usage of the router.
|
Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang/launch_server.py). Also, you can directly run the following command to see the usage of the router.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m sglang_router.launch_server --help
|
$ python -m sglang_router.launch_server --help
|
||||||
python -m sglang_router.launch_router --help
|
$ python -m sglang_router.launch_router --help
|
||||||
```
|
```
|
||||||
|
|
||||||
The router supports two working modes:
|
The router supports two working modes:
|
||||||
@@ -27,7 +27,7 @@ The router supports two working modes:
|
|||||||
This will be a drop-in replacement for the existing `--dp-size` argument of SGLang Runtime. Under the hood, it uses multiple processes to launch the workers, waits for them to be ready, then connects the router to all workers.
|
This will be a drop-in replacement for the existing `--dp-size` argument of SGLang Runtime. Under the hood, it uses multiple processes to launch the workers, waits for them to be ready, then connects the router to all workers.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
|
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
|
||||||
```
|
```
|
||||||
|
|
||||||
After the server is ready, you can send requests to the router in the same way as you would send them to a single worker.
|
After the server is ready, you can send requests to the router in the same way as you would send them to a single worker.
|
||||||
@@ -47,12 +47,62 @@ print(response.json())
|
|||||||
This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.
|
This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
|
$ python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
|
||||||
```
|
```
|
||||||
|
|
||||||
## Strategies
|
## Dynamic Scaling APIs
|
||||||
|
|
||||||
### Cache-Aware Load-Balancing Router
|
We offer `/add_worker` and `/remove_worker` APIs to dynamically add or remove workers from the router.
|
||||||
|
|
||||||
|
- `/add_worker`
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ curl -X POST http://localhost:30000/add_worker?url=http://worker_url_1
|
||||||
|
```
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001
|
||||||
|
$ curl -X POST http://localhost:30000/add_worker?url=http://127.0.0.1:30001
|
||||||
|
Successfully added worker: http://127.0.0.1:30001
|
||||||
|
```
|
||||||
|
|
||||||
|
- `/remove_worker`
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ curl -X POST http://localhost:30000/remove_worker?url=http://worker_url_1
|
||||||
|
```
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ curl -X POST http://localhost:30000/remove_worker?url=http://127.0.0.1:30001
|
||||||
|
Successfully removed worker: http://127.0.0.1:30001
|
||||||
|
```
|
||||||
|
|
||||||
|
Note:
|
||||||
|
|
||||||
|
- With the cache-aware router, a removed worker is also removed from the tree and the queues.
|
||||||
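The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the router listens on `http://localhost:30000` as in the examples, and the helper names `build_worker_request` and `call_router` are hypothetical, not part of `sglang_router`.

```python
import urllib.request

ROUTER_URL = "http://localhost:30000"  # assumed router address, as in the curl examples


def build_worker_request(action: str, worker_url: str) -> urllib.request.Request:
    """Build the POST request for /add_worker or /remove_worker (hypothetical helper)."""
    if action not in ("add_worker", "remove_worker"):
        raise ValueError(f"unsupported action: {action}")
    # Mirror the curl examples: the worker URL is passed as the `url` query parameter.
    return urllib.request.Request(
        f"{ROUTER_URL}/{action}?url={worker_url}", method="POST"
    )


def call_router(action: str, worker_url: str) -> str:
    """Send the request and return the router's text response."""
    with urllib.request.urlopen(build_worker_request(action, worker_url)) as resp:
        return resp.read().decode()
```

With a router and a worker running, `call_router("add_worker", "http://127.0.0.1:30001")` would send the same request as the curl example.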
|
|
||||||
|
## Fault Tolerance
|
||||||
|
|
||||||
|
The router provides retry-based fault tolerance.
|
||||||
|
|
||||||
|
1. If a request to a worker fails `max_worker_retries` times, the router removes that worker and moves on to the next worker.
|
||||||
|
2. If the total number of retries exceeds `max_total_retries`, the router will return an error.
|
||||||
|
|
||||||
|
Note:
|
||||||
|
|
||||||
|
- `max_worker_retries` is 3 and `max_total_retries` is 6 by default.
|
||||||
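The two retry rules can be illustrated with a small simulation. This is a sketch, not the router's Rust implementation: `route_with_retries` and the `send` callable are hypothetical, and the way attempts are counted against `max_total_retries` is an assumption.

```python
from typing import Callable, List

MAX_WORKER_RETRIES = 3  # default stated in the note above
MAX_TOTAL_RETRIES = 6   # default stated in the note above


def route_with_retries(workers: List[str], send: Callable[[str], bool]) -> str:
    """Return the worker that served the request, or raise after too many failures.

    `send` is a hypothetical callable that returns True on success.
    Workers that keep failing are removed from `workers` in place.
    """
    total_attempts = 0
    while workers:
        worker = workers[0]
        for _ in range(MAX_WORKER_RETRIES):
            if total_attempts >= MAX_TOTAL_RETRIES:
                raise RuntimeError("exceeded max_total_retries")
            total_attempts += 1
            if send(worker):
                return worker
        # The worker failed max_worker_retries times in a row: drop it.
        workers.remove(worker)
    raise RuntimeError("no healthy workers left")
```

In this sketch a request that hits two dead workers uses up all six attempts, so a third worker is never tried.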
|
|
||||||
|
## Routing Strategies
|
||||||
|
|
||||||
|
#### Cache-Aware Load-Balancing Router
|
||||||
|
|
||||||
The native router combines two strategies to optimize both cache utilization and request distribution:
|
The native router combines two strategies to optimize both cache utilization and request distribution:
|
||||||
|
|
||||||
|
|||||||
126
rust/README.md
@@ -2,115 +2,13 @@
|
|||||||
|
|
||||||
SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
|
SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
|
||||||
|
|
||||||
## Installation
|
## User docs
|
||||||
|
|
||||||
```bash
|
Please check https://sgl-project.github.io/router/router.html
|
||||||
pip install sglang-router
|
|
||||||
```
|
|
||||||
|
|
||||||
## Usage
|
## Developer docs
|
||||||
The router offers two modes:
|
|
||||||
|
|
||||||
### 1. Co-launch workers and router
|
### Prerequisites
|
||||||
This will be a drop-in replacement for the existing `--dp-size`. This part of code will be moved into sglang core.
|
|
||||||
Under the hood, it uses multi-processes to launch multiple sglang workers, wait for them to be healthy, then launch the router.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Launch only router
|
|
||||||
This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000
|
|
||||||
|
|
||||||
$ python -m sglang_router.launch_router --help
|
|
||||||
usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
|
|
||||||
[--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
|
|
||||||
[--balance-abs-threshold BALANCE_ABS_THRESHOLD] [--balance-rel-threshold BALANCE_REL_THRESHOLD]
|
|
||||||
[--eviction-interval EVICTION_INTERVAL] [--max-tree-size MAX_TREE_SIZE]
|
|
||||||
|
|
||||||
options:
|
|
||||||
-h, --help show this help message and exit
|
|
||||||
--host HOST Host address to bind the router server (default: 127.0.0.1)
|
|
||||||
--port PORT Port number to bind the router server (default: 30000)
|
|
||||||
--worker-urls WORKER_URLS [WORKER_URLS ...]
|
|
||||||
List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
|
|
||||||
--policy {random,round_robin,cache_aware}
|
|
||||||
Load balancing policy to use (default: cache_aware)
|
|
||||||
--cache-threshold CACHE_THRESHOLD
|
|
||||||
Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
|
|
||||||
--balance-abs-threshold BALANCE_ABS_THRESHOLD
|
|
||||||
Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 32)
|
|
||||||
--balance-rel-threshold BALANCE_REL_THRESHOLD
|
|
||||||
Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 1.0001)
|
|
||||||
--eviction-interval EVICTION_INTERVAL
|
|
||||||
Interval in seconds between cache eviction operations (default: 60)
|
|
||||||
--max-tree-size MAX_TREE_SIZE
|
|
||||||
Maximum size of the approximation tree for cache-aware routing (default: 16777216)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Strategy
|
|
||||||
|
|
||||||
### Cache-Aware Load-Balancing Router
|
|
||||||
|
|
||||||
This router combines two strategies to optimize both cache utilization and request distribution:
|
|
||||||
|
|
||||||
1. Cache-Aware Routing (Approximate Tree)
|
|
||||||
2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)
|
|
||||||
|
|
||||||
The router dynamically switches between these strategies based on load conditions:
|
|
||||||
- Uses load balancing when the system is imbalanced
|
|
||||||
- Uses cache-aware routing when the system is balanced
|
|
||||||
|
|
||||||
A system is considered imbalanced if both conditions are met:
|
|
||||||
1. (max_load - min_load) > balance_abs_threshold
|
|
||||||
2. max_load > balance_rel_threshold * min_load
|
|
||||||
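The imbalance test can be written as a single predicate. This is a minimal sketch: the function name `is_imbalanced` and the `loads` list of per-worker pending request counts are assumptions, and the defaults are the ones listed in the `--help` output.

```python
from typing import Sequence


def is_imbalanced(
    loads: Sequence[int],
    balance_abs_threshold: int = 32,
    balance_rel_threshold: float = 1.0001,
) -> bool:
    """Return True when both imbalance conditions hold for the given worker loads."""
    max_load, min_load = max(loads), min(loads)
    return (
        (max_load - min_load) > balance_abs_threshold
        and max_load > balance_rel_threshold * min_load
    )
```

Because both conditions must hold, a large absolute gap alone is not enough if the relative threshold is raised: `is_imbalanced([1000, 960], balance_rel_threshold=1.5)` is `False`.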
|
|
||||||
#### 1. Cache-Aware Routing (Approximate Tree)
|
|
||||||
This strategy maintains an approximate radix tree for each worker based on request history,
|
|
||||||
eliminating the need for direct cache state queries. The tree stores raw text characters
|
|
||||||
instead of token IDs to avoid tokenization overhead.
|
|
||||||
|
|
||||||
Process:
|
|
||||||
- For each request, find the worker with the highest prefix match
|
|
||||||
- If match rate > cache_threshold:
|
|
||||||
- Route to the worker with highest match (likely has relevant data cached)
|
|
||||||
- If match rate ≤ cache_threshold:
|
|
||||||
- Route to the worker with smallest tree size (most available cache capacity)
|
|
||||||
- Background maintenance:
|
|
||||||
- Periodically evict least recently used leaf nodes to prevent memory overflow
|
|
||||||
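The selection step of the process above can be sketched as follows. This is a simplified illustration, not the router's implementation: it replaces the approximate radix tree with a plain list of past request texts per worker, defines the match rate as the longest common prefix length divided by the request length, and approximates tree size by the total number of stored characters. All names are hypothetical.

```python
import os
from typing import Dict, List


def match_rate(text: str, history: List[str]) -> float:
    """Longest shared prefix with any past request, as a fraction of `text`."""
    if not text:
        return 0.0
    best = max((len(os.path.commonprefix([text, h])) for h in history), default=0)
    return best / len(text)


def pick_worker(
    text: str, histories: Dict[str, List[str]], cache_threshold: float = 0.5
) -> str:
    """Choose a worker for `text` given each worker's request history."""
    rates = {worker: match_rate(text, hist) for worker, hist in histories.items()}
    best = max(rates, key=rates.get)
    if rates[best] > cache_threshold:
        # High prefix match: this worker likely has the relevant data cached.
        return best
    # Low match: prefer the worker with the smallest (approximated) tree.
    return min(histories, key=lambda w: sum(len(h) for h in histories[w]))
```

The sketch omits the background LRU eviction and does not update the history after routing.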
|
|
||||||
#### 2. Load-Balancing (Shortest Queue)
|
|
||||||
This strategy tracks pending request counts per worker and routes new requests
|
|
||||||
to the least busy worker when the system is detected to be imbalanced. This helps
|
|
||||||
maintain optimal load distribution across workers.
|
|
||||||
|
|
||||||
### Configuration Parameters
|
|
||||||
|
|
||||||
1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5)
|
|
||||||
- Minimum prefix match ratio to use highest-match routing
|
|
||||||
- Below this threshold, routes to worker with most available cache space
|
|
||||||
|
|
||||||
2. `balance_abs_threshold`: (integer, default: 32)
|
|
||||||
- Absolute difference threshold for load imbalance detection
|
|
||||||
- System is potentially imbalanced if (max_load - min_load) > abs_threshold
|
|
||||||
|
|
||||||
3. `balance_rel_threshold`: (float, default: 1.0001)
|
|
||||||
- Relative ratio threshold for load imbalance detection
|
|
||||||
- System is potentially imbalanced if max_load > min_load * rel_threshold
|
|
||||||
- Used in conjunction with abs_threshold to determine final imbalance state
|
|
||||||
|
|
||||||
4. `eviction_interval`: (integer, default: 60)
|
|
||||||
- Interval in seconds between LRU eviction cycles for the approximate trees
|
|
||||||
- Background thread periodically evicts least recently used nodes to maintain tree size
|
|
||||||
|
|
||||||
5. `max_tree_size`: (integer, default: 16777216)
|
|
||||||
- Maximum nodes per tree
|
|
||||||
- When exceeded, LRU leaf nodes are evicted during the next eviction cycle
|
|
||||||
|
|
||||||
## Development
|
|
||||||
|
|
||||||
- Rust and Cargo installed
|
- Rust and Cargo installed
|
||||||
|
|
||||||
@@ -134,7 +32,7 @@ cargo --version
|
|||||||
#### 1. Build Rust Project
|
#### 1. Build Rust Project
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cargo build
|
$ cargo build
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 2. Build Python Binding
|
#### 2. Build Python Binding
|
||||||
@@ -142,13 +40,19 @@ cargo build
|
|||||||
##### Option A: Build and Install Wheel
|
##### Option A: Build and Install Wheel
|
||||||
1. Build the wheel package:
|
1. Build the wheel package:
|
||||||
```bash
|
```bash
|
||||||
pip install setuptools-rust wheel build
|
$ pip install setuptools-rust wheel build
|
||||||
python -m build
|
$ python -m build
|
||||||
```
|
```
|
||||||
|
|
||||||
2. Install the generated wheel:
|
2. Install the generated wheel:
|
||||||
```bash
|
```bash
|
||||||
pip install <path-to-wheel>
|
$ pip install <path-to-wheel>
|
||||||
|
```
|
||||||
|
|
||||||
|
If you want one handy command to do build + install for every change you make:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ python -m build && pip install --force-reinstall dist/*.whl
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Option B: Development Mode
|
##### Option B: Development Mode
|
||||||
@@ -158,7 +62,7 @@ For development purposes, you can install the package in editable mode:
|
|||||||
Warning: The editable Python binding can suffer from performance degradation. Build a fresh wheel after every update if you want to test performance.
|
Warning: The editable Python binding can suffer from performance degradation. Build a fresh wheel after every update if you want to test performance.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pip install -e .
|
$ pip install -e .
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
|
**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
|
||||||
|
|||||||
@@ -118,7 +118,7 @@ async fn remove_worker(
|
|||||||
None => return HttpResponse::BadRequest().finish(),
|
None => return HttpResponse::BadRequest().finish(),
|
||||||
};
|
};
|
||||||
data.router.remove_worker(&worker_url);
|
data.router.remove_worker(&worker_url);
|
||||||
HttpResponse::Ok().finish()
|
HttpResponse::Ok().body(format!("Successfully removed worker: {}", worker_url))
|
||||||
}
|
}
|
||||||
|
|
||||||
pub struct ServerConfig {
|
pub struct ServerConfig {
|
||||||
|
|||||||