update router doc (#2143)
This commit is contained in:
115
rust/README.md
115
rust/README.md
@@ -1,8 +1,101 @@
|
||||
# SGLang Router (Experimental)
|
||||
# SGLang Router
|
||||
|
||||
SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
|
||||
|
||||
## Prerequisites
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install sglang-router
|
||||
```
|
||||
|
||||
## Usage
|
||||
The router offers two modes:
|
||||
|
||||
### 1. Co-launch workers and router
|
||||
This will be a drop-in replacement for the existing `--dp-size`. This part of code will be moved into sglang core.
|
||||
Under the hood, it uses multi-processes to launch multiple sglang workers, wait for them to be healthy, then launch the router.
|
||||
|
||||
```bash
|
||||
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8
|
||||
```
|
||||
|
||||
### 2. Launch only router
|
||||
This is useful if you for multi node DP. You can launch workers on different nodes, then connect the router to them.
|
||||
|
||||
```bash
|
||||
$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000
|
||||
|
||||
$ python -m sglang_router.launch_router --help
|
||||
usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
|
||||
[--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
|
||||
[--cache-routing-prob CACHE_ROUTING_PROB] [--eviction-interval EVICTION_INTERVAL]
|
||||
[--max-tree-size MAX_TREE_SIZE]
|
||||
|
||||
options:
|
||||
-h, --help show this help message and exit
|
||||
--host HOST Host address to bind the router server (default: 127.0.0.1)
|
||||
--port PORT Port number to bind the router server (default: 30000)
|
||||
--worker-urls WORKER_URLS [WORKER_URLS ...]
|
||||
List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
|
||||
--policy {random,round_robin,cache_aware}
|
||||
Load balancing policy to use (default: cache_aware)
|
||||
--cache-threshold CACHE_THRESHOLD
|
||||
Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
|
||||
--cache-routing-prob CACHE_ROUTING_PROB
|
||||
Probability of using cache-aware routing (0.0-1.0) (default: 1.0)
|
||||
--eviction-interval EVICTION_INTERVAL
|
||||
Interval in seconds between cache eviction operations (default: 60)
|
||||
--max-tree-size MAX_TREE_SIZE
|
||||
Maximum size of the approximation tree for cache-aware routing (default: 16777216)
|
||||
```
|
||||
|
||||
## Strategy
|
||||
|
||||
### Cache-Aware Load-Balancing Router
|
||||
|
||||
This router combines two strategies to optimize both cache utilization and request distribution:
|
||||
|
||||
1. Cache-Aware Routing (Approximate Tree)
|
||||
2. Load-Balancing Routing (Shortest Queue)
|
||||
|
||||
#### 1. Cache-Aware Routing (Approximate Tree)
|
||||
This strategy maintains an approximate radix tree for each worker based on request history,
|
||||
eliminating the need for direct cache state queries. The tree stores raw text characters
|
||||
instead of token IDs to avoid tokenization overhead.
|
||||
|
||||
Process:
|
||||
- For each request, find the worker with the highest prefix match
|
||||
- If match rate > cache_threshold:
|
||||
- Route to the worker with highest match (likely has relevant data cached)
|
||||
- If match rate ≤ cache_threshold:
|
||||
- Route to the worker with smallest tree size (most available cache capacity)
|
||||
- Background maintenance:
|
||||
- Periodically evict least recently used leaf nodes to prevent memory overflow
|
||||
|
||||
#### 2. Load-Balancing (Shortest Queue)
|
||||
This strategy tracks pending request counts per worker and routes new requests
|
||||
to the least busy worker for optimal load distribution.
|
||||
|
||||
### Configuration Parameters
|
||||
|
||||
1. `cache_routing_prob`: (float, 0.0 to 1.0)
|
||||
- 0.0: Exclusively use load balancing
|
||||
- 1.0: Exclusively use cache-aware routing
|
||||
- Between 0-1: Probability of using cache-aware routing vs load balancing
|
||||
|
||||
2. `cache_threshold`: (float, 0.0 to 1.0)
|
||||
- Minimum prefix match ratio to use highest-match routing
|
||||
- Below this threshold, routes to worker with most available cache space
|
||||
|
||||
3. `eviction_interval_secs`: (integer)
|
||||
- Interval between LRU eviction cycles for the approximate trees
|
||||
|
||||
4. `max_tree_size`: (integer)
|
||||
- Maximum nodes per tree
|
||||
- When exceeded, LRU leaf nodes are evicted during the next eviction cycle
|
||||
|
||||
|
||||
## Development
|
||||
|
||||
- Rust and Cargo installed
|
||||
|
||||
@@ -21,17 +114,17 @@ cargo --version
|
||||
- Python with pip installed
|
||||
|
||||
|
||||
## Build Process
|
||||
### Build Process
|
||||
|
||||
### 1. Build Rust Project
|
||||
#### 1. Build Rust Project
|
||||
|
||||
```bash
|
||||
cargo build
|
||||
```
|
||||
|
||||
### 2. Build Python Binding
|
||||
#### 2. Build Python Binding
|
||||
|
||||
#### Option A: Build and Install Wheel
|
||||
##### Option A: Build and Install Wheel
|
||||
1. Build the wheel package:
|
||||
```bash
|
||||
pip install setuptools-rust wheel build
|
||||
@@ -43,7 +136,7 @@ python -m build
|
||||
pip install <path-to-wheel>
|
||||
```
|
||||
|
||||
#### Option B: Development Mode
|
||||
##### Option B: Development Mode
|
||||
|
||||
For development purposes, you can install the package in editable mode:
|
||||
|
||||
@@ -55,21 +148,21 @@ pip install -e .
|
||||
|
||||
**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
|
||||
|
||||
## CI/CD Setup
|
||||
### CI/CD Setup
|
||||
|
||||
The continuous integration pipeline consists of three main steps:
|
||||
|
||||
### 1. Build Wheels
|
||||
#### 1. Build Wheels
|
||||
- Uses `cibuildwheel` to create manylinux x86_64 packages
|
||||
- Compatible with major Linux distributions (Ubuntu, CentOS, etc.)
|
||||
- Additional configurations can be added to support other OS/architectures
|
||||
- Reference: [cibuildwheel documentation](https://cibuildwheel.pypa.io/en/stable/)
|
||||
|
||||
### 2. Build Source Distribution
|
||||
#### 2. Build Source Distribution
|
||||
- Creates a source distribution containing the raw, unbuilt code
|
||||
- Enables `pip` to build the package from source when prebuilt wheels are unavailable
|
||||
|
||||
### 3. Publish to PyPI
|
||||
#### 3. Publish to PyPI
|
||||
- Uploads both wheels and source distribution to PyPI
|
||||
|
||||
The CI configuration is based on the [tiktoken workflow](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/.github/workflows/build_wheels.yml#L1).
|
||||
|
||||
Reference in New Issue
Block a user