# SGLang Router
The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.
## Key Features
- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution
- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies
- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation
- **Dynamic Scaling**: Add or remove workers at runtime without service interruption
- **Kubernetes Integration**: Native service discovery and pod management
- **Prefill-Decode Disaggregation**: Support for disaggregated serving load balancing
- **Prometheus Metrics**: Built-in observability and monitoring
## Installation
```bash
pip install sglang-router
```
## Quick Start
To see all available options:
```bash
python -m sglang_router.launch_server --help # Co-launch router and workers
python -m sglang_router.launch_router --help # Launch router only
```
## Deployment Modes
The router supports three primary deployment patterns:
1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments)
2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups)
3. **Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving
### Mode 1: Co-launch Router and Workers
This mode launches both the router and multiple worker instances in a single command. It's the simplest deployment option and replaces the `--dp-size` argument of SGLang Runtime.
```bash
# Launch router with 4 workers
python -m sglang_router.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--dp-size 4 \
--host 0.0.0.0 \
--port 30000
```
#### Sending Requests
Once the server is ready, send requests to the router endpoint:
```python
import requests

# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}
response = requests.post(url, json=data)
print(response.json())

# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
}
response = requests.post(url, json=data)
print(response.json())
```
### Mode 2: Separate Launch Mode
This mode is ideal for multi-node deployments where workers run on different machines.
#### Step 1: Launch Workers
On each worker node:
```bash
# Worker node 1
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Worker node 2
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001
```
#### Step 2: Launch Router
On the router node:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--host 0.0.0.0 \
--port 30000 \
--policy cache_aware # or random, round_robin, power_of_two
```
### Mode 3: Prefill-Decode Disaggregation
This advanced mode separates prefill and decode operations for optimized performance:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 9000 \
--prefill http://prefill2:8001 9001 \
--decode http://decode1:8002 \
--decode http://decode2:8003 \
--prefill-policy cache_aware \
--decode-policy round_robin
```
#### Understanding --prefill Arguments
The `--prefill` flag accepts URLs with optional bootstrap ports:
- `--prefill http://server:8000` - No bootstrap port
- `--prefill http://server:8000 9000` - Bootstrap port 9000
- `--prefill http://server:8000 none` - Explicitly no bootstrap port
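
The three accepted forms can be illustrated with a small parser. This is a minimal sketch rather than the router's actual argument handling: `parse_prefill_entry` is a hypothetical helper, and treating `none` case-insensitively is an assumption.

```python
from typing import Optional, Sequence, Tuple


def parse_prefill_entry(values: Sequence[str]) -> Tuple[str, Optional[int]]:
    """Turn the values following one --prefill flag into (url, bootstrap_port).

    Hypothetical helper for illustration; not part of sglang_router.
    """
    if not 1 <= len(values) <= 2:
        raise ValueError("--prefill expects a URL and an optional bootstrap port")
    url = values[0]
    # A missing second value or the literal "none" both mean "no bootstrap port".
    if len(values) == 1 or values[1].lower() == "none":
        return url, None
    return url, int(values[1])


print(parse_prefill_entry(["http://server:8000"]))
print(parse_prefill_entry(["http://server:8000", "9000"]))
print(parse_prefill_entry(["http://server:8000", "none"]))
```
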
#### Policy Inheritance in PD Mode
The router resolves the policies for prefill and decode nodes as follows:
1. **Only `--policy` specified**: Both prefill and decode nodes use this policy
2. **`--policy` and `--prefill-policy` specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--policy`
3. **`--policy` and `--decode-policy` specified**: Prefill nodes use `--policy`, decode nodes use `--decode-policy`
4. **All three specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--decode-policy` (main `--policy` is ignored)

Example with mixed policies:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 \
--prefill http://prefill2:8000 \
--decode http://decode1:8001 \
--decode http://decode2:8001 \
--policy round_robin \
--prefill-policy cache_aware # Prefill uses cache_aware and decode uses round_robin from --policy
```
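
The inheritance rules above reduce to "use the specific policy if given, otherwise fall back to `--policy`". The following sketch shows that logic; `resolve_pd_policies` is a hypothetical function written for illustration, not the router's code, and its default mirrors the documented `--policy` default of `cache_aware`.

```python
from typing import Optional, Tuple


def resolve_pd_policies(
    policy: str = "cache_aware",
    prefill_policy: Optional[str] = None,
    decode_policy: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (prefill policy, decode policy) following the four rules above."""
    # Each side uses its own override when present, else the main policy.
    prefill = prefill_policy if prefill_policy is not None else policy
    decode = decode_policy if decode_policy is not None else policy
    return prefill, decode


# Mirrors the mixed-policy example: prefill overrides, decode inherits.
print(resolve_pd_policies("round_robin", prefill_policy="cache_aware"))
```
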
#### PD Mode with Service Discovery
For Kubernetes deployments with separate prefill and decode server pools:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server tier=gpu \
--decode-selector app=decode-server tier=cpu \
--service-discovery-namespace production \
--prefill-policy cache_aware \
--decode-policy round_robin
```
## Dynamic Scaling
The router supports runtime scaling through REST APIs:
### Adding Workers
```bash
# Launch a new worker
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30001
# Add it to the router
curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"
```
### Removing Workers
```bash
curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"
```
**Note**: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
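
For scripted scaling, the same endpoints can be called from Python. The sketch below only builds the request URL; `worker_admin_url` is a hypothetical helper, and it assumes the router accepts a percent-encoded `url` query parameter (the `curl` examples above pass it unencoded).

```python
from urllib.parse import urlencode


def worker_admin_url(router_url: str, action: str, worker_url: str) -> str:
    """Build the URL for the /add_worker or /remove_worker endpoint."""
    if action not in ("add_worker", "remove_worker"):
        raise ValueError(f"unsupported action: {action}")
    # urlencode percent-encodes the worker URL so it is safe as a query value.
    query = urlencode({"url": worker_url})
    return f"{router_url.rstrip('/')}/{action}?{query}"


print(worker_admin_url("http://localhost:30000", "add_worker", "http://127.0.0.1:30001"))
# To send it: requests.post(worker_admin_url(...)) with the `requests` package.
```
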
## Fault Tolerance
The router includes comprehensive fault tolerance mechanisms:
### Retry Configuration
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--retry-max-retries 3 \
--retry-initial-backoff-ms 100 \
--retry-max-backoff-ms 10000 \
--retry-backoff-multiplier 2.0 \
--retry-jitter-factor 0.1
```
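
To get a feel for how these settings interact, the sketch below computes a delay per retry attempt using a common exponential-backoff-with-jitter scheme. The exact formula the router applies is not stated here, so treat `backoff_ms` and its jitter model (uniform within plus or minus `jitter` of the capped delay) as assumptions.

```python
import random


def backoff_ms(attempt, initial_ms=100, max_ms=10_000, multiplier=2.0,
               jitter=0.1, rng=random.random):
    """Delay in milliseconds before retry number `attempt` (0-based).

    Assumed scheme: exponential growth, capped at max_ms, then jittered.
    """
    base = min(initial_ms * multiplier ** attempt, max_ms)
    # rng() is in [0, 1); map it to a symmetric offset of +/- jitter * base.
    offset = (2 * rng() - 1) * jitter * base
    return base + offset


# Delays without jitter for the default settings and three retries.
print([backoff_ms(n, jitter=0.0) for n in range(3)])
```

With these defaults the un-jittered delay doubles on each attempt until it reaches the `--retry-max-backoff-ms` cap.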
### Circuit Breaker
The circuit breaker protects against cascading failures:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--cb-failure-threshold 5 \
--cb-success-threshold 2 \
--cb-timeout-duration-secs 30 \
--cb-window-duration-secs 60
```
**Behavior**:
- Worker is marked unhealthy after `--cb-failure-threshold` consecutive failures
- Returns to service after `--cb-success-threshold` successful health checks
- Circuit breaker can be disabled with `--disable-circuit-breaker`
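
The behavior above resembles the classic closed / open / half-open circuit breaker pattern. The sketch below is a simplified model of that pattern, not the router's implementation: the `CircuitBreaker` class is hypothetical, it assumes `--cb-timeout-duration-secs` is the wait before a probe is allowed, and it ignores `--cb-window-duration-secs`.

```python
import time


class CircuitBreaker:
    """Simplified per-worker breaker (illustrative assumptions only)."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout_secs=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        # After the timeout, an open breaker lets probe traffic through.
        if self.state == "open" and self.clock() - self.opened_at >= self.timeout_secs:
            self.state = "half_open"
            self.successes = 0
        return self.state != "open"

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            # A success resets the consecutive-failure count.
            self.failures = 0

    def record_failure(self):
        if self.state == "half_open":
            self._open()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0


breaker = CircuitBreaker()
for _ in range(5):
    breaker.record_failure()
print(breaker.state)
```
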
## Routing Policies
The router supports multiple routing strategies:
### 1. Random Routing
Distributes requests randomly across workers.
```bash
--policy random
```
### 2. Round-Robin Routing
Cycles through workers in order.
```bash
--policy round_robin
```
### 3. Power of Two Choices
Samples two workers and routes to the less loaded one.
```bash
--policy power_of_two
```
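
The idea behind power of two choices fits in a few lines. This is an illustrative sketch, assuming per-worker load is available as an integer (for example, in-flight requests); `pick_power_of_two` is a hypothetical name, not a router API.

```python
import random
from typing import Mapping, Optional


def pick_power_of_two(loads: Mapping[str, int],
                      rng: Optional[random.Random] = None) -> str:
    """Sample two distinct workers and return the less loaded one."""
    if not loads:
        raise ValueError("no workers available")
    rng = rng or random.Random()
    workers = list(loads)
    if len(workers) == 1:
        return workers[0]
    a, b = rng.sample(workers, 2)
    # Ties go to the first sampled worker.
    return a if loads[a] <= loads[b] else b


loads = {"http://worker1:8000": 10, "http://worker2:8001": 3}
print(pick_power_of_two(loads))
```

Comparing only two random candidates avoids scanning every worker while still steering traffic away from the busiest ones.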
### 4. Cache-Aware Load Balancing (Default)
The most sophisticated policy that combines cache optimization with load balancing:
```bash
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001
```
#### How It Works
1. **Load Assessment**: Checks if the system is balanced
   - Imbalanced if: `(max_load - min_load) > balance_abs_threshold` AND `max_load > balance_rel_threshold * min_load`
2. **Routing Decision**:
   - **Balanced System**: Uses cache-aware routing
     - Routes to worker with highest prefix match if match > `cache_threshold`
     - Otherwise routes to worker with most available cache capacity
   - **Imbalanced System**: Uses shortest queue routing to the least busy worker
3. **Cache Management**:
   - Maintains approximate radix trees per worker
   - Periodically evicts LRU entries based on `--eviction-interval` and `--max-tree-size`
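
The load assessment and routing decision above can be sketched as follows. This is a simplified model under stated assumptions: `is_imbalanced` and `choose_worker` are hypothetical names, and the per-worker inputs (`match_ratio` as the matched prefix fraction, `free_cache` as available cache capacity) stand in for what the router derives from its radix trees.

```python
from typing import Mapping


def is_imbalanced(loads: Mapping[str, int], abs_threshold: int = 32,
                  rel_threshold: float = 1.0001) -> bool:
    """Both the absolute and the relative condition must hold."""
    max_load, min_load = max(loads.values()), min(loads.values())
    return (max_load - min_load) > abs_threshold and max_load > rel_threshold * min_load


def choose_worker(loads: Mapping[str, int], match_ratio: Mapping[str, float],
                  free_cache: Mapping[str, int], cache_threshold: float = 0.5) -> str:
    if is_imbalanced(loads):
        # Imbalanced: shortest-queue routing to the least busy worker.
        return min(loads, key=loads.get)
    best = max(match_ratio, key=match_ratio.get)
    if match_ratio[best] > cache_threshold:
        # Balanced with a strong prefix match: reuse that worker's cache.
        return best
    # Balanced with a weak match: pick the worker with the most free cache.
    return max(free_cache, key=free_cache.get)


print(choose_worker({"w1": 5, "w2": 6}, {"w1": 0.8, "w2": 0.1}, {"w1": 10, "w2": 90}))
```
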
### Data Parallelism Aware Routing
Enables fine-grained control over data parallel replicas:
```bash
--dp-aware \
--api-key your_api_key # Required for worker authentication
```
This mode coordinates with SGLang's DP controller for optimized request distribution across data parallel ranks.
## Configuration Reference
### Core Settings
| Parameter | Type | Default | Description |
|-----------------------------|------|-------------|-----------------------------------------------------------------|
| `--host` | str | 127.0.0.1 | Router server host address |
| `--port` | int | 30000 | Router server port |
| `--worker-urls` | list | [] | Worker URLs for separate launch mode |
| `--policy` | str | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
| `--max-concurrent-requests` | int | 64 | Maximum concurrent requests (rate limiting) |
| `--request-timeout-secs` | int | 600 | Request timeout in seconds |
| `--max-payload-size` | int | 256MB | Maximum request payload size |
### Cache-Aware Routing Parameters
| Parameter | Type | Default | Description |
|---------------------------|-------|----------|--------------------------------------------------------|
| `--cache-threshold` | float | 0.5 | Minimum prefix match ratio for cache routing (0.0-1.0) |
| `--balance-abs-threshold` | int | 32 | Absolute load difference threshold |
| `--balance-rel-threshold` | float | 1.0001 | Relative load ratio threshold |
| `--eviction-interval` | int | 60 | Seconds between cache eviction cycles |
| `--max-tree-size` | int | 16777216 | Maximum nodes in routing tree |
### Fault Tolerance Parameters
| Parameter | Type | Default | Description |
|------------------------------|-------|---------|---------------------------------------|
| `--retry-max-retries` | int | 3 | Maximum retry attempts per request |
| `--retry-initial-backoff-ms` | int | 100 | Initial retry backoff in milliseconds |
| `--retry-max-backoff-ms` | int | 10000 | Maximum retry backoff in milliseconds |
| `--retry-backoff-multiplier` | float | 2.0 | Backoff multiplier between retries |
| `--retry-jitter-factor` | float | 0.1 | Random jitter factor for retries |
| `--disable-retries` | flag | False | Disable retry mechanism |
| `--cb-failure-threshold` | int | 5 | Failures before circuit opens |
| `--cb-success-threshold` | int | 2 | Successes to close circuit |
| `--cb-timeout-duration-secs` | int | 30 | Circuit breaker timeout duration |
| `--cb-window-duration-secs` | int | 60 | Circuit breaker window duration |
| `--disable-circuit-breaker` | flag | False | Disable circuit breaker |
### Prefill-Decode Disaggregation Parameters
| Parameter | Type | Default | Description |
|-----------------------------------|------|---------|-------------------------------------------------------|
| `--pd-disaggregation` | flag | False | Enable PD disaggregated mode |
| `--prefill` | list | [] | Prefill server URLs with optional bootstrap ports |
| `--decode` | list | [] | Decode server URLs |
| `--prefill-policy` | str | None | Routing policy for prefill nodes (overrides --policy) |
| `--decode-policy` | str | None | Routing policy for decode nodes (overrides --policy) |
| `--worker-startup-timeout-secs` | int | 300 | Timeout for worker startup |
| `--worker-startup-check-interval` | int | 10 | Interval between startup checks |
### Kubernetes Integration
| Parameter | Type | Default | Description |
|---------------------------------|------|--------------------------|------------------------------------------------------|
| `--service-discovery` | flag | False | Enable Kubernetes service discovery |
| `--selector` | list | [] | Label selector for workers (key1=value1 key2=value2) |
| `--prefill-selector` | list | [] | Label selector for prefill servers in PD mode |
| `--decode-selector` | list | [] | Label selector for decode servers in PD mode |
| `--service-discovery-port` | int | 80 | Port for discovered pods |
| `--service-discovery-namespace` | str | None | Kubernetes namespace to watch |
| `--bootstrap-port-annotation` | str | sglang.ai/bootstrap-port | Annotation for bootstrap ports |
### Observability
| Parameter | Type | Default | Description |
|------------------------|------|-----------|-------------------------------------------------------|
| `--prometheus-port` | int | 29000 | Prometheus metrics port |
| `--prometheus-host` | str | 127.0.0.1 | Prometheus metrics host |
| `--log-dir` | str | None | Directory for log files |
| `--log-level` | str | info | Logging level (debug, info, warning, error, critical) |
| `--request-id-headers` | list | None | Custom headers for request tracing |
### CORS Configuration
| Parameter | Type | Default | Description |
|--------------------------|------|---------|----------------------|
| `--cors-allowed-origins` | list | [] | Allowed CORS origins |
## Advanced Features
### Kubernetes Service Discovery
Automatically discover and manage workers in Kubernetes:
#### Standard Mode
```bash
python -m sglang_router.launch_router \
--service-discovery \
--selector app=sglang-worker env=prod \
--service-discovery-namespace production \
--service-discovery-port 8000
```
#### Prefill-Decode Disaggregation Mode
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server env=prod \
--decode-selector app=decode-server env=prod \
--service-discovery-namespace production
```
**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.
### Prometheus Metrics
Expose metrics for monitoring:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--prometheus-port 29000 \
--prometheus-host 0.0.0.0
```
Metrics are available at `http://localhost:29000/metrics`.
### Request Tracing
Enable request ID tracking:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--request-id-headers x-request-id x-trace-id
```
## Troubleshooting
### Common Issues
1. **Workers not connecting**: Ensure workers are fully initialized before starting the router. Use `--worker-startup-timeout-secs` to increase wait time.
2. **High latency**: Check if cache-aware routing is causing imbalance. Try adjusting `--balance-abs-threshold` and `--balance-rel-threshold`.
3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval` for more aggressive cache cleanup.
4. **Circuit breaker triggering frequently**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`.
### Debug Mode
Enable detailed logging:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--log-level debug \
--log-dir ./router_logs
```