[docs] update router readme (#7797)
This commit is contained in:
@@ -1,17 +1,16 @@
|
||||
# SGLang Router
|
||||
|
||||
SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
|
||||
SGLang router is a standalone Rust module that enables data parallelism across SGLang instances, providing high-performance request routing and advanced load balancing. The router supports multiple load balancing algorithms including cache-aware, power of two, random, and round robin, and acts as a specialized load balancer for prefill-decode disaggregated serving architectures.
|
||||
|
||||
## User docs
|
||||
## Documentation
|
||||
|
||||
Please check https://docs.sglang.ai/router/router.html
|
||||
- **User Guide**: [docs.sglang.ai/router/router.html](https://docs.sglang.ai/router/router.html)
|
||||
|
||||
## Developer docs
|
||||
## Quick Start
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Rust and Cargo installed
|
||||
|
||||
**Rust and Cargo:**
|
||||
```bash
|
||||
# Install rustup (Rust installer and version manager)
|
||||
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
|
||||
@@ -24,87 +23,83 @@ rustc --version
|
||||
cargo --version
|
||||
```
|
||||
|
||||
- Python with pip installed
|
||||
**Python with pip installed**
|
||||
|
||||
### Installation
|
||||
|
||||
### Build Process
|
||||
|
||||
#### 1. Build Rust Project
|
||||
|
||||
#### Option A: Build and Install Wheel (Recommended)
|
||||
```bash
|
||||
$ cargo build
|
||||
# Install build dependencies
|
||||
pip install setuptools-rust wheel build
|
||||
|
||||
# Build the wheel package
|
||||
python -m build
|
||||
|
||||
# Install the generated wheel
|
||||
pip install dist/*.whl
|
||||
|
||||
# One-liner for development (rebuild + install)
|
||||
python -m build && pip install --force-reinstall dist/*.whl
|
||||
```
|
||||
|
||||
#### 2. Build Python Binding
|
||||
|
||||
##### Option A: Build and Install Wheel
|
||||
1. Build the wheel package:
|
||||
#### Option B: Development Mode
|
||||
```bash
|
||||
$ pip install setuptools-rust wheel build
|
||||
$ python -m build
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
2. Install the generated wheel:
|
||||
```bash
|
||||
$ pip install <path-to-wheel>
|
||||
```
|
||||
⚠️ **Warning**: Editable installs may suffer performance degradation. Use wheel builds for performance testing.
|
||||
|
||||
If you want one handy command to do build + install for every change you make:
|
||||
### Basic Usage
|
||||
|
||||
```bash
|
||||
$ python -m build && pip install --force-reinstall dist/*.whl
|
||||
# Build Rust components
|
||||
cargo build
|
||||
|
||||
# Launch router with worker URLs
|
||||
python -m sglang_router.launch_router \
|
||||
--worker-urls http://worker1:8000 http://worker2:8000
|
||||
```
|
||||
|
||||
##### Option B: Development Mode
|
||||
|
||||
For development purposes, you can install the package in editable mode:
|
||||
|
||||
Warning: Using editable python binding can suffer from performance degradation!! Please build a fresh wheel for every update if you want to test performance.
|
||||
|
||||
```bash
|
||||
$ pip install -e .
|
||||
```
|
||||
|
||||
**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
|
||||
## Configuration
|
||||
|
||||
### Logging
|
||||
|
||||
The SGL Router includes structured logging with console output by default. To enable log files:
|
||||
Enable structured logging with optional file output:
|
||||
|
||||
```python
|
||||
# Enable file logging when creating a router
|
||||
from sglang_router import Router
|
||||
|
||||
# Console logging (default)
|
||||
router = Router(worker_urls=["http://worker1:8000", "http://worker2:8000"])
|
||||
|
||||
# File logging enabled
|
||||
router = Router(
|
||||
worker_urls=["http://worker1:8000", "http://worker2:8000"],
|
||||
log_dir="./logs" # Daily log files will be created here
|
||||
log_dir="./logs" # Daily log files created here
|
||||
)
|
||||
```
|
||||
|
||||
Use the `--log-level` flag with the CLI to set [log level](https://docs.sglang.ai/backend/server_arguments.html#logging).
|
||||
Set log level with `--log-level` flag ([documentation](https://docs.sglang.ai/backend/server_arguments.html#logging)).
|
||||
|
||||
### Metrics
|
||||
|
||||
SGL Router exposes a Prometheus HTTP scrape endpoint for monitoring, which by default listens at 127.0.0.1:29000.
|
||||
Prometheus metrics endpoint available at `127.0.0.1:29000` by default.
|
||||
|
||||
To change the endpoint to listen on all network interfaces and set the port to 9000, configure the following options when launching the router:
|
||||
```
|
||||
```bash
|
||||
# Custom metrics configuration
|
||||
python -m sglang_router.launch_router \
|
||||
--worker-urls http://localhost:8080 http://localhost:8081 \
|
||||
--prometheus-host 0.0.0.0 \
|
||||
--prometheus-port 9000
|
||||
--worker-urls http://localhost:8080 http://localhost:8081 \
|
||||
--prometheus-host 0.0.0.0 \
|
||||
--prometheus-port 9000
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Kubernetes Service Discovery
|
||||
|
||||
SGL Router supports automatic service discovery for worker nodes in Kubernetes environments. This feature works with both regular (single-server) routing and PD (Prefill-Decode) routing modes. When enabled, the router will automatically:
|
||||
Automatic worker discovery and management in Kubernetes environments.
|
||||
|
||||
- Discover and add worker pods with matching labels
|
||||
- Remove unhealthy or deleted worker pods
|
||||
- Dynamically adjust the worker pool based on pod health and availability
|
||||
- For PD mode: distinguish between prefill and decode servers based on labels
|
||||
|
||||
#### Regular Mode Service Discovery
|
||||
|
||||
For traditional single-server routing:
|
||||
#### Basic Service Discovery
|
||||
|
||||
```bash
|
||||
python -m sglang_router.launch_router \
|
||||
@@ -113,9 +108,9 @@ python -m sglang_router.launch_router \
|
||||
--service-discovery-namespace default
|
||||
```
|
||||
|
||||
#### PD Mode Service Discovery
|
||||
#### PD (Prefill-Decode) Mode
|
||||
|
||||
For PD (Prefill-Decode) disaggregated routing, service discovery can automatically discover and classify pods as either prefill or decode servers based on their labels:
|
||||
For disaggregated prefill/decode routing:
|
||||
|
||||
```bash
|
||||
python -m sglang_router.launch_router \
|
||||
@@ -127,23 +122,7 @@ python -m sglang_router.launch_router \
|
||||
--service-discovery-namespace sglang-system
|
||||
```
|
||||
|
||||
You can also specify initial prefill and decode servers and let service discovery add more:
|
||||
|
||||
```bash
|
||||
python -m sglang_router.launch_router \
|
||||
--pd-disaggregation \
|
||||
--policy cache_aware \
|
||||
--prefill http://prefill-1:8000 8001 \
|
||||
--decode http://decode-1:8000 \
|
||||
--service-discovery \
|
||||
--prefill-selector app=sglang component=prefill \
|
||||
--decode-selector app=sglang component=decode \
|
||||
--service-discovery-namespace sglang-system
|
||||
```
|
||||
|
||||
#### Kubernetes Pod Configuration for PD Mode
|
||||
|
||||
When using PD service discovery, your Kubernetes pods need specific labels to be classified as prefill or decode servers:
|
||||
#### Kubernetes Pod Configuration
|
||||
|
||||
**Prefill Server Pod:**
|
||||
```yaml
|
||||
@@ -155,15 +134,14 @@ metadata:
|
||||
app: sglang
|
||||
component: prefill
|
||||
annotations:
|
||||
sglang.ai/bootstrap-port: "9001" # Optional: Bootstrap port for Mooncake prefill coordination
|
||||
sglang.ai/bootstrap-port: "9001" # Optional: Bootstrap port
|
||||
spec:
|
||||
containers:
|
||||
- name: sglang
|
||||
image: lmsys/sglang:latest
|
||||
ports:
|
||||
- containerPort: 8000 # Main API port
|
||||
- containerPort: 9001 # Optional: Bootstrap coordination port
|
||||
# ... rest of configuration
|
||||
- containerPort: 9001 # Optional: Bootstrap port
|
||||
```
|
||||
|
||||
**Decode Server Pod:**
|
||||
@@ -180,38 +158,10 @@ spec:
|
||||
- name: sglang
|
||||
image: lmsys/sglang:latest
|
||||
ports:
|
||||
- containerPort: 8000 # Main API port
|
||||
# ... rest of configuration
|
||||
- containerPort: 8000
|
||||
```
|
||||
|
||||
**Key Requirements:**
|
||||
- Prefill pods must have labels matching your `--prefill-selector`
|
||||
- Decode pods must have labels matching your `--decode-selector`
|
||||
- Prefill pods can optionally include bootstrap port in annotations using `sglang.ai/bootstrap-port` (defaults to None if not specified)
|
||||
|
||||
#### Service Discovery Arguments
|
||||
|
||||
**General Arguments:**
|
||||
- `--service-discovery`: Enable Kubernetes service discovery feature
|
||||
- `--service-discovery-port`: Port to use when generating worker URLs (default: 8000)
|
||||
- `--service-discovery-namespace`: Optional. Kubernetes namespace to watch for pods. If not provided, watches all namespaces (requires cluster-wide permissions)
|
||||
- `--selector`: One or more label key-value pairs for pod selection in regular mode (format: key1=value1 key2=value2)
|
||||
|
||||
**PD Mode Arguments:**
|
||||
- `--pd-disaggregation`: Enable PD (Prefill-Decode) disaggregated mode
|
||||
- `--prefill`: Specify initial prefill server URL and bootstrap port (format: URL BOOTSTRAP_PORT, can be used multiple times)
|
||||
- `--decode`: Specify initial decode server URL (can be used multiple times)
|
||||
- `--prefill-selector`: Label selector for prefill server pods in PD mode (format: key1=value1 key2=value2)
|
||||
- `--decode-selector`: Label selector for decode server pods in PD mode (format: key1=value1 key2=value2)
|
||||
- `--policy`: Routing policy (cache_aware, random, power_of_two - note: power_of_two only works in PD mode)
|
||||
|
||||
**Notes:**
|
||||
- Bootstrap port annotation is automatically set to `sglang.ai/bootstrap-port` for Mooncake deployments
|
||||
- Advanced cache tuning parameters use sensible defaults and are not exposed via CLI
|
||||
|
||||
#### RBAC Requirements
|
||||
|
||||
When using service discovery, you must configure proper Kubernetes RBAC permissions:
|
||||
#### RBAC Configuration
|
||||
|
||||
**Namespace-scoped (recommended):**
|
||||
```yaml
|
||||
@@ -246,43 +196,9 @@ roleRef:
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
```
|
||||
|
||||
**Cluster-wide (if watching all namespaces):**
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: sglang-router
|
||||
namespace: sglang-system
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRole
|
||||
metadata:
|
||||
name: sglang-router
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources: ["pods"]
|
||||
verbs: ["get", "list", "watch"]
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRoleBinding
|
||||
metadata:
|
||||
name: sglang-router
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: sglang-router
|
||||
namespace: sglang-system
|
||||
roleRef:
|
||||
kind: ClusterRole
|
||||
name: sglang-router
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
```
|
||||
|
||||
#### Complete Example: PD Mode with Service Discovery
|
||||
|
||||
Here's a complete example of running SGLang Router with PD mode and service discovery:
|
||||
#### Complete PD Example
|
||||
|
||||
```bash
|
||||
# Start the router with PD mode and automatic prefill/decode discovery
|
||||
python -m sglang_router.launch_router \
|
||||
--pd-disaggregation \
|
||||
--policy cache_aware \
|
||||
@@ -296,42 +212,70 @@ python -m sglang_router.launch_router \
|
||||
--prometheus-port 9090
|
||||
```
|
||||
|
||||
This setup will:
|
||||
1. Enable PD (Prefill-Decode) disaggregated routing mode with automatic pod classification
|
||||
2. Watch for pods in the `production` namespace
|
||||
3. Automatically add prefill servers with labels `app=sglang`, `component=prefill`, `environment=production`
|
||||
4. Automatically add decode servers with labels `app=sglang`, `component=decode`, `environment=production`
|
||||
5. Extract bootstrap ports from the `sglang.ai/bootstrap-port` annotation on prefill pods
|
||||
6. Use cache-aware load balancing for optimal performance
|
||||
7. Expose the router API on port 8080 and metrics on port 9090
|
||||
### Command Line Arguments Reference
|
||||
|
||||
**Note:** In PD mode with service discovery, pods MUST match either the prefill or decode selector to be added. Pods that don't match either selector are ignored.
|
||||
#### Service Discovery
|
||||
- `--service-discovery`: Enable Kubernetes service discovery
|
||||
- `--service-discovery-port`: Port for worker URLs (default: 8000)
|
||||
- `--service-discovery-namespace`: Kubernetes namespace to watch
|
||||
- `--selector`: Label selectors for regular mode (format: `key1=value1 key2=value2`)
|
||||
|
||||
#### PD Mode
|
||||
- `--pd-disaggregation`: Enable Prefill-Decode disaggregated mode
|
||||
- `--prefill`: Initial prefill server (format: `URL BOOTSTRAP_PORT`)
|
||||
- `--decode`: Initial decode server URL
|
||||
- `--prefill-selector`: Label selector for prefill pods
|
||||
- `--decode-selector`: Label selector for decode pods
|
||||
- `--policy`: Routing policy (`cache_aware`, `random`, `power_of_two`)
|
||||
|
||||
## Development
|
||||
|
||||
### Build Process
|
||||
|
||||
```bash
|
||||
# Build Rust project
|
||||
cargo build
|
||||
|
||||
# Build Python binding (see Installation section above)
|
||||
```
|
||||
|
||||
**Note**: When modifying Rust code, you must rebuild the wheel for changes to take effect.
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
1. If rust analyzer is not working in VSCode, set `rust-analyzer.linkedProjects` to the absolute path of `Cargo.toml` in your repo. For example:
|
||||
**VSCode Rust Analyzer Issues:**
|
||||
Set `rust-analyzer.linkedProjects` to the absolute path of `Cargo.toml`:
|
||||
|
||||
```json
|
||||
{
|
||||
"rust-analyzer.linkedProjects": ["/workspaces/sglang/sgl-router/Cargo.toml"]
|
||||
"rust-analyzer.linkedProjects": ["/workspaces/sglang/sgl-router/Cargo.toml"]
|
||||
}
|
||||
```
|
||||
|
||||
### CI/CD Setup
|
||||
### CI/CD Pipeline
|
||||
|
||||
The continuous integration pipeline consists of three main steps:
|
||||
The continuous integration pipeline includes comprehensive testing, benchmarking, and publishing:
|
||||
|
||||
#### 1. Build Wheels
|
||||
- Uses `cibuildwheel` to create manylinux x86_64 packages
|
||||
- Compatible with major Linux distributions (Ubuntu, CentOS, etc.)
|
||||
- Additional configurations can be added to support other OS/architectures
|
||||
- Reference: [cibuildwheel documentation](https://cibuildwheel.pypa.io/en/stable/)
|
||||
#### Build & Test
|
||||
1. **Build Wheels**: Uses `cibuildwheel` for manylinux x86_64 packages
|
||||
2. **Build Source Distribution**: Creates source distribution for pip fallback
|
||||
3. **Rust HTTP Server Benchmarking**: Performance testing of router overhead
|
||||
4. **Basic Inference Testing**: End-to-end validation through the router
|
||||
5. **PD Disaggregation Testing**: Benchmark and sanity checks for prefill-decode load balancing
|
||||
|
||||
#### 2. Build Source Distribution
|
||||
- Creates a source distribution containing the raw, unbuilt code
|
||||
- Enables `pip` to build the package from source when prebuilt wheels are unavailable
|
||||
#### Publishing
|
||||
- **PyPI Publishing**: Wheels and source distributions are published only when the version changes in `pyproject.toml`
|
||||
- **Container Images**: Docker images published using `/docker/Dockerfile.router`
|
||||
|
||||
#### 3. Publish to PyPI
|
||||
- Uploads both wheels and source distribution to PyPI
|
||||
## Features
|
||||
|
||||
The CI configuration is based on the [tiktoken workflow](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/.github/workflows/build_wheels.yml#L1).
|
||||
- **High Performance**: Rust-based routing with connection pooling and optimized request handling
|
||||
- **Advanced Load Balancing**: Multiple algorithms including:
|
||||
- **Cache-Aware**: Intelligent routing based on cache locality for optimal performance
|
||||
- **Power of Two**: Chooses the less loaded of two randomly selected workers
|
||||
- **Random**: Distributes requests randomly across available workers
|
||||
- **Round Robin**: Sequential distribution across workers in rotation
|
||||
- **Prefill-Decode Disaggregation**: Specialized load balancing for separated prefill and decode servers
|
||||
- **Service Discovery**: Automatic Kubernetes worker discovery and health management
|
||||
- **Monitoring**: Comprehensive Prometheus metrics and structured logging
|
||||
- **Scalability**: Handles thousands of concurrent connections with efficient resource utilization
|
||||
|
||||
Reference in New Issue
Block a user