From 35724aa182130dad3f4e1741efd4e0f924eef9bc Mon Sep 17 00:00:00 2001
From: Simo Lin
Date: Sun, 6 Jul 2025 22:54:11 -0700
Subject: [PATCH] [docs] update router readme (#7797)

---
 sgl-router/README.md | 274 +++++++++++++++++--------------------------
 1 file changed, 109 insertions(+), 165 deletions(-)

diff --git a/sgl-router/README.md b/sgl-router/README.md
index 5c1ef12cd..c899a6f59 100644
--- a/sgl-router/README.md
+++ b/sgl-router/README.md
@@ -1,17 +1,16 @@
 # SGLang Router

-SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
+SGLang router is a standalone Rust module that enables data parallelism across SGLang instances, providing high-performance request routing and advanced load balancing. The router supports multiple load balancing algorithms including cache-aware, power of two, random, and round robin, and acts as a specialized load balancer for prefill-decode disaggregated serving architectures.

-## User docs
+## Documentation

-Please check https://docs.sglang.ai/router/router.html
+- **User Guide**: [docs.sglang.ai/router/router.html](https://docs.sglang.ai/router/router.html)

-## Developer docs
+## Quick Start

 ### Prerequisites

-- Rust and Cargo installed
-
+**Rust and Cargo:**
 ```bash
 # Install rustup (Rust installer and version manager)
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
@@ -24,87 +23,83 @@ rustc --version
 cargo --version
 ```

-- Python with pip installed
+**Python with pip installed**

+### Installation

-### Build Process
-
-#### 1. Build Rust Project
-
+#### Option A: Build and Install Wheel (Recommended)
 ```bash
-$ cargo build
+# Install build dependencies
+pip install setuptools-rust wheel build
+
+# Build the wheel package
+python -m build
+
+# Install the generated wheel
+pip install dist/*.whl
+
+# One-liner for development (rebuild + install)
+python -m build && pip install --force-reinstall dist/*.whl
 ```

-#### 2. Build Python Binding
-
-##### Option A: Build and Install Wheel
-1. Build the wheel package:
+#### Option B: Development Mode
 ```bash
-$ pip install setuptools-rust wheel build
-$ python -m build
+pip install -e .
 ```

-2. Install the generated wheel:
-```bash
-$ pip install
-```
+⚠️ **Warning**: Editable installs may suffer performance degradation. Use wheel builds for performance testing.

-If you want one handy command to do build + install for every change you make:
+### Basic Usage
 ```bash
-$ python -m build && pip install --force-reinstall dist/*.whl
+# Build Rust components
+cargo build
+
+# Launch router with worker URLs
+python -m sglang_router.launch_router \
+    --worker-urls http://worker1:8000 http://worker2:8000
 ```

-##### Option B: Development Mode
-
-For development purposes, you can install the package in editable mode:
-
-Warning: Using editable python binding can suffer from performance degradation!! Please build a fresh wheel for every update if you want to test performance.
-
-```bash
-$ pip install -e .
-```
-
-**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
+## Configuration

 ### Logging

-The SGL Router includes structured logging with console output by default. To enable log files:
+Enable structured logging with optional file output:

 ```python
-# Enable file logging when creating a router
+from sglang_router import Router
+
+# Console logging (default)
+router = Router(worker_urls=["http://worker1:8000", "http://worker2:8000"])
+
+# File logging enabled
 router = Router(
     worker_urls=["http://worker1:8000", "http://worker2:8000"],
-    log_dir="./logs"  # Daily log files will be created here
+    log_dir="./logs"  # Daily log files created here
 )
 ```

-Use the `--log-level` flag with the CLI to set [log level](https://docs.sglang.ai/backend/server_arguments.html#logging).
+Set log level with `--log-level` flag ([documentation](https://docs.sglang.ai/backend/server_arguments.html#logging)).
 ### Metrics

-SGL Router exposes a Prometheus HTTP scrape endpoint for monitoring, which by default listens at 127.0.0.1:29000.
+Prometheus metrics endpoint available at `127.0.0.1:29000` by default.

-To change the endpoint to listen on all network interfaces and set the port to 9000, configure the following options when launching the router:
-```
+```bash
+# Custom metrics configuration
 python -m sglang_router.launch_router \
-  --worker-urls http://localhost:8080 http://localhost:8081 \
-  --prometheus-host 0.0.0.0 \
-  --prometheus-port 9000
+    --worker-urls http://localhost:8080 http://localhost:8081 \
+    --prometheus-host 0.0.0.0 \
+    --prometheus-port 9000
 ```

+## Advanced Features
+
 ### Kubernetes Service Discovery

-SGL Router supports automatic service discovery for worker nodes in Kubernetes environments. This feature works with both regular (single-server) routing and PD (Prefill-Decode) routing modes. When enabled, the router will automatically:
+Automatic worker discovery and management in Kubernetes environments.
-- Discover and add worker pods with matching labels
-- Remove unhealthy or deleted worker pods
-- Dynamically adjust the worker pool based on pod health and availability
-- For PD mode: distinguish between prefill and decode servers based on labels
-
-#### Regular Mode Service Discovery
-
-For traditional single-server routing:
+#### Basic Service Discovery

 ```bash
 python -m sglang_router.launch_router \
@@ -113,9 +108,9 @@ python -m sglang_router.launch_router \
     --service-discovery-namespace default
 ```

-#### PD Mode Service Discovery
+#### PD (Prefill-Decode) Mode

-For PD (Prefill-Decode) disaggregated routing, service discovery can automatically discover and classify pods as either prefill or decode servers based on their labels:
+For disaggregated prefill/decode routing:

 ```bash
 python -m sglang_router.launch_router \
@@ -127,23 +122,7 @@ python -m sglang_router.launch_router \
     --service-discovery-namespace sglang-system
 ```

-You can also specify initial prefill and decode servers and let service discovery add more:
-
-```bash
-python -m sglang_router.launch_router \
-    --pd-disaggregation \
-    --policy cache_aware \
-    --prefill http://prefill-1:8000 8001 \
-    --decode http://decode-1:8000 \
-    --service-discovery \
-    --prefill-selector app=sglang component=prefill \
-    --decode-selector app=sglang component=decode \
-    --service-discovery-namespace sglang-system
-```
-
-#### Kubernetes Pod Configuration for PD Mode
-
-When using PD service discovery, your Kubernetes pods need specific labels to be classified as prefill or decode servers:
+#### Kubernetes Pod Configuration

 **Prefill Server Pod:**
 ```yaml
@@ -155,15 +134,14 @@ metadata:
     app: sglang
     component: prefill
   annotations:
-    sglang.ai/bootstrap-port: "9001"  # Optional: Bootstrap port for Mooncake prefill coordination
+    sglang.ai/bootstrap-port: "9001"  # Optional: Bootstrap port
 spec:
   containers:
   - name: sglang
     image: lmsys/sglang:latest
     ports:
     - containerPort: 8000  # Main API port
-    - containerPort: 9001  # Optional: Bootstrap coordination port
-    # ... rest of configuration
+    - containerPort: 9001  # Optional: Bootstrap port
 ```

 **Decode Server Pod:**
@@ -180,38 +158,10 @@ spec:
   - name: sglang
     image: lmsys/sglang:latest
     ports:
-    - containerPort: 8000  # Main API port
-    # ... rest of configuration
+    - containerPort: 8000
 ```

-**Key Requirements:**
-- Prefill pods must have labels matching your `--prefill-selector`
-- Decode pods must have labels matching your `--decode-selector`
-- Prefill pods can optionally include bootstrap port in annotations using `sglang.ai/bootstrap-port` (defaults to None if not specified)
-
-#### Service Discovery Arguments
-
-**General Arguments:**
-- `--service-discovery`: Enable Kubernetes service discovery feature
-- `--service-discovery-port`: Port to use when generating worker URLs (default: 8000)
-- `--service-discovery-namespace`: Optional. Kubernetes namespace to watch for pods. If not provided, watches all namespaces (requires cluster-wide permissions)
-- `--selector`: One or more label key-value pairs for pod selection in regular mode (format: key1=value1 key2=value2)
-
-**PD Mode Arguments:**
-- `--pd-disaggregation`: Enable PD (Prefill-Decode) disaggregated mode
-- `--prefill`: Specify initial prefill server URL and bootstrap port (format: URL BOOTSTRAP_PORT, can be used multiple times)
-- `--decode`: Specify initial decode server URL (can be used multiple times)
-- `--prefill-selector`: Label selector for prefill server pods in PD mode (format: key1=value1 key2=value2)
-- `--decode-selector`: Label selector for decode server pods in PD mode (format: key1=value1 key2=value2)
-- `--policy`: Routing policy (cache_aware, random, power_of_two - note: power_of_two only works in PD mode)
-
-**Notes:**
-- Bootstrap port annotation is automatically set to `sglang.ai/bootstrap-port` for Mooncake deployments
-- Advanced cache tuning parameters use sensible defaults and are not exposed via CLI
-
-#### RBAC Requirements
-
-When using service discovery, you must configure proper Kubernetes RBAC permissions:
+#### RBAC Configuration

 **Namespace-scoped (recommended):**
 ```yaml
@@ -246,43 +196,9 @@ roleRef:
   apiGroup: rbac.authorization.k8s.io
 ```

-**Cluster-wide (if watching all namespaces):**
-```yaml
-apiVersion: v1
-kind: ServiceAccount
-metadata:
-  name: sglang-router
-  namespace: sglang-system
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRole
-metadata:
-  name: sglang-router
-rules:
-- apiGroups: [""]
-  resources: ["pods"]
-  verbs: ["get", "list", "watch"]
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
-  name: sglang-router
-subjects:
-- kind: ServiceAccount
-  name: sglang-router
-  namespace: sglang-system
-roleRef:
-  kind: ClusterRole
-  name: sglang-router
-  apiGroup: rbac.authorization.k8s.io
-```
-
-#### Complete Example: PD Mode with Service Discovery
-
-Here's a complete example of running SGLang Router with PD mode and service discovery:
+#### Complete PD Example

 ```bash
-# Start the router with PD mode and automatic prefill/decode discovery
 python -m sglang_router.launch_router \
     --pd-disaggregation \
     --policy cache_aware \
@@ -296,42 +212,70 @@ python -m sglang_router.launch_router \
     --prometheus-port 9090
 ```

-This setup will:
-1. Enable PD (Prefill-Decode) disaggregated routing mode with automatic pod classification
-2. Watch for pods in the `production` namespace
-3. Automatically add prefill servers with labels `app=sglang`, `component=prefill`, `environment=production`
-4. Automatically add decode servers with labels `app=sglang`, `component=decode`, `environment=production`
-5. Extract bootstrap ports from the `sglang.ai/bootstrap-port` annotation on prefill pods
-6. Use cache-aware load balancing for optimal performance
-7. Expose the router API on port 8080 and metrics on port 9090
+### Command Line Arguments Reference

-**Note:** In PD mode with service discovery, pods MUST match either the prefill or decode selector to be added. Pods that don't match either selector are ignored.
+#### Service Discovery
+- `--service-discovery`: Enable Kubernetes service discovery
+- `--service-discovery-port`: Port for worker URLs (default: 8000)
+- `--service-discovery-namespace`: Kubernetes namespace to watch
+- `--selector`: Label selectors for regular mode (format: `key1=value1 key2=value2`)
+
+#### PD Mode
+- `--pd-disaggregation`: Enable Prefill-Decode disaggregated mode
+- `--prefill`: Initial prefill server (format: `URL BOOTSTRAP_PORT`)
+- `--decode`: Initial decode server URL
+- `--prefill-selector`: Label selector for prefill pods
+- `--decode-selector`: Label selector for decode pods
+- `--policy`: Routing policy (`cache_aware`, `random`, `power_of_two`)
+
+## Development
+
+### Build Process
+
+```bash
+# Build Rust project
+cargo build
+
+# Build Python binding (see Installation section above)
+```
+
+**Note**: When modifying Rust code, you must rebuild the wheel for changes to take effect.

 ### Troubleshooting

-1. If rust analyzer is not working in VSCode, set `rust-analyzer.linkedProjects` to the absolute path of `Cargo.toml` in your repo. For example:
+**VSCode Rust Analyzer Issues:**
+Set `rust-analyzer.linkedProjects` to the absolute path of `Cargo.toml`:

 ```json
 {
-  "rust-analyzer.linkedProjects": ["/workspaces/sglang/sgl-router/Cargo.toml"]
+    "rust-analyzer.linkedProjects": ["/workspaces/sglang/sgl-router/Cargo.toml"]
 }
 ```

-### CI/CD Setup
+### CI/CD Pipeline

-The continuous integration pipeline consists of three main steps:
+The continuous integration pipeline includes comprehensive testing, benchmarking, and publishing:

-#### 1. Build Wheels
-- Uses `cibuildwheel` to create manylinux x86_64 packages
-- Compatible with major Linux distributions (Ubuntu, CentOS, etc.)
-- Additional configurations can be added to support other OS/architectures
-- Reference: [cibuildwheel documentation](https://cibuildwheel.pypa.io/en/stable/)
+#### Build & Test
+1. **Build Wheels**: Uses `cibuildwheel` for manylinux x86_64 packages
+2. **Build Source Distribution**: Creates source distribution for pip fallback
+3. **Rust HTTP Server Benchmarking**: Performance testing of router overhead
+4. **Basic Inference Testing**: End-to-end validation through the router
+5. **PD Disaggregation Testing**: Benchmark and sanity checks for prefill-decode load balancing

-#### 2. Build Source Distribution
-- Creates a source distribution containing the raw, unbuilt code
-- Enables `pip` to build the package from source when prebuilt wheels are unavailable
+#### Publishing
+- **PyPI Publishing**: Wheels and source distributions are published only when the version changes in `pyproject.toml`
+- **Container Images**: Docker images published using `/docker/Dockerfile.router`

-#### 3. Publish to PyPI
-- Uploads both wheels and source distribution to PyPI
+## Features

-The CI configuration is based on the [tiktoken workflow](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/.github/workflows/build_wheels.yml#L1).
+- **High Performance**: Rust-based routing with connection pooling and optimized request handling
+- **Advanced Load Balancing**: Multiple algorithms including:
+  - **Cache-Aware**: Intelligent routing based on cache locality for optimal performance
+  - **Power of Two**: Chooses the less loaded of two randomly selected workers
+  - **Random**: Distributes requests randomly across available workers
+  - **Round Robin**: Sequential distribution across workers in rotation
+- **Prefill-Decode Disaggregation**: Specialized load balancing for separated prefill and decode servers
+- **Service Discovery**: Automatic Kubernetes worker discovery and health management
+- **Monitoring**: Comprehensive Prometheus metrics and structured logging
+- **Scalability**: Handles thousands of concurrent connections with efficient resource utilization
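
Note on the load-balancing policies listed under Features: the minimal Python sketch below illustrates how random, round-robin, and power-of-two selection behave as the README describes them. It is not the router's Rust implementation. The names `pick_random`, `RoundRobinPolicy`, `pick_power_of_two`, and the `load` mapping (in-flight requests per worker) are hypothetical and exist only for illustration. Cache-aware routing is left out because the patch does not describe its mechanics.

```python
import itertools
import random
from typing import Dict, List


def pick_random(workers: List[str], rng: random.Random) -> str:
    """Random policy: choose any available worker uniformly."""
    return rng.choice(workers)


class RoundRobinPolicy:
    """Round-robin policy: hand out workers in a fixed rotation."""

    def __init__(self, workers: List[str]) -> None:
        # Copy the list so later changes by the caller do not affect rotation.
        self._cycle = itertools.cycle(list(workers))

    def pick(self) -> str:
        return next(self._cycle)


def pick_power_of_two(
    workers: List[str], load: Dict[str, int], rng: random.Random
) -> str:
    """Power-of-two policy: sample two distinct workers, keep the less loaded.

    `load` is a hypothetical map of worker URL to in-flight request count;
    workers missing from the map are treated as idle.
    """
    if len(workers) < 2:
        return workers[0]
    first, second = rng.sample(workers, 2)
    return first if load.get(first, 0) <= load.get(second, 0) else second
```

For example, with `load = {"http://worker1:8000": 5, "http://worker2:8000": 0}`, `pick_power_of_two` returns `http://worker2:8000` whenever both workers are sampled. Comparing only two candidates keeps each decision cheap while still steering traffic away from the busiest workers.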