From 1ce30dd13ef99ee7671251067adb1758a9116650 Mon Sep 17 00:00:00 2001 From: Simo Lin Date: Tue, 12 Aug 2025 13:16:34 -0700 Subject: [PATCH] [router] update router documentation (#9121) --- docs/advanced_features/pd_disaggregation.md | 4 + docs/advanced_features/router.md | 442 ++++++++++++++++---- 2 files changed, 361 insertions(+), 85 deletions(-) diff --git a/docs/advanced_features/pd_disaggregation.md b/docs/advanced_features/pd_disaggregation.md index b7a384c4c..f7cc0adaf 100644 --- a/docs/advanced_features/pd_disaggregation.md +++ b/docs/advanced_features/pd_disaggregation.md @@ -17,6 +17,10 @@ For the design details, please refer to [link](https://docs.google.com/document/ Currently, we support Mooncake and NIXL as the transfer engine. +## Router Integration + +For deploying PD disaggregation at scale with load balancing and fault tolerance, SGLang provides a router. The router can distribute requests between prefill and decode instances using various routing policies. For detailed information on setting up routing with PD disaggregation, including configuration options and deployment patterns, see the [SGLang Router documentation](router.md#mode-3-prefill-decode-disaggregation). + ## Mooncake ### Requirements diff --git a/docs/advanced_features/router.md b/docs/advanced_features/router.md index 7339144fa..555a0bc4b 100644 --- a/docs/advanced_features/router.md +++ b/docs/advanced_features/router.md @@ -1,8 +1,16 @@ -# Router for Data Parallelism +# SGLang Router -Given multiple GPUs running multiple SGLang Runtimes, SGLang Router distributes the requests to different Runtimes with its unique cache-aware load-balancing algorithm. +The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation. 
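
To a client, the router presents a single OpenAI-compatible endpoint. As a minimal sketch (the helper names below are hypothetical, and the router is assumed to listen on the default port 30000):

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:30000"  # assumed router address

def build_chat_request(model: str, prompt: str):
    """Build an OpenAI-compatible chat request targeting the router."""
    url = f"{ROUTER_URL}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, payload

def send(url: str, payload: dict) -> dict:
    """POST the request to a running router (network side effect)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

url, payload = build_chat_request(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", "What is the capital of France?"
)
# send(url, payload)  # uncomment with a router running at ROUTER_URL
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries can also be pointed at the router unchanged.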
-The router is an independent Python package, and it can be used as a drop-in replacement for the OpenAI API. +## Key Features + +- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution +- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies +- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation +- **Dynamic Scaling**: Add or remove workers at runtime without service interruption +- **Kubernetes Integration**: Native service discovery and pod management +- **Prefill-Decode Disaggregation**: Support for disaggregated serving load balancing +- **Prometheus Metrics**: Built-in observability and monitoring ## Installation @@ -10,164 +18,428 @@ The router is an independent Python package, and it can be used as a drop-in rep pip install sglang-router ``` -Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/sgl-router/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/sgl-router/py_src/sglang_router/launch_server.py). Also, you can directly run the following command to see the usage of the router. +## Quick Start + +To see all available options: ```bash -python -m sglang_router.launch_server --help -python -m sglang_router.launch_router --help +python -m sglang_router.launch_server --help # Co-launch router and workers +python -m sglang_router.launch_router --help # Launch router only ``` -The router supports two working modes: +## Deployment Modes -1. Co-launch Router and Runtimes -2. Launch Runtimes and Router separately +The router supports three primary deployment patterns: -## Co-launch Router and Runtimes +1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments) +2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups) +3. 
**Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving -This will be a drop-in replacement for the existing `--dp-size` argument of SGLang Runtime. Under the hood, it uses multi-processes to launch multiple workers, wait for them to be ready, then connect the router to all workers. +### Mode 1: Co-launch Router and Workers + +This mode launches both the router and multiple worker instances in a single command. It's the simplest deployment option and replaces the `--dp-size` argument of SGLang Runtime. ```bash -python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 4 --host 0.0.0.0 +# Launch router with 4 workers +python -m sglang_router.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --dp-size 4 \ + --host 0.0.0.0 \ + --port 30000 ``` -After the server is ready, you can directly send requests to the router as the same way as sending requests to each single worker. +#### Sending Requests -Please adjust the batchsize accordingly to achieve maximum throughput. +Once the server is ready, send requests to the router endpoint: ```python import requests +# Using the /generate endpoint url = "http://localhost:30000/generate" -data = {"text": "What is the capital of France?"} +data = { + "text": "What is the capital of France?", + "sampling_params": { + "temperature": 0.7, + "max_new_tokens": 100 + } +} + +response = requests.post(url, json=data) +print(response.json()) + +# OpenAI-compatible endpoint +url = "http://localhost:30000/v1/chat/completions" +data = { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "messages": [{"role": "user", "content": "What is the capital of France?"}] +} response = requests.post(url, json=data) print(response.json()) ``` -## Launch Runtimes and Router Separately +### Mode 2: Separate Launch Mode -This is useful for multi-node DP. 
First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers. +This mode is ideal for multi-node deployments where workers run on different machines. + +#### Step 1: Launch Workers + +On each worker node: ```bash -python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2 +# Worker node 1 +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --host 0.0.0.0 \ + --port 8000 + +# Worker node 2 +python -m sglang.launch_server \ + --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ + --host 0.0.0.0 \ + --port 8001 ``` -## Dynamic Scaling APIs +#### Step 2: Launch Router -We offer `/add_worker` and `/remove_worker` APIs to dynamically add or remove workers from the router. - -- `/add_worker` - -Usage: +On the router node: ```bash -curl -X POST http://localhost:30000/add_worker?url=http://worker_url_1 +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --host 0.0.0.0 \ + --port 30000 \ + --policy cache_aware # or random, round_robin, power_of_two ``` -Example: +### Mode 3: Prefill-Decode Disaggregation + +This advanced mode separates prefill and decode operations for optimized performance: ```bash -python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001 - -curl -X POST http://localhost:30000/add_worker?url=http://127.0.0.1:30001 - -# Successfully added worker: http://127.0.0.1:30001 +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --prefill http://prefill1:8000 9000 \ + --prefill http://prefill2:8001 9001 \ + --decode http://decode1:8002 \ + --decode http://decode2:8003 \ + --prefill-policy cache_aware \ + --decode-policy round_robin ``` -- `/remove_worker` +#### Understanding --prefill Arguments -Usage: +The `--prefill` flag accepts URLs with optional bootstrap ports: +- `--prefill http://server:8000` - No bootstrap port +- `--prefill 
http://server:8000 9000` - Bootstrap port 9000
+- `--prefill http://server:8000 none` - Explicitly no bootstrap port
+
+#### Policy Inheritance in PD Mode
+
+The router intelligently handles policy configuration for prefill and decode nodes:
+
+1. **Only `--policy` specified**: Both prefill and decode nodes use this policy
+2. **`--policy` and `--prefill-policy` specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--policy`
+3. **`--policy` and `--decode-policy` specified**: Prefill nodes use `--policy`, decode nodes use `--decode-policy`
+4. **All three specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--decode-policy` (main `--policy` is ignored)
+
+Example with mixed policies:
+```bash
+python -m sglang_router.launch_router \
+    --pd-disaggregation \
+    --prefill http://prefill1:8000 \
+    --prefill http://prefill2:8000 \
+    --decode http://decode1:8001 \
+    --decode http://decode2:8001 \
+    --policy round_robin \
+    --prefill-policy cache_aware  # Prefill uses cache_aware and decode uses round_robin from --policy
+```
+
+#### PD Mode with Service Discovery
+
+For Kubernetes deployments with separate prefill and decode server pools:
 
 ```bash
-curl -X POST http://localhost:30000/remove_worker?url=http://worker_url_1
+python -m sglang_router.launch_router \
+    --pd-disaggregation \
+    --service-discovery \
+    --prefill-selector app=prefill-server tier=gpu \
+    --decode-selector app=decode-server tier=cpu \
+    --service-discovery-namespace production \
+    --prefill-policy cache_aware \
+    --decode-policy round_robin
 ```
 
-Example:
+## Dynamic Scaling
+
+The router supports runtime scaling through REST APIs:
+
+### Adding Workers
 
 ```bash
-curl -X POST http://localhost:30000/remove_worker?url=http://127.0.0.1:30001
+# Launch a new worker
+python -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --port 30001
-# Successfully removed worker: http://127.0.0.1:30001
+
+# Add it to the router
+curl -X POST 
"http://localhost:30000/add_worker?url=http://127.0.0.1:30001" ``` -Note: +### Removing Workers -- For cache-aware router, the worker will be removed from the tree and the queues. +```bash +curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001" +``` + +**Note**: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues. ## Fault Tolerance -We provide retries based for failure tolerance. +The router includes comprehensive fault tolerance mechanisms: -1. If the request to a worker fails for `max_worker_retries` times, the router will remove the worker from the router and move on to the next worker. -2. If the total number of retries exceeds `max_total_retries`, the router will return an error. +### Retry Configuration -Note: +```bash +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --retry-max-retries 3 \ + --retry-initial-backoff-ms 100 \ + --retry-max-backoff-ms 10000 \ + --retry-backoff-multiplier 2.0 \ + --retry-jitter-factor 0.1 +``` -- `max_worker_retries` is 3 and `max_total_retries` is 6 by default. +### Circuit Breaker -## Routing Strategies +Protects against cascading failures: -### Cache-Aware Load-Balancing Router +```bash +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --cb-failure-threshold 5 \ + --cb-success-threshold 2 \ + --cb-timeout-duration-secs 30 \ + --cb-window-duration-secs 60 +``` -The native router combines two strategies to optimize both cache utilization and request distribution: +**Behavior**: +- Worker is marked unhealthy after `cb-failure-threshold` consecutive failures +- Returns to service after `cb-success-threshold` successful health checks +- Circuit breaker can be disabled with `--disable-circuit-breaker` -1. Cache-Aware Routing (Approximate Tree) -2. 
Load-Balancing Routing (Shortest Queue with Balance Thresholds) +## Routing Policies -The router dynamically switches between these strategies based on load conditions: +The router supports multiple routing strategies: -- Uses load balancing when the system is imbalanced -- Uses cache-aware routing when the system is balanced +### 1. Random Routing +Distributes requests randomly across workers. -A system is considered imbalanced if both conditions are met: +```bash +--policy random +``` -1. (max_load - min_load) > balance_abs_threshold -2. max_load > balance_rel_threshold * min_load +### 2. Round-Robin Routing +Cycles through workers in order. -***Cache-Aware Routing (Approximate Tree)*** +```bash +--policy round_robin +``` -When the workers are considered to be balanced, the router maintains an approximate radix tree for each worker based on request history, eliminating the need for direct cache state queries on each worker. The tree stores raw text characters instead of token IDs to avoid tokenization overhead. +### 3. Power of Two Choices +Samples two workers and routes to the less loaded one. -Process: +```bash +--policy power_of_two +``` -1. For each request, find the worker with the highest prefix match. +### 4. Cache-Aware Load Balancing (Default) - - If match rate > cache_threshold, route the request to the worker with highest match (likely has relevant data cached) - - If match rate ≤ cache_threshold, route the request to the worker with smallest tree size (most available cache capacity) +The most sophisticated policy that combines cache optimization with load balancing: -2. Background maintenance: Periodically evict least recently used leaf nodes on the approximate tree to prevent memory overflow. 
+```bash +--policy cache_aware \ +--cache-threshold 0.5 \ +--balance-abs-threshold 32 \ +--balance-rel-threshold 1.0001 +``` -***Load-Balancing (Shortest Queue)*** +#### How It Works -For unbalanced systems, this strategy tracks pending request counts per worker and routes new requests to the least busy worker. This helps maintain optimal load distribution across workers. +1. **Load Assessment**: Checks if the system is balanced + - Imbalanced if: `(max_load - min_load) > balance_abs_threshold` AND `max_load > balance_rel_threshold * min_load` -***Data-Parallelism Aware Routing*** +2. **Routing Decision**: + - **Balanced System**: Uses cache-aware routing + - Routes to worker with highest prefix match if match > `cache_threshold` + - Otherwise routes to worker with most available cache capacity + - **Imbalanced System**: Uses shortest queue routing to the least busy worker -An additional DP-aware routing strategy can be enabled on top of the sgl-router’s hybrid cache-aware load-balancing strategy by setting the `--dp-aware` flag when starting the router. +3. **Cache Management**: + - Maintains approximate radix trees per worker + - Periodically evicts LRU entries based on `--eviction-interval` and `--max-tree-size` -When this flag is enabled, the router attempts to contact the workers to retrieve the `dp_size` of each one and registers the new workers at the DP-rank level. In this mode, the router applies the cache-aware routing strategy in a more fine-grained manner, with assistance from the DP controller on the SRT side. +### Data Parallelism Aware Routing -By default (when the flag is not set), the SRT’s DP controller distributes incoming requests across DP ranks in a round-robin fashion. +Enables fine-grained control over data parallel replicas: -## Configuration Parameters +```bash +--dp-aware \ +--api-key your_api_key # Required for worker authentication +``` -1. 
`cache_threshold`: (float, 0.0 to 1.0, default: 0.5) - - Minimum prefix match ratio to use highest-match routing. - - Below this threshold, the request will be routed to the worker with most available cache space. +This mode coordinates with SGLang's DP controller for optimized request distribution across data parallel ranks. -2. `balance_abs_threshold`: (integer, default: 32) - - Absolute difference threshold for load imbalance detection. - - The system is potentially imbalanced if (max_load - min_load) > abs_threshold. +## Configuration Reference -3. `balance_rel_threshold`: (float, default: 1.0001) - - Relative ratio threshold for load imbalance detection. - - The system is potentially imbalanced if max_load > min_load * rel_threshold. - - Used in conjunction with `balance_abs_threshold` to determine the final imbalance state. +### Core Settings -4. `eviction_interval`: (integer, default: 60) - - Interval in seconds between LRU eviction cycles for the approximate trees. - - Background thread periodically evicts least recently used nodes to maintain tree size. +| Parameter | Type | Default | Description | +|-----------------------------|------|-------------|-----------------------------------------------------------------| +| `--host` | str | 127.0.0.1 | Router server host address | +| `--port` | int | 30000 | Router server port | +| `--worker-urls` | list | [] | Worker URLs for separate launch mode | +| `--policy` | str | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) | +| `--max-concurrent-requests` | int | 64 | Maximum concurrent requests (rate limiting) | +| `--request-timeout-secs` | int | 600 | Request timeout in seconds | +| `--max-payload-size` | int | 256MB | Maximum request payload size | -5. `max_tree_size`: (integer, default: 16777216) - - Maximum nodes on the approximate tree. - - When exceeded, LRU leaf nodes are evicted during the next eviction cycle. 
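
The load-imbalance check used by cache-aware routing (described above under "How It Works") can be sketched directly from these thresholds. This is a simplified model with the documented defaults, not the router's actual implementation:

```python
def is_imbalanced(loads, abs_threshold=32, rel_threshold=1.0001):
    """Cache-aware routing falls back to shortest-queue routing only when
    BOTH imbalance conditions hold."""
    max_load, min_load = max(loads), min(loads)
    return (max_load - min_load) > abs_threshold and max_load > rel_threshold * min_load

# With the defaults, small absolute load gaps never trigger rebalancing:
print(is_imbalanced([10, 40]))  # False: gap of 30 is within abs_threshold
print(is_imbalanced([10, 50]))  # True: gap of 40 exceeds both thresholds
```

Raising `--balance-abs-threshold` makes the router tolerate larger load gaps before abandoning cache-aware placement.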
+### Cache-Aware Routing Parameters + +| Parameter | Type | Default | Description | +|---------------------------|-------|----------|--------------------------------------------------------| +| `--cache-threshold` | float | 0.5 | Minimum prefix match ratio for cache routing (0.0-1.0) | +| `--balance-abs-threshold` | int | 32 | Absolute load difference threshold | +| `--balance-rel-threshold` | float | 1.0001 | Relative load ratio threshold | +| `--eviction-interval` | int | 60 | Seconds between cache eviction cycles | +| `--max-tree-size` | int | 16777216 | Maximum nodes in routing tree | + +### Fault Tolerance Parameters + +| Parameter | Type | Default | Description | +|------------------------------|-------|---------|---------------------------------------| +| `--retry-max-retries` | int | 3 | Maximum retry attempts per request | +| `--retry-initial-backoff-ms` | int | 100 | Initial retry backoff in milliseconds | +| `--retry-max-backoff-ms` | int | 10000 | Maximum retry backoff in milliseconds | +| `--retry-backoff-multiplier` | float | 2.0 | Backoff multiplier between retries | +| `--retry-jitter-factor` | float | 0.1 | Random jitter factor for retries | +| `--disable-retries` | flag | False | Disable retry mechanism | +| `--cb-failure-threshold` | int | 5 | Failures before circuit opens | +| `--cb-success-threshold` | int | 2 | Successes to close circuit | +| `--cb-timeout-duration-secs` | int | 30 | Circuit breaker timeout duration | +| `--cb-window-duration-secs` | int | 60 | Circuit breaker window duration | +| `--disable-circuit-breaker` | flag | False | Disable circuit breaker | + +### Prefill-Decode Disaggregation Parameters + +| Parameter | Type | Default | Description | +|-----------------------------------|------|---------|-------------------------------------------------------| +| `--pd-disaggregation` | flag | False | Enable PD disaggregated mode | +| `--prefill` | list | [] | Prefill server URLs with optional bootstrap ports | +| `--decode` | list 
| [] | Decode server URLs | +| `--prefill-policy` | str | None | Routing policy for prefill nodes (overrides --policy) | +| `--decode-policy` | str | None | Routing policy for decode nodes (overrides --policy) | +| `--worker-startup-timeout-secs` | int | 300 | Timeout for worker startup | +| `--worker-startup-check-interval` | int | 10 | Interval between startup checks | + +### Kubernetes Integration + +| Parameter | Type | Default | Description | +|---------------------------------|------|--------------------------|------------------------------------------------------| +| `--service-discovery` | flag | False | Enable Kubernetes service discovery | +| `--selector` | list | [] | Label selector for workers (key1=value1 key2=value2) | +| `--prefill-selector` | list | [] | Label selector for prefill servers in PD mode | +| `--decode-selector` | list | [] | Label selector for decode servers in PD mode | +| `--service-discovery-port` | int | 80 | Port for discovered pods | +| `--service-discovery-namespace` | str | None | Kubernetes namespace to watch | +| `--bootstrap-port-annotation` | str | sglang.ai/bootstrap-port | Annotation for bootstrap ports | + +### Observability + +| Parameter | Type | Default | Description | +|------------------------|------|-----------|-------------------------------------------------------| +| `--prometheus-port` | int | 29000 | Prometheus metrics port | +| `--prometheus-host` | str | 127.0.0.1 | Prometheus metrics host | +| `--log-dir` | str | None | Directory for log files | +| `--log-level` | str | info | Logging level (debug, info, warning, error, critical) | +| `--request-id-headers` | list | None | Custom headers for request tracing | + +### CORS Configuration + +| Parameter | Type | Default | Description | +|--------------------------|------|---------|----------------------| +| `--cors-allowed-origins` | list | [] | Allowed CORS origins | + +## Advanced Features + +### Kubernetes Service Discovery + +Automatically discover and 
manage workers in Kubernetes: + +#### Standard Mode +```bash +python -m sglang_router.launch_router \ + --service-discovery \ + --selector app=sglang-worker env=prod \ + --service-discovery-namespace production \ + --service-discovery-port 8000 +``` + +#### Prefill-Decode Disaggregation Mode +```bash +python -m sglang_router.launch_router \ + --pd-disaggregation \ + --service-discovery \ + --prefill-selector app=prefill-server env=prod \ + --decode-selector app=decode-server env=prod \ + --service-discovery-namespace production +``` + +**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value. + +### Prometheus Metrics + +Expose metrics for monitoring: + +```bash +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --prometheus-port 29000 \ + --prometheus-host 0.0.0.0 +``` + +Metrics available at `http://localhost:29000/metrics` + +### Request Tracing + +Enable request ID tracking: + +```bash +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --request-id-headers x-request-id x-trace-id +``` + +## Troubleshooting + +### Common Issues + +1. **Workers not connecting**: Ensure workers are fully initialized before starting the router. Use `--worker-startup-timeout-secs` to increase wait time. + +2. **High latency**: Check if cache-aware routing is causing imbalance. Try adjusting `--balance-abs-threshold` and `--balance-rel-threshold`. + +3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval` for more aggressive cache cleanup. + +4. **Circuit breaker triggering frequently**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`. 
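
When diagnosing frequent circuit-breaker trips, it helps to model the thresholds explicitly. The sketch below is a simplified model of the documented open/close counting (the real router additionally honors the timeout and window durations):

```python
class CircuitBreakerSketch:
    """Opens after `failure_threshold` consecutive failures; closes again
    after `success_threshold` consecutive successes. Illustrative only."""

    def __init__(self, failure_threshold=5, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.open = False  # open = worker marked unhealthy

    def record(self, ok: bool):
        if ok:
            self.failures = 0
            self.successes += 1
            if self.open and self.successes >= self.success_threshold:
                self.open = False  # worker returns to service
        else:
            self.successes = 0
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # stop routing to this worker

cb = CircuitBreakerSketch()
for _ in range(5):
    cb.record(False)
print(cb.open)  # True: 5 consecutive failures open the circuit
cb.record(True)
cb.record(True)
print(cb.open)  # False: 2 consecutive successes close it again
```

A single success resets the failure count, so only sustained failure streaks open the circuit; raising `--cb-failure-threshold` lengthens the required streak.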
+ +### Debug Mode + +Enable detailed logging: + +```bash +python -m sglang_router.launch_router \ + --worker-urls http://worker1:8000 http://worker2:8001 \ + --log-level debug \ + --log-dir ./router_logs +```
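
As a closing note on PD mode, the four policy-inheritance cases listed under "Policy Inheritance in PD Mode" reduce to a per-node-type override. A simplified sketch of that documented behavior:

```python
def effective_policies(policy="cache_aware", prefill_policy=None, decode_policy=None):
    """Return (prefill, decode) routing policies: --prefill-policy and
    --decode-policy each override --policy for their node type."""
    return (prefill_policy or policy, decode_policy or policy)

# The four documented cases:
print(effective_policies("round_robin"))                                # ('round_robin', 'round_robin')
print(effective_policies("round_robin", prefill_policy="cache_aware"))  # ('cache_aware', 'round_robin')
print(effective_policies("round_robin", decode_policy="power_of_two"))  # ('round_robin', 'power_of_two')
print(effective_policies("round_robin", "cache_aware", "power_of_two")) # ('cache_aware', 'power_of_two')
```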