From 8e776c78a1976c97abd4f6b68ad46a821c119704 Mon Sep 17 00:00:00 2001
From: Jonah Bernard <96398205+Jonahcb@users.noreply.github.com>
Date: Sun, 12 Oct 2025 23:03:27 -0400
Subject: [PATCH] docs(router): add token-bucket rate limiting to the docs (#11485)

---
 docs/advanced_features/router.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/docs/advanced_features/router.md b/docs/advanced_features/router.md
index 5eb4a2ff0..f2bb68942 100644
--- a/docs/advanced_features/router.md
+++ b/docs/advanced_features/router.md
@@ -11,6 +11,7 @@ The SGLang Router is a high-performance request distribution system that routes
 - **Kubernetes Integration**: Native service discovery and pod management
 - **Prefill-Decode Disaggregation**: Support for disaggregated serving load balancing
 - **Prometheus Metrics**: Built-in observability and monitoring
+- **Rate Limiter**: Token-bucket rate limiter to shield workers from overload

 ## Installation

@@ -229,6 +230,35 @@ python -m sglang_router.launch_router \
 - Returns to service after `cb-success-threshold` successful health checks
 - Circuit breaker can be disabled with `--disable-circuit-breaker`

+### Rate Limiter
+
+Use the token-bucket rate limiter to cap requests before they overwhelm downstream workers.
+
+- Enable rate limiting by setting `--max-concurrent-requests` to a positive integer. A bucket with that many tokens (concurrent leases) is created; `-1` keeps it disabled.
+- Optionally override the refill rate with `--rate-limit-tokens-per-second`. If omitted, the refill rate matches `max-concurrent-requests`.
+- Overflow traffic can wait in a FIFO queue controlled by:
+  - `--queue-size`: pending-request buffer (0 disables queuing; defaults to 100).
+  - `--queue-timeout-secs`: maximum wait time for queued requests before returning `408` (defaults to 60 seconds).
+
+Example:
+
+```bash
+python -m sglang_router.launch_router \
+    --worker-urls http://worker1:8000 http://worker2:8001 \
+    --max-concurrent-requests 256 \
+    --rate-limit-tokens-per-second 512 \
+    --queue-size 128 \
+    --queue-timeout-secs 30
+```
+
+**Behavior**:
+
+This configuration allows up to 256 concurrent requests, refills 512 tokens (requests) per second, and keeps up to 128 overflow requests queued for 30 seconds before timing out.
+
+**Responses**:
+- Returns **429** when the router cannot enqueue the request (queue disabled or full).
+- Returns **408** when a queued request waits longer than `--queue-timeout-secs` or no token becomes available before the timeout.
+
 ## Routing Policies

 The router supports multiple routing strategies:
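
Reviewer note on the Rate Limiter section added by this patch: the admission behavior it describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the router's implementation. The names `TokenBucket`, `admit`, and `resolve_queued` are hypothetical, the clock is passed in explicitly so the example is deterministic, and only time-based refill is modeled (whether tokens are also returned when a request completes is not covered here).

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Illustrative token bucket (hypothetical, not the router's code)."""
    capacity: float        # plays the role of --max-concurrent-requests
    refill_per_sec: float  # plays the role of --rate-limit-tokens-per-second
    last_refill: float = 0.0
    tokens: float = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: the bucket starts full.
        self.tokens = self.capacity

    def try_acquire(self, now: float) -> bool:
        """Refill based on elapsed time, then take one token if available."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admit(bucket: TokenBucket, queue_len: int, queue_size: int, now: float) -> str:
    """Decide what happens to a newly arrived request."""
    if bucket.try_acquire(now):
        return "forward"
    # No token: reject if queuing is disabled (size 0) or the queue is full.
    if queue_size == 0 or queue_len >= queue_size:
        return "429"
    return "queued"


def resolve_queued(bucket: TokenBucket, enqueued_at: float, now: float,
                   queue_timeout_secs: float) -> str:
    """Decide what happens to a request that is already waiting in the queue."""
    if now - enqueued_at > queue_timeout_secs:
        return "408"
    if bucket.try_acquire(now):
        return "forward"
    return "queued"
```

With `capacity=2` and `refill_per_sec=1`, two back-to-back requests are forwarded, a third is queued if the queue has room, and a fourth is rejected with `429` once the queue is full. A queued request that outlives its timeout resolves to `408`, matching the **Responses** list in the patch.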