From a85363c199477852bacf4f129179d09b53ff88d9 Mon Sep 17 00:00:00 2001
From: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Date: Wed, 27 Aug 2025 09:30:57 +0800
Subject: [PATCH] [docs] Instructions for bench_serving.py (#9071)

Co-authored-by: Mick
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: zhaochenyang20
Co-authored-by: Yineng Zhang
---
 .../separate_reasoning.ipynb          |   5 +-
 docs/developer_guide/bench_serving.md | 319 ++++++++++++++++++
 .../benchmark_and_profiling.md        |   3 +
 docs/index.rst                        |   1 +
 4 files changed, 327 insertions(+), 1 deletion(-)
 create mode 100644 docs/developer_guide/bench_serving.md

diff --git a/docs/advanced_features/separate_reasoning.ipynb b/docs/advanced_features/separate_reasoning.ipynb
index 4886a4680..586d3a978 100644
--- a/docs/advanced_features/separate_reasoning.ipynb
+++ b/docs/advanced_features/separate_reasoning.ipynb
@@ -33,7 +33,10 @@
     "- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use `qwen3` or `qwen3-thinking` parser, always thinks\n",
     "\n",
     "**Kimi:**\n",
-    "- Kimi: Uses special `◁think▷` and `◁/think▷` tags"
+    "- Kimi: Uses special `◁think▷` and `◁/think▷` tags\n",
+    "\n",
+    "**GPT OSS:**\n",
+    "- GPT OSS: Uses special `<|channel|>analysis<|message|>` and `<|end|>` tags"
    ]
   },
   {

diff --git a/docs/developer_guide/bench_serving.md b/docs/developer_guide/bench_serving.md
new file mode 100644
index 000000000..35c9b2b0f
--- /dev/null
+++ b/docs/developer_guide/bench_serving.md
@@ -0,0 +1,319 @@
## Bench Serving Guide

This guide explains how to benchmark online serving throughput and latency using `python -m sglang.bench_serving`. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.

### What it does

- Generates synthetic or dataset-driven prompts and submits them to a target serving endpoint
- Measures throughput, time-to-first-token (TTFT), inter-token latency (ITL), per-request end-to-end latency, and more
- Supports streaming or non-streaming modes, rate control, and concurrency limits

### Supported backends and endpoints

- `sglang` / `sglang-native`: `POST /generate`
- `sglang-oai`, `vllm`, `lmdeploy`: `POST /v1/completions`
- `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: `POST /v1/chat/completions`
- `trt` (TensorRT-LLM): `POST /v2/models/ensemble/generate_stream`
- `gserver`: custom server (not implemented yet in this script)
- `truss`: `POST /v1/models/model:predict`

If `--base-url` is provided, requests are sent to it. Otherwise, `--host` and `--port` are used. When `--model` is not provided, the script will attempt to query `GET /v1/models` for an available model ID (OpenAI-compatible endpoints).

### Prerequisites

- Python 3.8+
- Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install them as needed (see the example below).
- An inference server running and reachable via the endpoints above
- If your server requires authentication, set the environment variable `OPENAI_API_KEY` (sent as `Authorization: Bearer $OPENAI_API_KEY`)
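One way to install everything up front is shown below; which packages you actually need depends on the datasets you use (for example, `datasets`, `pillow`, and `pybase64` are only required by the image and MMMU datasets):

```bash
# Core client dependencies, plus the optional extras used by some datasets
pip install aiohttp numpy requests tqdm transformers datasets pillow pybase64
```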
### Quick start

Run a basic benchmark against an sglang server exposing `/generate`:

```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
```

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Or, using an OpenAI-compatible endpoint (completions):

```bash
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```

### Datasets

Select with `--dataset-name`:

- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override output lengths with `--sharegpt-output-len`
- `random`: random input/output lengths, with content sampled from the ShareGPT token space
- `random-ids`: random token IDs (can lead to gibberish outputs)
- `random-image`: generates random images and wraps them in chat messages; supports custom resolutions via the 'heightxwidth' format
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
- `mmmu`: samples from MMMU (Math split) and includes images

Common dataset flags:

- `--num-prompts N`: number of requests
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/random-image
- `--random-image-num-images`, `--random-image-resolution`: for the random-image dataset (supports presets 1080p/720p/360p or the custom 'heightxwidth' format)
- `--apply-chat-template`: apply the tokenizer's chat template when constructing prompts
- `--dataset-path PATH`: file path for the ShareGPT JSON; if it is not set and no cached file exists, the dataset will be downloaded and cached

Generated Shared Prefix flags (for `generated-shared-prefix`):

- `--gsp-num-groups`: number of groups, each sharing one system prompt
- `--gsp-prompts-per-group`: number of prompts per group
- `--gsp-system-prompt-len`: length of the shared system prompt
- `--gsp-question-len`: length of each question
- `--gsp-output-len`: output length per request

Random Image dataset flags (for `random-image`):

- `--random-image-num-images`: number of images per request
- `--random-image-resolution`: image resolution; supports presets (1080p, 720p, 360p) or the custom 'heightxwidth' format (e.g., 1080x1920, 512x768)

### Examples

1. To benchmark the random-image dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length, run:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
```

```bash
python -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name random-image \
    --num-prompts 500 \
    --random-image-num-images 3 \
    --random-image-resolution 720p \
    --random-input-len 512 \
    --random-output-len 512
```

2. To benchmark the random dataset with 3000 prompts, 1024 input length, and 1024 output length, run:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct
```

```bash
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --num-prompts 3000 \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --random-range-ratio 0.5
```

### Choosing model and tokenizer

- `--model` is required unless the backend exposes `GET /v1/models`, in which case the first model ID is auto-selected (see the query example below).
- `--tokenizer` defaults to `--model`. Both can be HF model IDs or local paths.
- For ModelScope workflows, setting `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed).
- If your tokenizer lacks a chat template, the script prints a warning, since token counting is less robust when a base model produces gibberish output.
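If you are unsure what to pass as `--model`, you can ask the server directly; this is the same `GET /v1/models` query the script performs when `--model` is omitted. A quick check (the host and port follow the quick-start example and may differ for your deployment):

```bash
# Returns a JSON list of available models; bench_serving auto-selects the first ID
curl http://127.0.0.1:30000/v1/models
```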
### Rate, concurrency, and streaming

- `--request-rate`: requests per second. `inf` sends all requests immediately (burst). A finite rate draws arrival times from a Poisson process.
- `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate.
- `--disable-stream`: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions.

### Other key options

- `--output-file FILE.jsonl`: append JSONL results to a file; auto-named if unspecified
- `--output-details`: include per-request arrays (generated texts, errors, TTFTs, ITLs, input/output lengths)
- `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: JSON merged into the request payload (sampling params, etc.)
- `--disable-ignore-eos`: respect EOS tokens instead of forcing the full output length (exact behavior varies by backend)
- `--warmup-requests N`: run N warmup requests with short outputs before the main run (default 1)
- `--flush-cache`: call `/flush_cache` (sglang) before the main run
- `--profile`: call `/start_profile` and `/stop_profile` (requires the server to enable profiling, e.g., via `SGLANG_TORCH_PROFILER_DIR`)
- `--lora-name name1 name2 ...`: randomly pick one per request and pass it to the backend (e.g., `lora_path` for sglang)
- `--tokenize-prompt`: send integer token IDs instead of text (currently supports `--backend sglang` only)

### Authentication

If your target endpoint requires OpenAI-style auth, set:

```bash
export OPENAI_API_KEY=sk-...yourkey...
```

The script will add `Authorization: Bearer $OPENAI_API_KEY` automatically for OpenAI-compatible routes.

### Metrics explained

Printed after each run:

- Request throughput (req/s)
- Input token throughput (tok/s)
- Output token throughput (tok/s)
- Total token throughput (tok/s)
- Concurrency: the sum of all per-request latencies divided by wall-clock time, i.e., the average number of requests in flight
- End-to-End Latency (ms): mean/median/std/p99 per-request total latency
- Time to First Token (TTFT, ms): mean/median/std/p99, for streaming mode
- Inter-Token Latency (ITL, ms): mean/median/std/p95/p99/max between consecutive tokens
- TPOT (ms): time per output token after the first, i.e., `(latency - ttft) / (tokens - 1)`
- Accept length (sglang only, if available): speculative decoding accept length

The script also retokenizes the generated text with the configured tokenizer and reports "retokenized" token counts.

### JSONL output format

When `--output-file` is set, one JSON object is appended per run (an inspection example follows the field lists). Base fields:

- Arguments summary: backend, dataset, request_rate, max_concurrency, etc.
- Duration and totals: completed, total_input_tokens, total_output_tokens, retokenized totals
- Throughputs and latency statistics, as printed in the console
- `accept_length` when available (sglang)

With `--output-details`, an extended object also includes per-request arrays:

- `input_lens`, `output_lens`
- `ttfts`, `itls` (`itls` holds one array of inter-token latencies per request)
- `generated_texts`, `errors`
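Because each run is a single JSON object on its own line, ordinary JSON tooling is enough to inspect the results; for example, assuming you benchmarked with `--output-file results.jsonl`:

```bash
# Pretty-print the record appended by the most recent run
tail -n 1 results.jsonl | python3 -m json.tool
```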
### End-to-end examples

1) sglang native `/generate` (streaming):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
  --num-prompts 2000 \
  --request-rate 100 \
  --max-concurrency 512 \
  --output-file sglang_random.jsonl --output-details
```

2) OpenAI-compatible Completions (e.g., vLLM):

```bash
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --sharegpt-output-len 256
```

3) OpenAI-compatible Chat Completions (streaming):

```bash
python3 -m sglang.bench_serving \
  --backend vllm-chat \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 500 \
  --apply-chat-template
```

4) Random images (VLM) with chat template:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name random-image \
  --random-image-num-images 2 \
  --random-image-resolution 720p \
  --random-input-len 128 --random-output-len 256 \
  --num-prompts 200 \
  --apply-chat-template
```

4a) Random images with a custom resolution:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name random-image \
  --random-image-num-images 1 \
  --random-image-resolution 512x768 \
  --random-input-len 64 --random-output-len 128 \
  --num-prompts 100 \
  --apply-chat-template
```

5) Generated shared prefix (long system prompts + short questions):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 64 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
  --num-prompts 1024
```

6) Tokenized prompts (token IDs) for strict length control (sglang only):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --tokenize-prompt \
  --random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2
```

7) Profiling and cache flush (sglang):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --profile \
  --flush-cache
```

8) TensorRT-LLM streaming endpoint:

```bash
python3 -m sglang.bench_serving \
  --backend trt \
  --base-url http://127.0.0.1:8000 \
  --model your-trt-llm-model \
  --dataset-name random \
  --num-prompts 100 \
  --disable-ignore-eos
```

### Troubleshooting

- All requests failed: verify `--backend`, the server URL/port, `--model`, and authentication, and check the warmup errors printed by the script (a quick connectivity check is shown below).
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
- Random-image/MMMU datasets: ensure you installed the extra dependencies (`pillow`, `datasets`, `pybase64`).
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.
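When every request fails, it is worth confirming the endpoint outside the benchmark first. One minimal check against an OpenAI-compatible server (adjust the URL to your deployment, and drop the header if auth is disabled):

```bash
# A JSON model list means the URL and auth are fine; a 401/403 points to
# authentication, and "connection refused" points to the host/port
curl -H "Authorization: Bearer $OPENAI_API_KEY" http://127.0.0.1:8000/v1/models
```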
### Notes

- The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections.
- For sglang, `/get_server_info` is queried after the run to report the speculative decoding accept length when available.

diff --git a/docs/developer_guide/benchmark_and_profiling.md b/docs/developer_guide/benchmark_and_profiling.md
index 019805456..948c837ff 100644
--- a/docs/developer_guide/benchmark_and_profiling.md
+++ b/docs/developer_guide/benchmark_and_profiling.md
@@ -31,6 +31,7 @@
 [Pytorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool to inspect kernel execution time, call stack, and kernel overlap and occupancy.

 ### Profile a server with `sglang.bench_serving`
+
 ```bash
 # set trace path
 export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
@@ -44,6 +45,8 @@ python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-

 Please make sure that `SGLANG_TORCH_PROFILER_DIR` is set on both the server and client side; otherwise the trace file cannot be generated correctly. A reliable approach is to set `SGLANG_TORCH_PROFILER_DIR` in the shell's rc file (e.g., `~/.bashrc` for bash shells).

+For more details, please refer to the [Bench Serving Guide](./bench_serving.md).
+
 ### Profile a server with `sglang.bench_offline_throughput`
 ```bash
 export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

diff --git a/docs/index.rst b/docs/index.rst
index 5eeca7892..040aa53f3 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -79,6 +79,7 @@ The core features include:
    developer_guide/contribution_guide.md
    developer_guide/development_guide_using_docker.md
    developer_guide/benchmark_and_profiling.md
+   developer_guide/bench_serving.md

 .. toctree::
    :maxdepth: 1