sglang v0.5.2 & support Qwen3-Next-80B-A3B-Instruct

New file: `docs/developer_guide/bench_serving.md` (334 lines)
# Bench Serving Guide

This guide explains how to benchmark online serving throughput and latency using `python -m sglang.bench_serving`. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.

### What it does

- Generates synthetic or dataset-driven prompts and submits them to a target serving endpoint
- Measures throughput, time-to-first-token (TTFT), inter-token latency (ITL), per-request end-to-end latency, and more
- Supports streaming or non-streaming modes, rate control, and concurrency limits

### Supported backends and endpoints

- `sglang` / `sglang-native`: `POST /generate`
- `sglang-oai`, `vllm`, `lmdeploy`: `POST /v1/completions`
- `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: `POST /v1/chat/completions`
- `trt` (TensorRT-LLM): `POST /v2/models/ensemble/generate_stream`
- `gserver`: custom server (not yet implemented in this script)
- `truss`: `POST /v1/models/model:predict`

If `--base-url` is provided, requests are sent to it; otherwise `--host` and `--port` are used. When `--model` is not provided, the script attempts to query `GET /v1/models` for an available model ID (OpenAI-compatible endpoints only).
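The URL-resolution and model-discovery behavior described above can be sketched as follows. This is an illustrative sketch, not the script's actual code; the function names are hypothetical.

```python
# Hypothetical sketch of how the target URL could be resolved from
# --base-url / --host / --port, per the rules described above.

def resolve_base_url(base_url=None, host="127.0.0.1", port=30000):
    """Prefer --base-url; otherwise build a URL from --host and --port."""
    return base_url if base_url else f"http://{host}:{port}"

# OpenAI-compatible servers expose GET /v1/models; the first model ID can
# serve as a default when --model is omitted.
def pick_model(models_response):
    """Return the first model ID from a /v1/models response body."""
    data = models_response.get("data", [])
    return data[0]["id"] if data else None

print(resolve_base_url(base_url="http://10.0.0.5:8000"))  # http://10.0.0.5:8000
print(resolve_base_url())                                 # http://127.0.0.1:30000
```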
### Prerequisites

- Python 3.8+
- Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed.
- An inference server running and reachable via one of the endpoints above
- If your server requires authentication, set the environment variable `OPENAI_API_KEY` (sent as `Authorization: Bearer <key>`)

### Quick start

Run a basic benchmark against an sglang server exposing `/generate`:

```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
```

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```
Or, using an OpenAI-compatible endpoint (completions):

```bash
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```
### Datasets

Select with `--dataset-name`:

- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override output lengths with `--sharegpt-output-len`
- `random`: random text lengths, sampled from the ShareGPT token space
- `random-ids`: random token IDs (can produce gibberish)
- `random-image`: generates random images and wraps them in chat messages; supports custom resolutions via the `heightxwidth` format
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
- `mmmu`: samples from MMMU (Math split) and includes images

Common dataset flags:

- `--num-prompts N`: number of requests
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/random-image
- `--random-image-num-images`, `--random-image-resolution`: for the random-image dataset (supports presets 1080p/720p/360p or the custom `heightxwidth` format)
- `--apply-chat-template`: apply the tokenizer's chat template when constructing prompts
- `--dataset-path PATH`: file path to the ShareGPT JSON; if left blank and the file is missing, it is downloaded and cached automatically

Generated Shared Prefix flags (for `generated-shared-prefix`):

- `--gsp-num-groups`
- `--gsp-prompts-per-group`
- `--gsp-system-prompt-len`
- `--gsp-question-len`
- `--gsp-output-len`

Random Image dataset flags (for `random-image`):

- `--random-image-num-images`: number of images per request
- `--random-image-resolution`: image resolution; supports presets (1080p, 720p, 360p) or a custom `heightxwidth` format (e.g., 1080x1920, 512x768)
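A resolution value in either form could be parsed as in the sketch below. This illustrates the assumed behavior of the `--random-image-resolution` flag, not the script's exact implementation; the preset dimensions shown are conventional assumptions.

```python
# Illustrative parser for a --random-image-resolution value: either a preset
# name or a custom "heightxwidth" string, returning (height, width).

PRESETS = {"1080p": (1080, 1920), "720p": (720, 1280), "360p": (360, 640)}

def parse_resolution(value: str):
    """Accept a preset like '720p' or a custom 'heightxwidth' like '512x768'."""
    if value in PRESETS:
        return PRESETS[value]
    height, width = value.lower().split("x")
    return int(height), int(width)

print(parse_resolution("720p"))     # (720, 1280)
print(parse_resolution("512x768"))  # (512, 768)
```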
### Examples

1. To benchmark the random-image dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length, run:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
```

```bash
python -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --dataset-name random-image \
  --num-prompts 500 \
  --random-image-num-images 3 \
  --random-image-resolution 720p \
  --random-input-len 512 \
  --random-output-len 512
```

2. To benchmark the random dataset with 3000 prompts, 1024 input length, and 1024 output length, run:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct
```

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 3000 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --random-range-ratio 0.5
```
### Choosing model and tokenizer

- `--model` is required unless the backend exposes `GET /v1/models`, in which case the first model ID is auto-selected.
- `--tokenizer` defaults to `--model`. Both can be HF model IDs or local paths.
- For ModelScope workflows, setting `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed).
- If your tokenizer lacks a chat template, the script warns you, because token counting is less robust for gibberish outputs.

### Rate, concurrency, and streaming

- `--request-rate`: requests per second. `inf` sends all requests immediately (burst). A finite rate draws arrival times from a Poisson process.
- `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate.
- `--disable-stream`: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions.
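The Poisson arrival behavior above can be sketched as follows: inter-arrival gaps are drawn from an exponential distribution with mean `1/request_rate`. This is an illustration of the pacing model, not the benchmark's exact code.

```python
# Sketch of Poisson-process request pacing: exponential inter-arrival gaps
# with mean 1/request_rate; an infinite rate means a single burst.
import random

def arrival_gaps(request_rate: float, n: int, seed: int = 0):
    """Return n inter-arrival gaps in seconds."""
    if request_rate == float("inf"):
        return [0.0] * n  # burst: no waiting between requests
    rng = random.Random(seed)
    return [rng.expovariate(request_rate) for _ in range(n)]

gaps = arrival_gaps(request_rate=100.0, n=1000)
print(round(sum(gaps) / len(gaps), 4))  # mean gap is close to 0.01 s at 100 req/s
```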
### Other key options

- `--output-file FILE.jsonl`: append JSONL results to a file; auto-named if unspecified
- `--output-details`: include per-request arrays (generated texts, errors, TTFTs, ITLs, input/output lengths)
- `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: JSON merged into each request payload (sampling params, etc.)
- `--disable-ignore-eos`: pass through EOS behavior (varies by backend)
- `--warmup-requests N`: run N warmup requests with short outputs first (default 1)
- `--flush-cache`: call `/flush_cache` (sglang) before the main run
- `--profile`: call `/start_profile` and `/stop_profile` (requires the server to enable profiling, e.g., via `SGLANG_TORCH_PROFILER_DIR`)
- `--lora-name name1 name2 ...`: randomly pick one per request and pass it to the backend (e.g., `lora_path` for sglang)
- `--tokenize-prompt`: send integer token IDs instead of text (currently supports `--backend sglang` only)

### Authentication

If your target endpoint requires OpenAI-style auth, set:

```bash
export OPENAI_API_KEY=sk-...yourkey...
```

The script automatically adds `Authorization: Bearer $OPENAI_API_KEY` for OpenAI-compatible routes.
### Metrics explained

Printed after each run:

- Request throughput (req/s)
- Input token throughput (tok/s)
- Output token throughput (tok/s)
- Total token throughput (tok/s)
- Concurrency: aggregate time of all requests divided by wall time
- End-to-End Latency (ms): mean/median/std/p99 per-request total latency
- Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
- Inter-Token Latency (ITL, ms): mean/median/std/p95/p99/max between tokens
- TPOT (ms): time per output token after the first, i.e., `(latency - ttft) / (tokens - 1)`
- Accept length (sglang only, if available): speculative decoding accept length

The script also retokenizes generated text with the configured tokenizer and reports "retokenized" counts.
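The latency definitions above relate as in this minimal sketch. The formulas follow the definitions stated in this guide; the function names are illustrative, not the script's.

```python
# Minimal sketch of the TPOT and ITL definitions given above.

def tpot_ms(latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token after the first: (latency - ttft) / (tokens - 1)."""
    return (latency_ms - ttft_ms) / (output_tokens - 1)

def itl_ms(token_timestamps_ms):
    """Inter-token latencies: gaps between consecutive token arrival times."""
    return [b - a for a, b in zip(token_timestamps_ms, token_timestamps_ms[1:])]

print(tpot_ms(latency_ms=1200.0, ttft_ms=200.0, output_tokens=101))  # 10.0
print(itl_ms([0.0, 12.0, 25.0]))  # [12.0, 13.0]
```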
### JSONL output format

When `--output-file` is set, one JSON object is appended per run. Base fields:

- Arguments summary: backend, dataset, request_rate, max_concurrency, etc.
- Duration and totals: completed, total_input_tokens, total_output_tokens, retokenized totals
- Throughputs and latency statistics, as printed in the console
- `accept_length` when available (sglang)

With `--output-details`, an extended object also includes per-request arrays:

- `input_lens`, `output_lens`
- `ttfts`, `itls` (one ITL array per request)
- `generated_texts`, `errors`
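Consuming the results file is straightforward since each line is one complete JSON object. A hedged sketch (the exact field names, such as `request_throughput`, are assumptions based on the descriptions above):

```python
# Read one JSON object per line from a bench_serving JSONL results file.
import json

def load_runs(path: str):
    """Return a list of run records, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage sketch: compare runs in a results file (field names are assumed).
# for run in load_runs("sglang_random.jsonl"):
#     print(run.get("backend"), run.get("request_throughput"))
```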
### End-to-end examples

1) sglang native `/generate` (streaming):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
  --num-prompts 2000 \
  --request-rate 100 \
  --max-concurrency 512 \
  --output-file sglang_random.jsonl --output-details
```

2) OpenAI-compatible Completions (e.g., vLLM):

```bash
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --sharegpt-output-len 256
```
3) OpenAI-compatible Chat Completions (streaming):

```bash
python3 -m sglang.bench_serving \
  --backend vllm-chat \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 500 \
  --apply-chat-template
```

4) Random images (VLM) with chat template:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name random-image \
  --random-image-num-images 2 \
  --random-image-resolution 720p \
  --random-input-len 128 --random-output-len 256 \
  --num-prompts 200 \
  --apply-chat-template
```

4a) Random images with a custom resolution:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name random-image \
  --random-image-num-images 1 \
  --random-image-resolution 512x768 \
  --random-input-len 64 --random-output-len 128 \
  --num-prompts 100 \
  --apply-chat-template
```
5) Generated shared prefix (long system prompts + short questions):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 64 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
  --num-prompts 1024
```

6) Tokenized prompts (token IDs) for strict length control (sglang only):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --tokenize-prompt \
  --random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2
```

7) Profiling and cache flush (sglang):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --profile \
  --flush-cache
```
8) TensorRT-LLM streaming endpoint:

```bash
python3 -m sglang.bench_serving \
  --backend trt \
  --base-url http://127.0.0.1:8000 \
  --model your-trt-llm-model \
  --dataset-name random \
  --num-prompts 100 \
  --disable-ignore-eos
```

9) Evaluating large-scale KV cache sharing with a mooncake trace (sglang only). `--mooncake-workload` takes one of `conversation`, `mooncake`, `agent`, or `synthetic`:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-model-name \
  --dataset-name mooncake \
  --mooncake-slowdown-factor 1.0 \
  --mooncake-num-rounds 1000 \
  --mooncake-workload conversation \
  --use-trace-timestamps true \
  --random-output-len 256
```
### Troubleshooting

- All requests failed: verify `--backend`, the server URL/port, `--model`, and authentication. Check the warmup errors printed by the script.
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
- Random-image/MMMU datasets: ensure you installed the extra dependencies (`pillow`, `datasets`, `pybase64`).
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.

### Notes

- The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to accommodate many concurrent connections.
- For sglang, `/get_server_info` is queried after the run to report the speculative decoding accept length when available.
New file: `docs/developer_guide/benchmark_and_profiling.md` (182 lines)
# Benchmark and Profiling

## Benchmark

- Benchmark the latency of running a single static batch without a server. The arguments are the same as for `launch_server.py`.
  Note that this is a simplified test script without a dynamic batching server, so it may run out of memory at a batch size that a real server could handle. A real server splits the prefill into several batches, while this simplified script does not.
- Without a server (no need to launch a server):

```bash
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
```

- With a server (use `sglang.launch_server` to launch a server first, then run the following command):

```bash
python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
```

- Benchmark offline processing. This script starts an offline engine and runs the benchmark:

```bash
python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
```

- Benchmark online serving. Use `sglang.launch_server` to launch a server first, then run:

```bash
python3 -m sglang.bench_serving --backend sglang --num-prompts 10
```
## Profile with PyTorch Profiler

[PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool to inspect kernel execution time, call stacks, and kernel overlap and occupancy.

### Profile a server with `sglang.bench_serving`

```bash
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

# send profiling request from client
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
```

Make sure `SGLANG_TORCH_PROFILER_DIR` is set on both the server and the client side; otherwise the trace files cannot be generated correctly. A reliable way to do this is to set `SGLANG_TORCH_PROFILER_DIR` in your shell's rc file (e.g., `~/.bashrc` for bash).

For more details, please refer to the [Bench Serving Guide](./bench_serving.md).
### Profile `sglang.bench_one_batch` and `sglang.bench_offline_throughput`

```bash
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# profile one batch with bench_one_batch.py
# batch size can be controlled with the --batch argument
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile

# profile multiple batches with bench_offline_throughput.py
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
### Profile a server with `sglang.profiler`

When the server is running (e.g., processing a decoding request), you can start live profiling immediately by sending a profile request to the server.

You can do this by running `python3 -m sglang.profiler`. For example:

```bash
# Terminal 1: Send a generation request
python3 -m sglang.test.send_one

# Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal.
# It will generate a profile of the above request for several decoding batches.
python3 -m sglang.profiler
```
### Possible PyTorch bugs

If you encounter the following error (for example, with Qwen2.5-VL):

```bash
RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.
```

This is likely a PyTorch bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you can disable `with_stack` via an environment variable:

```bash
export SGLANG_PROFILE_WITH_STACK=False
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
### View traces

Trace files can be loaded and visualized in:

1. https://ui.perfetto.dev/ (any browser)
2. chrome://tracing (Chrome browser only)

If the browser cannot open a trace file because it is too large, the client can generate a smaller trace file (<100 MB) by limiting the number of prompts and the output lengths. For example, when profiling a server:

```bash
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
```

This command sets the number of prompts to 2 with the `--num-prompts` argument and limits output sequences to 100 tokens with the `--sharegpt-output-len` argument, producing a trace file small enough for the browser to open smoothly.

Additionally, if you want to map CUDA kernels in the trace back to SGLang Python source code, you need to disable CUDA Graph when starting the server by adding the `--disable-cuda-graph` flag.
## Profile with Nsight

[Nsight Systems](https://docs.nvidia.com/nsight-systems/) is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions, and low-level CUDA APIs and events.

1. Prerequisite:

Install using apt, or run inside an [NVIDIA Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) or [SGLang Docker container](https://github.com/sgl-project/sglang/tree/main/docker).

```bash
# install nsys
# https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli
```

2. To profile a single batch, use:

```bash
nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512
```
3. To profile a server, e.g.:

```bash
# launch the server; set the delay and duration according to your needs
# after the duration has elapsed, the server will be killed by nsys

nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input-len 1024 --random-output-len 512
```

In practice, we recommend setting the `--duration` argument to a large value. Whenever you want the server to stop profiling, first run:

```bash
nsys sessions list
```

to get the session ID in the form `profile-XXXXX`, then run:

```bash
nsys stop --session=profile-XXXXX
```

to stop the profiler manually and generate the `nsys-rep` file immediately.
4. Use NVTX to annotate code regions, e.g., to see their execution time:

```bash
# install nvtx
pip install nvtx
```

```python
import nvtx

# annotate a critical region; the label and color appear in the nsys timeline
with nvtx.annotate("critical_section", color="red"):
    result = sum(i * i for i in range(1000))  # some critical code
```

## Other tips

1. You can benchmark a model using dummy weights by only providing the `config.json` file. This allows quick testing of model variants without downloading real weights. To do so, add `--load-format dummy` to the above commands; then you only need a correct `config.json` under the checkpoint folder.
2. You can benchmark a model with modified configs (e.g., fewer layers) using `--json-model-override-args`. For example, you can benchmark a model with only 1 layer and 1 KV head using:

```bash
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}'
```

3. You can use `--python-backtrace=cuda` to see the Python call stack for all CUDA kernels, as in PyTorch Profiler. (Caveat: this can cause inaccurately long kernel runtimes for CUDA-event-based timing.)
4. For more arguments, see the [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).
New file: `docs/developer_guide/contribution_guide.md` (103 lines)
# Contribution Guide

Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you're fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.

## Install SGLang from Source

### Fork and clone the repository

**Note**: New contributors do **not** have write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally.

```bash
git clone https://github.com/<your_user_name>/sglang.git
```

### Build from source

Refer to [Install SGLang from Source](../get_started/install.md#method-2-from-source).

## Format code with pre-commit

We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:

```bash
pip3 install pre-commit
pre-commit install
pre-commit run --all-files
```

- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.

## Run and add unit tests

If you add a new feature or fix a bug, please add corresponding unit tests to ensure coverage and prevent regressions.
SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework.
For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md).

## Write documentation

We recommend new contributors start by writing documentation, which helps you quickly understand the SGLang codebase.
For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
## Test the accuracy

If your code changes the model output, please run the accuracy tests. A quick sanity check is few-shot GSM8K.

```bash
# Launch a server
python3 -m sglang.launch_server --model Qwen/Qwen2-7B-Instruct

# Evaluate
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
```

Please note that the above script is primarily a sanity check, not a rigorous accuracy or speed test.
This test can have significant variance (1%–5%) in accuracy due to batching and the non-deterministic nature of the inference engine.
Also, do not rely on the "Latency/Output throughput" from this script, as it is not a proper speed test.

GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests.
You can find additional accuracy eval examples in:
- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_eval_accuracy_large.py)
- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_gpt_oss_1gpu.py)
## Benchmark the speed

Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md).

## Request a review

You can identify potential reviewers for your code by checking the [code owners](https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and [reviewers](https://github.com/sgl-project/sglang/blob/main/.github/REVIEWERS.md) files.
Another effective strategy is to review the file modification history and contact individuals who have frequently edited the files.
If you modify files protected by code owners, their approval is required to merge the code.

## General code style

- Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function.
- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code.
- Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files.
- Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize all minor overheads as much as possible, especially in the model forward code.
  - A common pattern is runtime checks in the model forward pass (e.g., [this](https://github.com/sgl-project/sglang/blob/f1b0eda55c2c4838e8ab90a0fac7fb1e3d7064ab/python/sglang/srt/models/deepseek_v2.py#L486-L491)). These are very likely the same for every layer. Please cache the result as a single boolean value whenever possible.
- Strive to make functions as pure as possible. Avoid in-place modification of arguments.
- When supporting new hardware or features, follow these guidelines:
  - Do not drastically change existing code.
  - Always prefer new files to introduce specific components for your new hardware (e.g., `allocator_ascend.py`).
  - If you write multiple if/else blocks for new features, ensure the common path (e.g., NVIDIA hardware or the existing code path) is the first branch.
## How to update sgl-kernel

Since sglang and sgl-kernel are separate Python packages, our current GitHub CI infrastructure does not support updating a kernel and using it immediately within the same pull request (PR).
To add a new kernel or modify an existing one in the sgl-kernel package, you must use multiple PRs.

Follow these steps:

1. Submit a PR to update the sgl-kernel source code without using it in the sglang Python package (e.g., [#8884](https://github.com/sgl-project/sglang/pull/8884/files)).
2. Bump the version of sgl-kernel (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)).
   - Once merged, this will trigger an automatic release of the sgl-kernel wheel to PyPI.
   - If it is not urgent, you can wait for other people to release the wheel. A new version is typically released within one week.
3. Apply the changes:
   - Update the sgl-kernel version in `sglang/python/pyproject.toml` to use the modified kernels.
   - Update the related caller code in sglang to use the new kernels.

## Tips for newcomers

If you want to contribute but don't have a specific idea in mind, pick issues labeled ["good first issue" or "help wanted"](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang's workflow.

If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://slack.sglang.ai).

Thank you for your interest in SGLang. Happy coding!
108
docs/developer_guide/development_guide_using_docker.md
Normal file
@@ -0,0 +1,108 @@

# Development Guide Using Docker

## Setup VSCode on a Remote Host

(Optional - you can skip this step if you plan to run the sglang dev container locally.)

1. On the remote host, download the `code` CLI from [https://code.visualstudio.com/download](https://code.visualstudio.com/download) and run `code tunnel` in a shell.

Example:

```bash
wget https://vscode.download.prss.microsoft.com/dbazure/download/stable/fabdb6a30b49f79a7aba0f2ad9df9b399473380f/vscode_cli_alpine_x64_cli.tar.gz
tar xf vscode_cli_alpine_x64_cli.tar.gz

# https://code.visualstudio.com/docs/remote/tunnels
./code tunnel
```

|
||||
2. On your local machine, press F1 in VSCode and choose "Remote Tunnels: Connect to Tunnel".

## Setup Docker Container

### Option 1. Use the default dev container automatically from VSCode

There is a `.devcontainer` folder in the sglang repository root folder that allows VSCode to automatically start up within the dev container. You can read more about this VSCode extension in the official document [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).



(*Figure 1: Diagram from the VSCode official documentation [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).*)

To enable this, you only need to:

1. Start Visual Studio Code and install the [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
2. Press F1, then type and choose "Dev Containers: Open Folder in Container".
3. Input the path of your local `sglang` repo and press enter.

The first time you open the project in the dev container may take longer due to the docker pull and build. Once it succeeds, you should see a badge at the bottom left of your status bar indicating that you are in a dev container:



Now when you run `sglang.launch_server` in the VSCode terminal or start debugging with F5, the sglang server will be started in the dev container with all your local changes applied automatically:



### Option 2. Start up containers manually (advanced)

The following startup command is an example for internal development by the SGLang team. You can **modify or add directory mappings as needed**, especially for model weight downloads, to prevent repeated downloads by different Docker containers.

❗️ **Note on RDMA**

1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them, but keeping them does no harm. Thus, we enable these two flags by default in the commands below.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.

```bash
# Change the name to yours
docker run -itd --shm-size 32g --gpus all -v <volumes-to-mount> --ipc=host --network=host --privileged --name sglang_dev lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_dev /bin/zsh
```

Some useful volumes to mount are:

1. **Huggingface model cache**: mounting the model cache avoids re-downloading models every time docker restarts. The default location on Linux is `~/.cache/huggingface/`.
2. **SGLang repository**: code changes in the local SGLang repository will be automatically synced to the dev container.

Example 1: Mounting the local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the dev container.

```bash
docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```

Example 2: Mounting both the HuggingFace cache and the local SGLang repo. Local code changes are automatically synced to the dev container because SGLang is installed in editable mode in the dev image.

```bash
docker run -itd --shm-size 32g --gpus all -v $HOME/.cache/huggingface/:/root/.cache/huggingface -v $HOME/src/sglang:/sgl-workspace/sglang --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```

## Debug SGLang with VSCode Debugger

1. Open `launch.json` in VSCode (create it if it does not exist).
2. Add the following config and save. Note that you can edit the script as needed to apply different parameters or debug a different program (e.g., a benchmark script).

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: launch_server",
            "type": "debugpy",
            "request": "launch",
            "module": "sglang.launch_server",
            "console": "integratedTerminal",
            "args": [
                "--model-path", "meta-llama/Llama-3.2-1B",
                "--host", "0.0.0.0",
                "--port", "30000",
                "--trust-remote-code"
            ],
            "justMyCode": false
        }
    ]
}
```

3. Press F5 to start. The VSCode debugger will ensure that the program pauses at breakpoints even when it runs on a remote SSH/Tunnel host plus a dev container.

## Profile

```bash
# Change batch size, input, output and add `--disable-cuda-graph` (for easier analysis)
# e.g. DeepSeek V3
nsys profile -o deepseek_v3 python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --disable-cuda-graph
```

## Evaluation

```bash
# e.g. gsm8k 8 shot
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
```

18
docs/developer_guide/release_process.md
Normal file
@@ -0,0 +1,18 @@

# PyPI Package Release Process

## Update the version in code

Update the package version in `python/pyproject.toml` and `python/sglang/__init__.py`.

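Because the version string lives in two files, it is easy for them to drift apart. Below is a minimal, hypothetical sanity check (the regexes and sample contents are assumptions for illustration, not part of the release tooling):

```python
import re


def extract_versions(pyproject_text: str, init_text: str) -> tuple:
    """Pull the version strings out of pyproject.toml and __init__.py contents."""
    pyproject_version = re.search(r'version\s*=\s*"([^"]+)"', pyproject_text).group(1)
    init_version = re.search(r'__version__\s*=\s*"([^"]+)"', init_text).group(1)
    return pyproject_version, init_version


# Example contents standing in for python/pyproject.toml and python/sglang/__init__.py
pyproject = 'version = "0.5.2"'
init = '__version__ = "0.5.2"'
a, b = extract_versions(pyproject, init)
assert a == b, f"version mismatch: {a} != {b}"
print(a)
```

Running a check like this before tagging a release catches a forgotten edit in one of the two files.
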
## Upload the PyPI package

```
pip install build twine
```

```
cd python
bash upload_pypi.sh
```

## Make a release in GitHub

Make a new release at https://github.com/sgl-project/sglang/releases/new.

49
docs/developer_guide/setup_github_runner.md
Normal file
@@ -0,0 +1,49 @@

# Set Up Self-Hosted Runners for GitHub Actions

## Add a Runner

### Step 1: Start a docker container

You can mount a folder for the shared huggingface model weights cache. The commands below use `/tmp/huggingface` as an example.

```
docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
# Nvidia
docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 /bin/bash
# AMD
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.0rc1-rocm630 /bin/bash
# AMD, just the last 2 GPUs
docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.0rc1-rocm630 /bin/bash
```

### Step 2: Configure the runner by `config.sh`

Run these commands inside the container.

```
apt update && apt install -y curl python3-pip git
export RUNNER_ALLOW_RUNASROOT=1
```

Then follow https://github.com/sgl-project/sglang/settings/actions/runners/new?arch=x64&os=linux to run `config.sh`.

**Notes**
- There is no need to specify a runner group.
- Give it a name (e.g., `test-sgl-gpu-0`) and some labels (e.g., `1-gpu-runner`). The labels can be edited later in GitHub Settings.
- There is no need to change the work folder.

### Step 3: Run the runner by `run.sh`

- Set up environment variables

```
export HF_HOME=/hf_home
export SGLANG_IS_IN_CI=true
export HF_TOKEN=hf_xxx
export OPENAI_API_KEY=sk-xxx
export CUDA_VISIBLE_DEVICES=0
```

- Run it forever

```
while true; do ./run.sh; echo "Restarting..."; sleep 2; done
```