Refactor the docs (#9031)

This commit is contained in:
Lianmin Zheng
2025-08-10 19:49:45 -07:00
committed by GitHub
parent 0f229c07f1
commit 2449a0afe2
80 changed files with 619 additions and 750 deletions

2
.github/CODEOWNERS vendored
View File

@@ -10,7 +10,7 @@
/python/sglang/srt/eplb @fzyzcjy
/python/sglang/srt/function_call @CatherineSue
/python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock @HaiShaw @ch-wan @BBuf @kushanam @Edwardf0t1
/python/sglang/srt/lora @Ying1123 @Fridge003
/python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang
/python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
/python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
/python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock

50
.github/REVIEWERS.md vendored Normal file
View File

@@ -0,0 +1,50 @@
# Area Reviewer
Here are some reviewers for common areas. You can ping them to review your code if you touch related parts.
## Hardware platforms
- general @Alcanderian
- AMD GPU @HaiShaw
- Blackwell GPU @kushanam @trevor-m @zhyncs
- CPU @mingfeima
## Kernel
- general @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @HaiShaw
- triton attention backend @ispobock
- flash attention @hebiao064
## Scheduler and memory pool
- general @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
- constrained decoding @hnyls2002
- hierarchical cache @xiezhq-hermann @DarkSharpness
- lora @Fridge003 @Ying1123 @lifuhuang
- speculative decoding @merrymercy @Ying1123 @kssteven418
- sliding window attention @hanming-lu
## Parallelism
- expert parallelism @fzyzcjy @ch-wan
- data parallelism attention @ch-wan
- pipeline parallelism @Ying1123
- tensor parallelism @merrymercy
## PD disaggregation
- general @ByronHsu @ShangmingCai @hnyls2002
- Mooncake backend @ShangmingCai
## Build and release
- general @zhyncs @merrymercy
## API Server
- general @CatherineSue @slin1237 @ispobock
- function calling and reasoning parsing @CatherineSue
- OpenAI API @CatherineSue @slin1237
## SGL-Router
- general @slin1237 @ByronHsu
## Model
- multimodal models @mickqian @JustinTong0323
- other new models @zhaochenyang20
## Reinforcement learning
- general @zhaochenyang20 @hebiao064 @fzyzcjy @zhuzilin

View File

@@ -1,26 +1,24 @@
<!-- Thank you for your contribution! We appreciate it. The following guidelines will help improve your pull request and facilitate feedback. If anything is unclear, don't hesitate to submit your pull request and ask the maintainers for assistance. -->
<!-- Thank you for your contribution! Please follow these guidelines to enhance your pull request. If anything is unclear, submit your PR and reach out to maintainers for assistance. Join our Slack community at https://slack.sglang.ai to discuss further. -->
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve. -->
<!-- Describe the purpose and goals of this pull request. -->
## Modifications
<!-- Describe the changes made in this PR. -->
<!-- Detail the changes made in this pull request. -->
## Accuracy Test
## Accuracy Tests
<!-- If this PR affects model-side code (e.g., kernels, model architecture), please provide accuracy test results. Ref: https://docs.sglang.ai/references/accuracy_evaluation.html -->
<!-- If this pull request affects model outputs (e.g., changes to the kernel or model forward code), provide accuracy test results. -->
## Benchmark & Profiling
## Benchmarking and Profiling
<!-- If this PR is expected to impact performance, please provide benchmark and profiling results. Ref: https://docs.sglang.ai/references/benchmark_and_profiling.html -->
<!-- If this pull request impacts inference speed, provide benchmarking and profiling results. -->
## Checklist
- [ ] Format your code according to the [Code Formatting with Pre-Commit](https://docs.sglang.ai/references/contribution_guide.html#code-formatting-with-pre-commit).
- [ ] Add unit tests as outlined in the [Running Unit Tests](https://docs.sglang.ai/references/contribution_guide.html#running-unit-tests-adding-to-ci).
- [ ] Update documentation / docstrings / example tutorials as needed, according to [Writing Documentation](https://docs.sglang.ai/references/contribution_guide.html#writing-documentation-running-docs-ci).
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to [Benchmark and Profiling](https://docs.sglang.ai/references/benchmark_and_profiling.html) and [Accuracy Results](https://docs.sglang.ai/references/accuracy_evaluation.html).
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
- [ ] Format your code according to the [Code formatting with pre-commit](https://docs.sglang.ai/references/contribution_guide.html#code-formatting-with-pre-commit).
- [ ] Add unit tests according to the [Running and adding unit tests](https://docs.sglang.ai/references/contribution_guide.html#running-unit-tests-adding-to-ci).
- [ ] Update documentation according to [Writing documentation](https://docs.sglang.ai/references/contribution_guide.html#writing-documentation-running-docs-ci).
- [ ] Provide accuracy and speed benchmark results according to [Testing the accuracy](https://docs.sglang.ai/references/contribution_guide.html#testing-the-accuracy) and [Benchmark and profiling](https://docs.sglang.ai/references/benchmark_and_profiling.html).

View File

@@ -41,6 +41,7 @@ jobs:
make compile
- name: Push HTML to sgl-project.github.io
timeout-minutes: 60
env:
GITHUB_TOKEN: ${{ secrets.DOCUMENTATION_PAT_TOKEN }}
run: |

View File

@@ -42,7 +42,7 @@ repos:
exclude: |
(?x)^(
test/srt/test_reasoning_parser\.py|
docs/backend/vlm_query\.ipynb
docs/advanced_features/vlm_query\.ipynb
)$
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v18.1.8

View File

@@ -40,8 +40,9 @@ compile:
# Serve documentation with auto-build and live reload
serve:
@echo "Starting auto-build server at http://localhost:$(PORT)"
@echo "Starting auto-build server at http://0.0.0.0:$(PORT)"
@$(SPHINXAUTOBUILD) "$(SOURCEDIR)" "$(BUILDDIR)/html" \
--host 0.0.0.0 \
--port $(PORT) \
--watch $(SOURCEDIR) \
--re-ignore ".*\.(ipynb_checkpoints|pyc|pyo|pyd|git)"

View File

@@ -1,12 +1,14 @@
# SGLang Documentation
We recommend new contributors start by writing documentation, which helps you quickly understand the SGLang codebase. Most documentation files are located under the `docs/` folder. We prefer **Jupyter Notebooks** over Markdown so that all examples can be executed and validated by our docs CI pipeline.
We recommend new contributors start by writing documentation, which helps you quickly understand the SGLang codebase.
Most documentation files are located under the `docs/` folder.
## Docs Workflow
### Install Dependency
```bash
apt-get update && apt-get install -y pandoc parallel retry
pip install -r requirements.txt
```
@@ -15,11 +17,11 @@ pip install -r requirements.txt
Update your Jupyter notebooks in the appropriate subdirectories under `docs/`. If you add new files, remember to update `index.rst` (or relevant `.rst` files) accordingly.
- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.
```bash
# 1) Compile all Jupyter notebooks
make compile
make compile # This step can take a long time (10+ minutes). You can skip it if you are confident your added files are correct.
make html
# 2) Compile and Preview documentation locally with auto-build
# This will automatically rebuild docs when files change
@@ -43,68 +45,11 @@ pre-commit run --all-files
```
---
### **Port Allocation and CI Efficiency**
## Documentation Style Guidelines
**To launch and kill the server:**
```python
from sglang.test.test_utils import is_in_ci
from sglang.utils import wait_for_server, print_highlight, terminate_process
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
server_process, port = launch_server_cmd(
"""
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0
"""
)
wait_for_server(f"http://localhost:{port}")
# Terminate Server
terminate_process(server_process)
```
**To launch and kill the engine:**
```python
# Launch Engine
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci
if is_in_ci():
import patch
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
# Terminate Engine
llm.shutdown()
```
### **Why this approach?**
- **Dynamic Port Allocation**: Avoids port conflicts by selecting an available port at runtime, enabling multiple server instances to run in parallel.
- **Optimized for CI**: The `patch` version of `launch_server_cmd` and `sgl.Engine()` in CI environments helps manage GPU memory dynamically, preventing conflicts and improving test parallelism.
- **Better Parallel Execution**: Ensures smooth concurrent tests by avoiding fixed port collisions and optimizing memory usage.
### **Model Selection**
For demonstrations in the docs, **prefer smaller models** to reduce memory consumption and speed up inference. Running larger models in CI can lead to instability due to memory constraints.
### **Prompt Alignment Example**
When designing prompts, ensure they align with SGLang's structured formatting. For example:
```python
prompt = """You are an AI assistant. Answer concisely and accurately.
User: What is the capital of France?
Assistant: The capital of France is Paris."""
```
This keeps responses aligned with expected behavior and improves reliability across different files.
- For common functionalities, we prefer **Jupyter Notebooks** over Markdown so that all examples can be executed and validated by our docs CI pipeline. For complex features (e.g., distributed serving), Markdown is preferred.
- Keep the documentation execution time in mind when writing interactive Jupyter notebooks. Each interactive notebook is run and compiled on every commit to ensure it stays runnable, so apply the following tips to reduce compilation time:
- Use small models (e.g., `qwen/qwen2.5-0.5b-instruct`) for most cases to reduce server launch time.
- Reuse the launched server as much as possible to reduce server launch time.
- Do not use absolute links (e.g., `https://docs.sglang.ai/get_started/install.html`). Always prefer relative links (e.g., `../get_started/install.md`).
- Follow the existing examples to learn how to launch a server, send a query, and handle other common patterns; a minimal sketch of this pattern follows below.
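For reference, here is a minimal sketch of the launch/teardown pattern used by the refactored notebooks in this PR; the model path and flags are only illustrative.

```python
# Minimal sketch of the notebook launch/teardown pattern; model and flags are illustrative.
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")

# ... run the notebook's example queries here ...

terminate_process(server_process)
```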

View File

@@ -1,5 +1,8 @@
# Attention Backend
SGLang supports multiple attention backends. Each of them has different pros and cons.
You can test them according to your needs.
## Support matrix for different attention backends
| **Backend** | **Page Size > 1** | **Spec Decoding** | **MLA** | **Sliding Window** | **MultiModal** |
@@ -7,10 +10,10 @@
| **FlashInfer** | ❌ | ✅ | ✅ | ✅ | ✅ |
| **FA3** | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Triton** | ❌ | ✅ | ✅ | ✅ | ❌ |
| **Torch Native** | ❌ | ❌ | | ❌ | ❌ |
| **Torch Native** | ❌ | ❌ | | ❌ | ❌ |
| **FlashMLA** | ✅ | ✅ | ✅ | ❌ | ❌ |
| **TRTLLM MLA** | ✅ | ❌ | ✅ | ✅ | ❌ |
| **Ascend** | ✅ | ❌ | | ❌ | ❌ |
| **Ascend** | ✅ | ❌ | | ❌ | ❌ |
**Notes:**
- TRTLLM MLA only implements decode operations. For prefill operations (including multimodal inputs), it falls back to FlashInfer MLA backend.
@@ -21,7 +24,7 @@ The "❌" and "✅" symbols in the table above under "Page Size > 1" indicate wh
## User guide
#### Launch command for different attention backends.
### Launch command for different attention backends.
- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
```bash

View File

@@ -29,18 +29,10 @@
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import json\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
" import nest_asyncio\n",
"\n",
" nest_asyncio.apply()\n",
"from openai import OpenAI\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0\" # qwen25\n",
@@ -304,7 +296,7 @@
"metadata": {},
"source": [
"\n",
"## Execute the Tool"
"### Execute the Tool"
]
},
{
@@ -389,17 +381,8 @@
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import json\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
" import nest_asyncio\n",
"\n",
" nest_asyncio.apply()\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"\n",
"# Start a new server session for tool choice examples\n",
"server_process_tool_choice, port_tool_choice = launch_server_cmd(\n",
@@ -498,6 +481,15 @@
" print(f\"Arguments: {tool_call.function.arguments}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process_tool_choice)"
]
},
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -52,11 +52,11 @@ Note that CUDA graph consumes more memory, so you may need to reduce `--mem-frac
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput. Refer to [sglang router](../router/router.md) for a better data parallelism implementation than the `dp_size` parameter.
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput. Refer to [sglang router](../advanced_features/router.md) for a better data parallelism implementation than the `dp_size` parameter.
### Try other options
- `torch.compile` accelerates small models on small batch sizes. You can enable it with `--enable-torch-compile` (a combined launch sketch is shown after this list).
- Try other quantization (e.g. FP8 quantization with `--quantization fp8`)
- Try other parallelism strategies (e.g. expert parallelism) or DP attention for deepseek models (with `--enable-dp-attention --dp-size 8`).
- Try other parallelism strategies (e.g. [expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/)) or DP attention for deepseek models (with `--enable-dp-attention --dp-size 8`).
- If the workload has many shared prefixes, try `--schedule-policy lpm`. Here, `lpm` stands for longest prefix match. It reorders requests to encourage more cache hits but introduces more scheduling overhead.
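As a hedged illustration, a few of the options above can be combined in a single launch via the documented helper; the flag combination below is only an example, not a recommendation.

```python
# A hedged sketch combining a few of the options above; tune the flags for your own workload.
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct "
    "--enable-torch-compile --quantization fp8 --schedule-policy lpm"
)
wait_for_server(f"http://localhost:{port}")
terminate_process(server_process)
```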

View File

@@ -61,17 +61,11 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.utils import wait_for_server, terminate_process\n",
"\n",
"import json\n",
"import requests"
"import requests\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, terminate_process"
]
},
{

View File

@@ -0,0 +1,35 @@
# Observability
## Production Metrics
SGLang exposes the following metrics via Prometheus. You can enable them by adding `--enable-metrics` when launching the server.
You can query them by:
```
curl http://localhost:30000/metrics
```
See [Production Metrics](../references/production_metrics.md) for more details.
## Logging
By default, SGLang does not log any request contents. You can log them by using `--log-requests`.
You can control the verbosity by using `--log-request-level`.
See [Logging](server_arguments.md#logging) for more details.
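A minimal sketch combining the metrics and logging flags above; the model and the verbosity value are illustrative, so check `--help` for the accepted levels.

```python
# Minimal sketch: enable Prometheus metrics and request logging, then read /metrics.
import requests

from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct "
    "--enable-metrics --log-requests --log-request-level 2"  # verbosity value is illustrative
)
wait_for_server(f"http://localhost:{port}")
print(requests.get(f"http://localhost:{port}/metrics").text[:800])  # first metric lines
terminate_process(server_process)
```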
## Request Dump and Replay
You can dump all requests and replay them later for benchmarking or other purposes.
To start dumping, use the following command to send a request to a server:
```
python3 -m sglang.srt.managers.configure_logging --url http://localhost:30000 --dump-requests-folder /tmp/sglang_request_dump --dump-requests-threshold 100
```
The server will dump the requests into a pickle file once every 100 requests.
To replay the request dump, use `scripts/playground/replay_request_dump.py`.
## Crash Dump and Replay
Sometimes the server might crash, and you may want to debug the cause of the crash.
SGLang supports crash dumping, which will dump all requests from the 5 minutes before the crash, allowing you to replay the requests and debug the reason later.
To enable crash dumping, use `--crash-dump-folder /tmp/crash_dump`.
To replay the crash dump, use `scripts/playground/replay_request_dump.py`.
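A minimal sketch of launching with crash dumping enabled; the model path and dump folder are illustrative.

```python
# Minimal sketch: enable crash dumping so recent requests can be replayed after a crash.
from sglang.utils import launch_server_cmd, wait_for_server

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct "
    "--crash-dump-folder /tmp/crash_dump"
)
wait_for_server(f"http://localhost:{port}")
```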

View File

@@ -56,16 +56,9 @@
"source": [
"import requests\n",
"from openai import OpenAI\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1\"\n",
")\n",

View File

@@ -38,7 +38,7 @@ You can find all arguments by `python3 -m sglang.launch_server --help`
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](custom_chat_template.md).
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph`
```bash

View File

@@ -7,7 +7,6 @@
"# Speculative Decoding\n",
"\n",
"SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.\n",
"**Note:** Currently, Speculative Decoding in SGLang is compatible with radix cache and chunked prefill.\n",
"\n",
"### Performance Highlights\n",
"\n",
@@ -18,7 +17,7 @@
"|--------|----------------|\n",
"| SGLang (w/o speculative, 1x H100) | 158.34 tokens/s |\n",
"| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/s |\n",
"| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |\n"
"| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |"
]
},
{
@@ -30,12 +29,14 @@
"To enable EAGLE speculative decoding the following parameters are relevant:\n",
"* `speculative_draft_model_path`: Specifies draft model. This parameter is required.\n",
"* `speculative_num_steps`: Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. Default is 5.\n",
"\n",
"* `speculative_eagle_topk`: Branching factor per step. Improves candidate diversity, will lead to higher acceptance rate, but more lead to higher memory/compute consumption. Default is 4.\n",
"\n",
"* `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation but will lead to higher GPU memory usage. Default is 8.\n",
"\n",
"These parameters are the same for EAGLE-2 and EAGLE-3."
"These parameters are the same for EAGLE-2 and EAGLE-3.\n",
"\n",
"You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).\n",
"\n",
"In the documentation below, we set `--cuda-graph-max-bs` to be a small value for faster engine startup. For your own workloads, please tune the above parameters together with `--cuda-graph-max-bs`, `--max-running-requests`, `--mem-fraction-static` for the best performance. "
]
},
{
@@ -53,13 +54,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"import openai"

View File

@@ -15,8 +15,8 @@
"\n",
"SGLang supports three grammar backends:\n",
"\n",
"- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.\n",
"- [XGrammar](https://github.com/mlc-ai/xgrammar)(default): Supports JSON schema, regular expression, and EBNF constraints.\n",
"- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.\n",
"- [Llguidance](https://github.com/guidance-ai/llguidance): Supports JSON schema, regular expression, and EBNF constraints.\n",
"\n",
"We suggest using XGrammar for its better performance and utility. XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). For more details, see [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",
@@ -43,13 +43,8 @@
"source": [
"import openai\n",
"import os\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",

View File

@@ -39,13 +39,8 @@
"source": [
"import openai\n",
"import os\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",

View File

@@ -5,13 +5,21 @@
"id": "0",
"metadata": {},
"source": [
"# Querying Qwen-VL"
"# Query Vision Language Model"
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {},
"source": [
"## Querying Qwen-VL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1",
"id": "2",
"metadata": {},
"outputs": [],
"source": [
@@ -26,7 +34,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "2",
"id": "3",
"metadata": {},
"outputs": [
{
@@ -61,7 +69,6 @@
"import requests\n",
"from PIL import Image\n",
"\n",
"from sglang.srt.entrypoints.openai.protocol import ChatCompletionRequest\n",
"from sglang.srt.conversation import chat_templates\n",
"\n",
"image = Image.open(\n",
@@ -83,16 +90,16 @@
},
{
"cell_type": "markdown",
"id": "3",
"id": "4",
"metadata": {},
"source": [
"## Query via the offline Engine API"
"### Query via the offline Engine API"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4",
"id": "5",
"metadata": {},
"outputs": [
{
@@ -121,7 +128,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5",
"id": "6",
"metadata": {},
"outputs": [
{
@@ -139,16 +146,16 @@
},
{
"cell_type": "markdown",
"id": "6",
"id": "7",
"metadata": {},
"source": [
"## Query via the offline Engine API, but send precomputed embeddings"
"### Query via the offline Engine API, but send precomputed embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"id": "8",
"metadata": {},
"outputs": [
{
@@ -181,7 +188,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "8",
"id": "9",
"metadata": {},
"outputs": [
{
@@ -212,16 +219,16 @@
},
{
"cell_type": "markdown",
"id": "9",
"id": "10",
"metadata": {},
"source": [
"# Querying Llama 4 (Vision)"
"## Querying Llama 4 (Vision)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10",
"id": "11",
"metadata": {},
"outputs": [],
"source": [
@@ -236,7 +243,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "11",
"id": "12",
"metadata": {},
"outputs": [
{
@@ -271,7 +278,6 @@
"import requests\n",
"from PIL import Image\n",
"\n",
"from sglang.srt.entrypoints.openai.protocol import ChatCompletionRequest\n",
"from sglang.srt.conversation import chat_templates\n",
"\n",
"image = Image.open(\n",
@@ -295,16 +301,16 @@
},
{
"cell_type": "markdown",
"id": "12",
"id": "13",
"metadata": {},
"source": [
"## Query via the offline Engine API"
"### Query via the offline Engine API"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13",
"id": "14",
"metadata": {},
"outputs": [
{
@@ -416,7 +422,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"id": "15",
"metadata": {},
"outputs": [
{
@@ -435,16 +441,16 @@
},
{
"cell_type": "markdown",
"id": "15",
"id": "16",
"metadata": {},
"source": [
"## Query via the offline Engine API, but send precomputed embeddings"
"### Query via the offline Engine API, but send precomputed embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16",
"id": "17",
"metadata": {},
"outputs": [
{
@@ -480,7 +486,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "17",
"id": "18",
"metadata": {},
"outputs": [
{

View File

@@ -57,22 +57,20 @@ To run DeepSeek V3/R1 models, the requirements are as follows:
Detailed commands for reference:
- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended)
- [8 x MI300X](https://docs.sglang.ai/references/amd.html#running-deepseek-v3)
- [8 x MI300X](../platforms/amd_gpu.md#running-deepseek-v3)
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
- [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
- [32 x L40S (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
- [Xeon 6980P CPU](https://docs.sglang.ai/references/cpu.html#example-running-deepseek-r1)
- [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1)
### Download Weights
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to the official [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) guide to download the weights.
### Caching `torch.compile`
The DeepSeek series has huge model weights, so it takes some time to compile the model with `torch.compile` the first time if you have added the `--enable-torch-compile` flag. You can refer [here](https://docs.sglang.ai/backend/hyperparameter_tuning.html#try-advanced-options) to optimize the caching of compilation results so that the cache can be used to speed up the next startup.
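One common approach is to point the TorchInductor cache at a persistent directory before launching, so later startups can reuse the compiled artifacts. This is only a hedged sketch: the environment variable, path, and flags below are assumptions, not the exact steps from the linked guide.

```python
# Hedged sketch: persist the torch.compile cache across restarts (path and flags are illustrative).
import os

from sglang.utils import launch_server_cmd, wait_for_server

os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/data/torch_compile_cache"  # assumed persistent path

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 "
    "--tp 8 --trust-remote-code --enable-torch-compile"
)
wait_for_server(f"http://localhost:{port}")
```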
### Launch with one node of 8 x H200
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8.** So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch).
**Note that Deepseek V3 is already in FP8**, so we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
### Running examples on Multi-node
@@ -221,7 +219,6 @@ Important Notes:
2. To receive more consistent tool call results, it is recommended to use `--chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja`. It provides an improved unified prompt.
## FAQ
**Q: Model loading is taking too long, and I'm encountering an NCCL timeout. What should I do?**

View File

@@ -0,0 +1,3 @@
# GPT OSS Usage
Please refer to [https://github.com/sgl-project/sglang/issues/8833](https://github.com/sgl-project/sglang/issues/8833).

View File

@@ -6,7 +6,7 @@
"source": [
"# SGLang Native APIs\n",
"\n",
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce these following APIs:\n",
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:\n",
"\n",
"- `/generate` (text generation model)\n",
"- `/get_model_info`\n",
@@ -21,8 +21,9 @@
"- `/start_expert_distribution_record`\n",
"- `/stop_expert_distribution_record`\n",
"- `/dump_expert_distribution_record`\n",
"- A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py)\n",
"\n",
"We mainly use `requests` to test these APIs in the following examples. You can also use `curl`."
"We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.\n"
]
},
{
@@ -38,24 +39,12 @@
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\"\n",
")\n",
"## To run qwen2.5-0.5b-instruct model on the Ascend-Npu, you can execute the following command:\n",
"# server_process, port = launch_server_cmd(\n",
"# \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --device npu --tp 2 --attention-backend torch_native\"\n",
"# )\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
@@ -65,7 +54,7 @@
"metadata": {},
"source": [
"## Generate (text generation model)\n",
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](./sampling_params.md)."
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](sampling_params.md)."
]
},
{
@@ -74,6 +63,8 @@
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = f\"http://localhost:{port}/generate\"\n",
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
@@ -81,11 +72,6 @@
"print_highlight(response.json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -141,8 +127,6 @@
"metadata": {},
"outputs": [],
"source": [
"# get_server_info\n",
"\n",
"url = f\"http://localhost:{port}/get_server_info\"\n",
"\n",
"response = requests.get(url)\n",
@@ -197,8 +181,6 @@
"metadata": {},
"outputs": [],
"source": [
"# flush cache\n",
"\n",
"url = f\"http://localhost:{port}/flush_cache\"\n",
"\n",
"response = requests.post(url)\n",
@@ -270,7 +252,7 @@
"source": [
"## Encode (embedding model)\n",
"\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.html#openai-apis-embedding) and will raise an error for generation models.\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.ipynb) and will raise an error for generation models.\n",
"Therefore, we launch a new server to server an embedding model."
]
},

View File

@@ -64,25 +64,11 @@
"source": [
"# launch the offline engine\n",
"import asyncio\n",
"import io\n",
"import os\n",
"\n",
"from PIL import Image\n",
"import requests\n",
"import sglang as sgl\n",
"\n",
"from sglang.srt.conversation import chat_templates\n",
"from sglang.test.test_utils import is_in_ci\n",
"import sglang.test.doc_patch\n",
"from sglang.utils import async_stream_and_merge, stream_and_merge\n",
"\n",
"if is_in_ci():\n",
" import patch\n",
"else:\n",
" import nest_asyncio\n",
"\n",
" nest_asyncio.apply()\n",
"\n",
"\n",
"llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")"
]
},

View File

@@ -0,0 +1,9 @@
OpenAI-Compatible APIs
======================
.. toctree::
:maxdepth: 1
openai_api_completions.ipynb
openai_api_vision.ipynb
openai_api_embeddings.ipynb

View File

@@ -14,7 +14,7 @@
"- `chat/completions`\n",
"- `completions`\n",
"\n",
"Check out other tutorials to learn about [vision APIs](https://docs.sglang.ai/backend/openai_api_vision.html) for vision-language models and [embedding APIs](https://docs.sglang.ai/backend/openai_api_embeddings.html) for embedding models."
"Check out other tutorials to learn about [vision APIs](openai_api_vision.ipynb) for vision-language models and [embedding APIs](openai_api_embeddings.ipynb) for embedding models."
]
},
{
@@ -32,18 +32,11 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --mem-fraction-static 0.8\"\n",
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
@@ -93,9 +86,69 @@
"\n",
"The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
"\n",
"SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor.\n",
"SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
" ],\n",
" temperature=0.3, # Lower temperature for more focused responses\n",
" max_tokens=128, # Reasonable length for a concise response\n",
" top_p=0.95, # Slightly higher for better fluency\n",
" presence_penalty=0.2, # Mild penalty to avoid repetition\n",
" frequency_penalty=0.2, # Mild penalty for more natural language\n",
" n=1, # Single response is usually more stable\n",
" seed=42, # Keep for reproducibility\n",
")\n",
"\n",
"#### Enabling Model Thinking/Reasoning\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Streaming mode is also supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stream = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
" stream=True,\n",
")\n",
"for chunk in stream:\n",
" if chunk.choices[0].delta.content is not None:\n",
" print(chunk.choices[0].delta.content, end=\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enabling Model Thinking/Reasoning\n",
"\n",
"You can use `chat_template_kwargs` to enable or disable the model's internal thinking or reasoning process output. Set `\"enable_thinking\": True` within `chat_template_kwargs` to include the reasoning steps in the response. This requires launching the server with a compatible reasoning parser.\n",
"\n",
@@ -160,61 +213,6 @@
"Here is an example of a detailed chat completion request using standard OpenAI parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
" ],\n",
" temperature=0.3, # Lower temperature for more focused responses\n",
" max_tokens=128, # Reasonable length for a concise response\n",
" top_p=0.95, # Slightly higher for better fluency\n",
" presence_penalty=0.2, # Mild penalty to avoid repetition\n",
" frequency_penalty=0.2, # Mild penalty for more natural language\n",
" n=1, # Single response is usually more stable\n",
" seed=42, # Keep for reproducibility\n",
")\n",
"\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Streaming mode is also supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stream = client.chat.completions.create(\n",
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
" messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
" stream=True,\n",
")\n",
"for chunk in stream:\n",
" if chunk.choices[0].delta.content is not None:\n",
" print(chunk.choices[0].delta.content, end=\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -282,7 +280,7 @@
"source": [
"## Structured Outputs (JSON, Regex, EBNF)\n",
"\n",
"For OpenAI compatible structured outputs API, refer to [Structured Outputs](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API) for more details.\n"
"For OpenAI compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n"
]
},
{

View File

@@ -9,7 +9,7 @@
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n",
"\n",
"This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](https://docs.sglang.ai/supported_models/embedding_models.html)\n"
"This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](../supported_models/embedding_models.md)\n"
]
},
{
@@ -27,13 +27,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"embedding_process, port = launch_server_cmd(\n",

View File

@@ -10,7 +10,7 @@
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n",
"This tutorial covers the vision APIs for vision language models.\n",
"\n",
"SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](https://docs.sglang.ai/supported_models/multimodal_language_models).\n",
"SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/multimodal_language_models.md).\n",
"\n",
"As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py)."
]
@@ -30,13 +30,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"vision_process, port = launch_server_cmd(\n",

View File

@@ -1,7 +1,6 @@
# Sampling Parameters
This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](./openai_api_completions.ipynb).
## `/generate` Endpoint
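For illustration, here is a minimal sketch of calling this endpoint; the port and sampling values are arbitrary.

```python
# Minimal sketch of a /generate request; the port and sampling values are illustrative.
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "What is the capital of France?",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```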
@@ -53,7 +52,7 @@ The object is defined at `sampling_params.py::SamplingParams`. You can also read
### Constrained decoding
Please refer to our dedicated guide on [constrained decoding](./structured_outputs.ipynb) for the following parameters.
Please refer to our dedicated guide on [constrained decoding](../advanced_features/structured_outputs.ipynb) for the following parameters.
| Argument | Type/Default | Description |
|-----------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -185,9 +184,9 @@ You can specify a JSON schema, regular expression or [EBNF](https://en.wikipedia
SGLang supports two grammar backends:
- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.
- [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints.
- XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.
If instead you want to initialize the Outlines backend, you can use `--grammar-backend outlines` flag:
@@ -252,7 +251,7 @@ response = requests.post(
print(response.json())
```
Detailed example in [structured outputs](./structured_outputs.ipynb).
Detailed example in [structured outputs](../advanced_features/structured_outputs.ipynb).
### Custom logit processor

View File

@@ -7,9 +7,9 @@
"# Sending Requests\n",
"This notebook provides a quick-start guide to use SGLang in chat completions after installation.\n",
"\n",
"- For Vision Language Models, see [OpenAI APIs - Vision](../backend/openai_api_vision.ipynb).\n",
"- For Embedding Models, see [OpenAI APIs - Embedding](../backend/openai_api_embeddings.ipynb) and [Encode (embedding model)](../backend/native_api.html#Encode-(embedding-model)).\n",
"- For Reward Models, see [Classify (reward model)](../backend/native_api.html#Classify-(reward-model))."
"- For Vision Language Models, see [OpenAI APIs - Vision](openai_api_vision.ipynb).\n",
"- For Embedding Models, see [OpenAI APIs - Embedding](openai_api_embeddings.ipynb) and [Encode (embedding model)](native_api.html#Encode-(embedding-model)).\n",
"- For Reward Models, see [Classify (reward model)](native_api.html#Classify-(reward-model))."
]
},
{
@@ -25,16 +25,10 @@
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"# This is equivalent to running the following command in your terminal\n",
"\n",
"# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n",
"\n",
"server_process, port = launch_server_cmd(\n",

View File

@@ -30,62 +30,76 @@
[Pytorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool to inspect kernel execution time, call stack, and kernel overlap and occupancy.
- To profile a server
### Profile a server with `sglang.bench_serving`
```bash
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
```bash
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
# send profiling request from client
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
```
# send profiling request from client
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
```
Please make sure that `SGLANG_TORCH_PROFILER_DIR` is set on both the server and client side; otherwise, the trace file cannot be generated correctly. A reliable way is to set `SGLANG_TORCH_PROFILER_DIR` in your shell's `.*rc` file (e.g., `~/.bashrc` for bash shells).
Please make sure that `SGLANG_TORCH_PROFILER_DIR` is set on both the server and client side; otherwise, the trace file cannot be generated correctly. A reliable way is to set `SGLANG_TORCH_PROFILER_DIR` in your shell's `.*rc` file (e.g., `~/.bashrc` for bash shells).
### Profile a server with `sglang.bench_offline_throughput`
```bash
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
- To profile offline
```bash
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
# profile one batch with bench_one_batch.py
# batch size can be controlled with --batch argument
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile
# profile one batch with bench_one_batch.py
# batch size can be controlled with --batch argument
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile
# profile multiple batches with bench_offline_throughput.py
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
# profile multiple batches with bench_offline_throughput.py
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
### Profile a server with `sglang.profiler`
- Possible PyTorch Bug
If you encounter the following error (for example, when using Qwen 2.5 VL):
```bash
RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.
```
This is likely a PyTorch bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you may disable `with_stack` with an environment variable, for example:
```bash
export SGLANG_PROFILE_WITH_STACK=False
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
When the server is running (e.g., processing a decoding request), you can start live profiling immediately by sending a profile request to the server.
- View Traces
You can do this by running `python3 -m sglang.profiler`. For example:
Trace files can be loaded and visualized from:
```
# Terminal 1: Send a generation request
python3 -m sglang.test.send_one
1. https://ui.perfetto.dev/ (any browser)
2. chrome://tracing (Chrome browser only)
# Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal.
# It will generate a profile of the above request for several decoding batches.
python3 -m sglang.profiler
```
If the browser cannot open the trace file due to its large size,
the client can generate a small trace file (<100 MB) by controlling the number of prompts and the length of the outputs.
For example, when profiling a server,
### Possible PyTorch bugs
If you encounter the following error (for example, when using Qwen 2.5 VL):
```bash
RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.
```
This is likely a PyTorch bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you may disable `with_stack` with an environment variable, for example:
```bash
export SGLANG_PROFILE_WITH_STACK=False
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
```bash
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
```
### View traces
This command sets the number of prompts to 2 with the `--num-prompts` argument and limits the length of output sequences to 100 with the `--sharegpt-output-len` argument, producing a small trace file that the browser can open smoothly.
Trace files can be loaded and visualized from:
Additionally, if you want to trace a CUDA kernel back to the SGLang Python source code, you need to disable CUDA graph when starting the server. This can be done by adding the `--disable-cuda-graph` flag to the launch command.
1. https://ui.perfetto.dev/ (any browser)
2. chrome://tracing (Chrome browser only)
If the browser cannot open the trace file due to its large size,
the client can generate a small trace file (<100 MB) by controlling the number of prompts and the length of the outputs.
For example, when profiling a server,
```bash
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
```
This command sets the number of prompts to 2 with the `--num-prompts` argument and limits the length of output sequences to 100 with the `--sharegpt-output-len` argument, producing a small trace file that the browser can open smoothly.
Additionally, if you want to trace a CUDA kernel back to the SGLang Python source code, you need to disable CUDA graph when starting the server. This can be done by adding the `--disable-cuda-graph` flag to the launch command.
## Profile with Nsight

View File

@@ -0,0 +1,82 @@
# Contribution Guide
Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you're fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.
## Install SGLang from Source
### Fork and clone the repository
**Note**: New contributors do **not** have the write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally.
```bash
git clone https://github.com/<your_user_name>/sglang.git
```
### Build from source
Refer to [Install SGLang from Source](../get_started/install.md#method-2-from-source).
## Format code with pre-commit
We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:
```bash
pip3 install pre-commit
pre-commit install
pre-commit run --all-files
```
- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.
## Run and add unit tests
If you add a new feature or fix a bug, please add corresponding unit tests to ensure coverage and prevent regression.
SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework.
For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md).
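As a rough illustration, a new test file typically looks like the following minimal sketch; the file name, class, and assertions are hypothetical placeholders.

```python
# Hypothetical test file, e.g. test/srt/test_my_new_feature.py; names and checks are placeholders.
import unittest


class TestMyNewFeature(unittest.TestCase):
    def test_basic_behavior(self):
        # Replace with real assertions against the feature you added.
        self.assertEqual(1 + 1, 2)


if __name__ == "__main__":
    unittest.main()
```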
## Write documentation
We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase.
For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
## Test the accuracy
If your code changes the model output, please run the accuracy tests. A quick sanity check is the few-shot GSM8K.
```
# Launch a server
python3 -m sglang.launch_server --model Qwen/Qwen2-7B-Instruct
# Evaluate
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
```
Please note that the above script is primarily a sanity check, not a rigorous accuracy or speed test.
This test can have significant variance (1%–5%) in accuracy due to batching and the non-deterministic nature of the inference engine.
Also, do not rely on the "Latency/Output throughput" from this script, as it is not a proper speed test.
GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests.
You can find additional accuracy eval examples in:
- [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_eval_accuracy_large.py)
- [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_gpt_oss_1gpu.py)
## Benchmark the speed
Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md).
## Request a Review
You can identify potential reviewers for your code by checking the [code owners](https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and [reviewers](https://github.com/sgl-project/sglang/blob/main/.github/REVIEWERS.md) files.
Another effective strategy is to review the file modification history and contact individuals who have frequently edited the files.
If you modify files protected by code owners, their approval is required to merge the code.
## General Code Style
- Avoid code duplication. If the same code snippet (more than 5 lines) appears multiple times, extract it into a shared function.
- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, as much as possible. Use vectorized code instead (see the sketch after this list).
- Keep files short. If a file exceeds 2,000 lines of code, please split it into multiple smaller files.
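A hedged sketch of the synchronization guideline above; torch is assumed to be available and the tensors are illustrative.

```python
# Illustrative only: avoid per-element CPU-GPU syncs; prefer a single vectorized reduction.
import torch

x = torch.randn(1024, device="cuda" if torch.cuda.is_available() else "cpu")

# Avoid: every .item() call forces a device synchronization.
count_slow = sum(1 for v in x if v.item() > 0)

# Prefer: one vectorized reduction, synchronized at most once.
count_fast = (x > 0).sum()
```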
## Tips for newcomers
If you want to contribute but don't have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang's workflow.
If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://slack.sglang.ai).
Thank you for your interest in SGLang. Happy coding!

View File

@@ -1,49 +0,0 @@
import weakref
from sglang.utils import execute_shell_command, reserve_port
DEFAULT_MAX_RUNNING_REQUESTS = 200
DEFAULT_MAX_TOTAL_TOKENS = 20480
import sglang.srt.server_args as server_args_mod
_original_post_init = server_args_mod.ServerArgs.__post_init__
def patched_post_init(self):
_original_post_init(self)
if self.max_running_requests is None:
self.max_running_requests = DEFAULT_MAX_RUNNING_REQUESTS
if self.max_total_tokens is None:
self.max_total_tokens = DEFAULT_MAX_TOTAL_TOKENS
self.disable_cuda_graph = True
server_args_mod.ServerArgs.__post_init__ = patched_post_init
process_socket_map = weakref.WeakKeyDictionary()
def launch_server_cmd(command: str, host: str = "0.0.0.0", port: int = None):
"""
Launch the server using the given command.
If no port is specified, a free port is reserved.
"""
if port is None:
port, lock_socket = reserve_port(host)
else:
lock_socket = None
extra_flags = (
f"--max-running-requests {DEFAULT_MAX_RUNNING_REQUESTS} "
f"--max-total-tokens {DEFAULT_MAX_TOTAL_TOKENS} "
f"--disable-cuda-graph"
)
full_command = f"{command} --port {port} {extra_flags}"
process = execute_shell_command(full_command)
if lock_socket is not None:
process_socket_map[process] = lock_socket
return process, port

View File

@@ -1,27 +1,25 @@
# Install SGLang
You can install SGLang using any of the methods below.
You can install SGLang using one of the methods below.
For running DeepSeek V3/R1, refer to [DeepSeek V3 Support](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). It is recommended to use the latest version and deploy it with [Docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended) to avoid environment-related issues.
It is recommended to use uv to install the dependencies for faster installation:
This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [NVIDIA Blackwell GPUs](../platforms/blackwell_gpu.md), [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend_npu.md).
## Method 1: With pip or uv
It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.5.0rc0"
```
**Quick Fixes to Common Problems**
- SGLang currently uses torch 2.7.1, so you need to install flashinfer for torch 2.7.1. If you want to install flashinfer separately, please refer to [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer pypi package is called `flashinfer-python` instead of `flashinfer`.
**Quick fixes to common problems**
- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions:
1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
- SGLang currently uses torch 2.8 and flashinfer for torch 2.8. If you want to install flashinfer separately, please refer to [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer pypi package is called `flashinfer-python` instead of `flashinfer`.
## Method 2: From source
@@ -30,34 +28,18 @@ uv pip install "sglang[all]>=0.5.0rc0"
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang
# Install the python packages
pip install --upgrade pip
pip install -e "python[all]"
```
Note: SGLang currently uses torch 2.7.1, so you need to install flashinfer for torch 2.7.1. If you want to install flashinfer separately, please refer to [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html).
If you want to develop SGLang, it is recommended to use docker. Please refer to [setup docker container](https://github.com/sgl-project/sglang/blob/main/docs/references/development_guide_using_docker.md#setup-docker-container) for guidance. The docker image is `lmsysorg/sglang:dev`.
Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:
```bash
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"
```
Note: Please refer to [the CPU environment setup command list](../references/cpu.md#install-from-source)
to set up the SGLang environment for running the models with CPU servers.
**Quick fixes to common problems**
- If you want to develop SGLang, it is recommended to use docker. Please refer to [setup docker container](../developer_guide/development_guide_using_docker.md#setup-docker-container). The docker image is `lmsysorg/sglang:dev`.
- SGLang currently uses torch 2.8 and flashinfer for torch 2.8. If you want to install flashinfer separately, please refer to [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer pypi package is called `flashinfer-python` instead of `flashinfer`.
## Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
@@ -71,41 +53,9 @@ docker run --gpus all \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to use `docker/Dockerfile.rocm` to build images. An example and its usage are shown below:
## Method 4: Using Kubernetes
```bash
docker build --build-arg SGL_BRANCH=v0.5.0rc0 -t v0.5.0rc0-rocm630 -f Dockerfile.rocm .
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
--shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx -v /data:/data'
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
v0.5.0rc0-rocm630 \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
# Until the flashinfer backend is available, --attention-backend triton --sampling-backend pytorch are set by default
drun v0.5.0rc0-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
```
Note: Please refer to [the CPU installation guide using Docker](../references/cpu.md#install-using-docker)
to set up the SGLang environment for running the models with CPU servers.
## Method 4: Using docker compose
<details>
<summary>More</summary>
> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).
1. Copy the [compose.yml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine
2. Execute the command `docker compose up -d` in your terminal.
</details>
## Method 5: Using Kubernetes
Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).
<details>
<summary>More</summary>
@@ -120,6 +70,18 @@ to set up the SGLang environment for running the models with CPU servers.
</details>
## Method 5: Using docker compose
<details>
<summary>More</summary>
> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).
1. Copy the [compose.yml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine
2. Execute the command `docker compose up -d` in your terminal.
</details>
## Method 6: Run on Kubernetes or Clouds with SkyPilot
<details>
@@ -166,6 +128,6 @@ sky status --endpoint 30000 sglang
## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. `srt` is the abbreviation of SGLang runtime.
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. `srt` is the abbreviation of SGLang runtime.

View File

@@ -12,31 +12,41 @@ The core features include:
.. toctree::
:maxdepth: 1
:caption: Installation
:caption: Get Started
start/install.md
get_started/install.md
.. toctree::
:maxdepth: 1
:caption: Backend Tutorial
:caption: Basic Usage
references/deepseek
references/llama4
backend/send_request.ipynb
backend/openai_api_completions.ipynb
backend/openai_api_vision.ipynb
backend/openai_api_embeddings.ipynb
backend/native_api.ipynb
backend/offline_engine_api.ipynb
basic_usage/send_request.ipynb
basic_usage/openai_api.rst
basic_usage/offline_engine_api.ipynb
basic_usage/native_api.ipynb
basic_usage/sampling_params.md
basic_usage/deepseek.md
basic_usage/gpt_oss.md
basic_usage/llama4.md
.. toctree::
:maxdepth: 1
:caption: Advanced Backend Configurations
:caption: Advanced Features
backend/server_arguments.md
backend/sampling_params.md
backend/hyperparameter_tuning.md
backend/attention_backend.md
advanced_features/server_arguments.md
advanced_features/hyperparameter_tuning.md
advanced_features/speculative_decoding.ipynb
advanced_features/structured_outputs.ipynb
advanced_features/structured_outputs_for_reasoning_models.ipynb
advanced_features/function_calling.ipynb
advanced_features/separate_reasoning.ipynb
advanced_features/quantization.md
advanced_features/lora.ipynb
advanced_features/pd_disaggregation.md
advanced_features/vlm_query.ipynb
advanced_features/router.md
advanced_features/observability.md
advanced_features/attention_backend.md
.. toctree::
:maxdepth: 1
@@ -46,43 +56,38 @@ The core features include:
supported_models/multimodal_language_models.md
supported_models/embedding_models.md
supported_models/reward_models.md
supported_models/rerank_models.md
supported_models/support_new_models.md
supported_models/transformers_fallback.md
supported_models/modelscope.md
.. toctree::
:maxdepth: 1
:caption: Advanced Features
:caption: Hardware Platforms
backend/speculative_decoding.ipynb
backend/structured_outputs.ipynb
backend/function_calling.ipynb
backend/separate_reasoning.ipynb
backend/structured_outputs_for_reasoning_models.ipynb
backend/custom_chat_template.md
backend/quantization.md
backend/lora.ipynb
backend/pd_disaggregation.md
backend/vlm_query.ipynb
platforms/amd_gpu.md
platforms/blackwell_gpu.md
platforms/cpu_server.md
platforms/tpu.md
platforms/nvidia_jetson.md
platforms/ascend_npu.md
.. toctree::
:maxdepth: 1
:caption: Frontend Tutorial
:caption: Developer Guide
frontend/frontend.ipynb
frontend/choices_methods.md
developer_guide/contribution_guide.md
developer_guide/development_guide_using_docker.md
developer_guide/benchmark_and_profiling.md
.. toctree::
:maxdepth: 1
:caption: SGLang Router
:caption: References
router/router.md
.. toctree::
:maxdepth: 1
:caption: References
references/general
references/hardware
references/advanced_deploy
references/performance_analysis_and_optimization
references/developer
references/faq.md
references/environment_variables.md
references/production_metrics.md
references/custom_chat_template.md
references/frontend/frontend_index.rst
references/multi_node_deployment/multi_node_index.rst
references/learn_more.md

View File

@@ -1,15 +1,16 @@
# SGLang on AMD
# AMD GPUs
This document describes how to set up an AMD-based environment for [SGLang](https://github.com/sgl-project/sglang). If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues) on the SGLang repository.
This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).
## System Configuration
When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:
- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)
**NOTE:** We strongly recommend reading these docs and guides entirely to fully utilize your system.
@@ -35,24 +36,35 @@ You can automate or verify this change using [this helpful script](https://githu
Again, please go through the entire documentation to confirm your system is using the recommended configuration.
## Installing SGLang
## Install SGLang
For general installation instructions, see the official [SGLang Installation Docs](../start/install.md). Below are the AMD-specific steps summarized for convenience.
You can install SGLang using one of the methods below.
### Install from Source
```bash
git clone https://github.com/sgl-project/sglang.git
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang
# Compile sgl-kernel
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
cd sgl-kernel
python setup_rocm.py install
# Install sglang python package
cd ..
pip install -e "python[all_hip]"
```
### Install Using Docker (Recommended)
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).
The steps below show how to build and use an image.
1. Build the docker image.
If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image names in the steps below.
```bash
docker build -t sglang_image -f Dockerfile.rocm .
@@ -68,10 +80,10 @@ pip install -e "python[all_hip]"
-v /data:/data'
```
If you are using RDMA, please note that:
If you are using RDMA, please note that:
- `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
- You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
3. Launch the server.
**NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).

View File

@@ -0,0 +1,7 @@
# Ascend NPUs
## Install
TODO
## Examples
TODO

View File

@@ -0,0 +1,9 @@
# Blackwell GPUs
We will release the pre-built wheels soon. Until then, please compile from source or use the Blackwell docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
## B200 with x86 CPUs
TODO
## GB200/GB300 with ARM CPUs
TODO

View File

@@ -1,4 +1,4 @@
# SGLang on CPU
# CPU Servers
The document addresses how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
Specifically, SGLang is well optimized on the CPUs equipped with Intel® AMX® Instructions,

View File

@@ -1,4 +1,4 @@
# Apply SGLang on NVIDIA Jetson Orin
# NVIDIA Jetson Orin
## Prerequisites

3
docs/platforms/tpu.md Normal file
View File

@@ -0,0 +1,3 @@
# TPU
Support for TPU is under active development. Please stay tuned.

View File

@@ -1,60 +0,0 @@
# Measuring Model Accuracy in SGLang
This guide shows how to evaluate model accuracy using SGLang's [built-in benchmarks](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark). Please include accuracy on crucial benchmarks in your PR if you make modifications on the model side, like the kernel and model architecture.
## Benchmarking Model Accuracy
This is a reference workflow for the [MMLU benchmark](https://github.com/sgl-project/sglang/tree/main/benchmark/mmlu). For more details or other benchmarks, please refer to the README in each specific benchmark folder under [sglang/benchmark](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark).
```bash
# Step 1: Download the dataset
bash download_data.sh
# Step 2: Launch the server
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-Math-1.5B-Instruct \ # Model selection
--port 30000 \ # Network configuration
--mem-fraction-static 0.8 # Memory optimization
# Step 3: Run the benchmark script
python3 bench_sglang.py --nsub 10 # Test 10 subjects
# Step 4: Extract the accuracy
cat result.jsonl | grep -oP '"accuracy": \K\d+\.\d+'
```
## Customizing Benchmark Scripts
Some benchmark implementations may differ from ours, causing accuracy discrepancies. To match [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math)'s reported 76.8% GSM8K accuracy, customization is required.
```python
# The GSM8K benchmark script includes few shot examples for evaluation by default.
# Here we exclude them.
for i in range(len(lines[num_shots:num_questions])):
questions.append(get_one_example(lines, i, False))
labels.append(get_answer_value(lines[i]["answer"]))
```
```python
@sgl.function
def few_shot_gsm8k(s, question):
# System prompt given in https://github.com/QwenLM/Qwen2.5-Math
s += sgl.system("Please reason step by step, and put your final answer within \\boxed{}.") # Include system prompt
s += few_shot_examples + question
# Stopwords given in evaluation/math_eval.py of the Qwen2.5-Math repo
s += sgl.gen(
"answer", max_tokens=2048, stop=["Question", "Assistant:", "</s>", "<|im_end|>", "<|endoftext|>"]
)
```
These adjustments should return the desired accuracy.
## Extending Evaluation Capabilities
1. **Contribute New Benchmarks**
* Follow our [contribution guidelines](../references/contribution_guide.md) to add new test scripts
2. **Request Implementations**
* Feel free to open an issue describing your evaluation needs
3. **Use Alternative Tools**
* [OpenCompass](https://opencompass.org.cn)
* [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)

View File

@@ -1,8 +0,0 @@
Multi-Node Deployment
==========================
.. toctree::
:maxdepth: 1
multi_node.md
deploy_on_k8s.md
disaggregation/lws_pd_deploy.md

View File

@@ -1,46 +0,0 @@
# Contribution Guide
Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you're fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.
## Setting Up & Building from Source
### Fork and Clone the Repository
**Note**: New contributors do **not** have the write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally.
```bash
git clone https://github.com/<your_user_name>/sglang.git
```
### Install Dependencies & Build
Refer to [Install SGLang from Source](https://docs.sglang.ai/start/install.html#method-2-from-source) documentation for more details on setting up the necessary dependencies.
## Code Formatting with Pre-Commit
We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:
```bash
pip3 install pre-commit
pre-commit install
pre-commit run --all-files
```
- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.
## Running Unit Tests & Adding to CI
SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. For detailed instructions on running tests and adding them to CI, please refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md).
## Writing Documentation & Running Docs CI
We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase. For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
## Tips for Newcomers
If you want to contribute but don't have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang's workflow.
If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2um0ad92q-LkU19KQTxCGzlCgRiOiQEw).
Thank you for your interest in SGLang. Happy coding!

View File

@@ -1,6 +0,0 @@
Multi-Node Deployment
==========================
.. toctree::
:maxdepth: 1
deepseek.md

View File

@@ -1,8 +0,0 @@
Developer Reference
==========================
.. toctree::
:maxdepth: 1
development_guide_using_docker.md
release_process.md
setup_github_runner.md

View File

@@ -1,6 +1,26 @@
# Frequently Asked Questions
# Troubleshooting and Frequently Asked Questions
## The results are not deterministic, even with a temperature of 0
## Troubleshooting
This page lists common errors and tips for resolving them.
### CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters (a combined example command follows this list):
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common case for OOM is requesting input logprobs for a long prompt as it requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.
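For example, a launch command that combines several of these adjustments could look like the sketch below; the model path and the specific values are illustrative and should be tuned for your hardware.
```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```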
### CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
## Frequently Asked Questions
### The results are not deterministic, even with a temperature of 0
You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.

View File

@@ -0,0 +1,9 @@
Frontend Language
=================
.. toctree::
:maxdepth: 1
:caption: Frontend Language
frontend_tutorial.ipynb
choices_methods.md

View File

@@ -29,23 +29,15 @@
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import os\n",
"\n",
"from sglang import assistant_begin, assistant_end\n",
"from sglang import assistant, function, gen, system, user\n",
"from sglang import image\n",
"from sglang import RuntimeEndpoint, set_default_backend\n",
"from sglang import RuntimeEndpoint\n",
"from sglang.lang.api import set_default_backend\n",
"from sglang.srt.utils import load_image\n",
"from sglang.test.test_utils import is_in_ci\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import print_highlight, terminate_process, wait_for_server\n",
"\n",
"if is_in_ci():\n",
" from patch import launch_server_cmd\n",
"else:\n",
" from sglang.utils import launch_server_cmd\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0\"\n",
")\n",

View File

@@ -1,14 +0,0 @@
General Guidance
==========
.. toctree::
:maxdepth: 1
contribution_guide.md
troubleshooting.md
faq.md
learn_more.md
modelscope.md
environment_variables.md
production_metrics.md

View File

@@ -1,8 +0,0 @@
Hardware Supports
==========
.. toctree::
:maxdepth: 1
amd.md
nvidia_jetson.md
cpu.md

View File

@@ -1,3 +1,7 @@
# Learn more
You can find more blogs, slides, and videos about SGLang at [https://github.com/sgl-project/sgl-learning-materials](https://github.com/sgl-project/sgl-learning-materials).
The latest SGLang features and updates are shared through the [LMSYS blog](https://lmsys.org/blog/).
The 2025 H2 roadmap can be found at this [issue](https://github.com/sgl-project/sglang/issues/7736).

View File

@@ -0,0 +1,13 @@
Multi-Node Deployment
=====================
.. toctree::
:maxdepth: 1
:caption: Multi-Node Deployment
multi_node.md
deploy_on_k8s.md
lws_pd/lws_pd_deploy.md
- `Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs <https://lmsys.org/blog/2025-05-05-large-scale-ep/>`_
- `Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs <https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/>`_

View File

@@ -1,7 +0,0 @@
Performance Analysis & Optimization
===================================
.. toctree::
:maxdepth: 1
benchmark_and_profiling.md
accuracy_evaluation.md

View File

@@ -1,6 +1,6 @@
# Production Metrics
SGLang exposes the following metrics via Prometheus. The metrics are namespaced by `$name` (the model name).
SGLang exposes the following metrics via Prometheus. You can enable metrics export by adding `--enable-metrics` when you launch the server.
An example of the monitoring dashboard is available in [examples/monitoring/grafana.json](https://github.com/sgl-project/sglang/blob/main/examples/monitoring/grafana/dashboards/json/sglang-dashboard.json).
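As a hedged sketch of a typical setup (the model path and port are illustrative), you can launch a server with metrics enabled and then scrape the Prometheus endpoint, which is assumed to be served at `/metrics` on the same port:
```bash
# Launch a server with Prometheus metrics enabled.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --enable-metrics

# Scrape the metrics endpoint.
curl http://localhost:30000/metrics
```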

View File

@@ -1,16 +0,0 @@
# Troubleshooting
This page lists common errors and tips for resolving them.
## CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common case for OOM is requesting input logprobs for a long prompt as it requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.
## CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.

View File

@@ -4,23 +4,23 @@
## Example launch Command
By default, we will use sglang implementation if it is available. Otherwise, we will fall back to transformers one. However, you can switch the implementation by setting `impl` to `transformers`.
By default, we will use the SGLang implementation if it is available. Otherwise, we will fall back to the Transformers one. However, you can switch the implementation by setting `--model-impl` to `transformers`.
```shell
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--impl transformers
--model-impl transformers
```
#### Supported features
## Supported features
##### Quantization
### Quantization
The Transformers fallback supports most of the available quantization methods in SGLang (except GGUF). See the [Quantization page](https://docs.sglang.ai/backend/quantization.html) for more information about supported quantization in SGLang.
##### Remote code
### Remote code
This fallback also means that any model on the Hub that can be used in `transformers` with `trust_remote_code=True`, and that correctly implements attention, can be used in production!
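As a hedged illustration (the model path is a placeholder for a custom model on the Hub), such a model can be served with the Transformers fallback by combining `--model-impl transformers` with `--trust-remote-code`:
```shell
python3 -m sglang.launch_server \
    --model-path <org>/<custom-model-with-remote-code> \
    --model-impl transformers \
    --trust-remote-code
```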

View File

@@ -32,16 +32,20 @@ from sglang.lang.choices import (
token_length_normalized,
unconditional_likelihood_normalized,
)
from sglang.srt.entrypoints.engine import Engine
# Lazy import some libraries
from sglang.utils import LazyImport
from sglang.version import __version__
ServerArgs = LazyImport("sglang.srt.server_args", "ServerArgs")
Anthropic = LazyImport("sglang.lang.backend.anthropic", "Anthropic")
LiteLLM = LazyImport("sglang.lang.backend.litellm", "LiteLLM")
OpenAI = LazyImport("sglang.lang.backend.openai", "OpenAI")
VertexAI = LazyImport("sglang.lang.backend.vertexai", "VertexAI")
# Runtime Engine APIs
ServerArgs = LazyImport("sglang.srt.server_args", "ServerArgs")
Engine = LazyImport("sglang.srt.entrypoints.engine", "Engine")
__all__ = [
"Engine",
"Runtime",

View File

@@ -2175,10 +2175,6 @@ class ServerArgs:
self.mem_fraction_static = (
original_server_arg_mem_fraction * final_overall_factor
)
logger.warning(
f"Multimodal model: Dynamically adjusted --mem-fraction-static "
f"from: {original_server_arg_mem_fraction:.3f} to: {self.mem_fraction_static:.3f}."
)
def prepare_server_args(argv: List[str]) -> ServerArgs:

View File

@@ -1,15 +1,21 @@
"""
Apply some monkey patches to make the documentation compilation faster and more reliable.
- Avoid port conflicts
- Reduce the server launch time
"""
import weakref
import nest_asyncio
nest_asyncio.apply()
import sglang.srt.server_args as server_args_mod
from sglang.utils import execute_shell_command, reserve_port
DEFAULT_MAX_RUNNING_REQUESTS = 200
DEFAULT_MAX_TOTAL_TOKENS = 20480
import sglang.srt.server_args as server_args_mod
DEFAULT_MAX_RUNNING_REQUESTS = 128
DEFAULT_MAX_TOTAL_TOKENS = 20480 # To allow multiple servers on the same machine
_original_post_init = server_args_mod.ServerArgs.__post_init__
@@ -20,7 +26,7 @@ def patched_post_init(self):
self.max_running_requests = DEFAULT_MAX_RUNNING_REQUESTS
if self.max_total_tokens is None:
self.max_total_tokens = DEFAULT_MAX_TOTAL_TOKENS
self.disable_cuda_graph = True
self.cuda_graph_max_bs = 4
server_args_mod.ServerArgs.__post_init__ = patched_post_init
@@ -41,7 +47,7 @@ def launch_server_cmd(command: str, host: str = "0.0.0.0", port: int = None):
extra_flags = (
f"--max-running-requests {DEFAULT_MAX_RUNNING_REQUESTS} "
f"--max-total-tokens {DEFAULT_MAX_TOTAL_TOKENS} "
f"--disable-cuda-graph"
f"--cuda-graph-max-bs 4"
)
full_command = f"{command} --port {port} {extra_flags}"

View File

@@ -458,7 +458,7 @@ def wait_for_server(base_url: str, timeout: int = None) -> None:
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI parallel environment, so the throughput is not representative of the actual performance.
We are running those notebooks in a CI environment, so the throughput is not representative of the actual performance.
"""
)
break

View File

@@ -19,22 +19,15 @@ python3 run_suite.py --suite per-commit
## Test Frontend Language
```bash
cd sglang/test/lang
export OPENAI_API_KEY=sk-*****
# Run a single file
python3 test_openai_backend.py
# Run a single test
python3 -m unittest test_openai_backend.TestOpenAIServer.test_few_shot_qa
# Run a suite with multiple files
python3 run_suite.py --suite per-commit
python3 test_srt_backend.py
```
## Adding or Updating Tests in CI
- Create new test files under `test/srt` or `test/lang` depending on the type of test.
- Ensure they are referenced in the respective `run_suite.py` (e.g., `test/srt/run_suite.py` or `test/lang/run_suite.py`) so they're picked up in CI. For most small test cases, they can be added to the `per-commit` suite.
- Ensure they are referenced in the respective `run_suite.py` (e.g., `test/srt/run_suite.py`) so they're picked up in CI. For most small test cases, they can be added to the `per-commit` suite. Sort the test cases alphabetically.
- The CI will run the `per-commit` and `nightly` suites automatically. If you need a special setup or custom test groups, you may modify the workflows in [`.github/workflows/`](https://github.com/sgl-project/sglang/tree/main/.github/workflows).
@@ -45,3 +38,4 @@ python3 run_suite.py --suite per-commit
- Give tests descriptive names reflecting their purpose.
- Use robust assertions (e.g., assert, unittest methods) to validate outcomes.
- Clean up resources to avoid side effects and preserve test independence.
- Reduce the test time by using smaller models and reusing the server for multiple test cases.
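A hedged sketch of the server-reuse pattern is shown below. The helper functions come from `sglang.utils` as used elsewhere in the docs, and the model path and endpoints are illustrative; adapt them to the test you are writing.
```python
import unittest

import requests

from sglang.utils import launch_server_cmd, terminate_process, wait_for_server


class TestServerReuse(unittest.TestCase):
    """Launch one server for the whole class and reuse it across test cases."""

    @classmethod
    def setUpClass(cls):
        # A small model keeps startup and inference fast; the model path is illustrative.
        cls.process, cls.port = launch_server_cmd(
            "python -m sglang.launch_server --model-path Qwen/Qwen2.5-0.5B-Instruct --host 0.0.0.0"
        )
        cls.base_url = f"http://localhost:{cls.port}"
        wait_for_server(cls.base_url)

    @classmethod
    def tearDownClass(cls):
        # Clean up the server so later tests are not affected by leftover processes.
        terminate_process(cls.process)

    def test_health_endpoint(self):
        response = requests.get(f"{self.base_url}/health")
        self.assertEqual(response.status_code, 200)

    def test_generate_endpoint(self):
        response = requests.post(
            f"{self.base_url}/generate",
            json={
                "text": "The capital of France is",
                "sampling_params": {"max_new_tokens": 8},
            },
        )
        self.assertEqual(response.status_code, 200)
```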