[Docs] Add Support for Pydantic Structured Output Format (#2697)
@@ -1,67 +1,5 @@
# Backend: SGLang Runtime (SRT)

The SGLang Runtime (SRT) is an efficient serving engine.

# Server Arguments

## Quick Start

Launch a server:

```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request:

```
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Once upon a time,",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    }
  }'
```
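The same request can be sent from Python. Below is a minimal sketch using only the standard library; it assumes the quick-start server above is listening on port 30000 and falls back to a message when it is not:

```python
import json
from urllib import request, error

# Same payload as the curl example above
payload = {
    "text": "Once upon a time,",
    "sampling_params": {"max_new_tokens": 16, "temperature": 0},
}

req = request.Request(
    "http://localhost:30000/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=10) as resp:
        print(json.load(resp))
except (error.URLError, OSError):
    # No server running locally; the payload above is still a valid request body
    print("server not reachable")
```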

Learn more about the argument specification, streaming, and multi-modal support [here](../references/sampling_params.md).

## OpenAI Compatible API

In addition, the server supports OpenAI-compatible APIs.

```python
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

# Text embedding
response = client.embeddings.create(
    model="default",
    input="How are you today",
)
print(response)
```

It supports streaming, vision, and almost all features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
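Streaming works through the same client. A hedged sketch follows; the `openai` import and the server connection are both guarded so the snippet degrades gracefully when either is unavailable:

```python
messages = [{"role": "user", "content": "List 3 countries and their capitals."}]

try:
    import openai
except ImportError:  # the openai package is optional for this sketch
    openai = None

if openai is not None:
    client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
    try:
        # stream=True yields incremental chunks instead of one final message
        stream = client.chat.completions.create(
            model="default", messages=messages, stream=True
        )
        for chunk in stream:
            if chunk.choices:
                delta = chunk.choices[0].delta.content
                if delta:
                    print(delta, end="", flush=True)
    except Exception as exc:  # e.g. server not running
        print(f"request failed: {exc}")
```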

## Additional Server Arguments

- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.

```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```

@@ -94,35 +32,6 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct

```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
```

## Engine Without HTTP Server

We also provide an inference engine **without an HTTP server**. For example,

```python
import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    main()
```

This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).

## Use Models From ModelScope

<details>
<summary>More</summary>

@@ -16,16 +16,11 @@
    "SGLang supports two grammar backends:\n",
    "\n",
    "- [Outlines](https://github.com/dottxt-ai/outlines) (default): Supports JSON schema and regular expression constraints.\n",
    "- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema and EBNF constraints and currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).\n",
    "\n",
    "Initialize the XGrammar backend using the `--grammar-backend xgrammar` flag:\n",
    "```bash\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "--port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: outlines)\n",
    "```\n",
    "\n",
    "We suggest using XGrammar whenever possible for its better performance. For more details, see the [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",
    "\n",
    "To use XGrammar, simply add `--grammar-backend xgrammar` when launching the server. If no backend is specified, Outlines will be used as the default."
   ]
  },
  {
@@ -35,13 +30,6 @@
    "## OpenAI Compatible API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,

@@ -68,7 +56,64 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### JSON\n",
    "\n",
    "You can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Using Pydantic**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "# Define the schema using Pydantic\n",
    "class CapitalInfo(BaseModel):\n",
    "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
    "    population: int = Field(..., description=\"Population of the capital city\")\n",
    "\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": \"Give me the information of the capital of France in the JSON format.\",\n",
    "        },\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=128,\n",
    "    response_format={\n",
    "        \"type\": \"json_schema\",\n",
    "        \"json_schema\": {\n",
    "            \"name\": \"foo\",\n",
    "            # Convert the Pydantic model to a JSON schema\n",
    "            \"schema\": CapitalInfo.model_json_schema(),\n",
    "        },\n",
    "    },\n",
    ")\n",
    "\n",
    "response_content = response.choices[0].message.content\n",
    "# Validate the JSON response against the Pydantic model\n",
    "capital_info = CapitalInfo.model_validate_json(response_content)\n",
    "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**JSON Schema Directly**"
   ]
  },
  {
@@ -225,15 +270,64 @@
    "### JSON"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Using Pydantic**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "import requests\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "# Define the schema using Pydantic\n",
    "class CapitalInfo(BaseModel):\n",
    "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
    "    population: int = Field(..., description=\"Population of the capital city\")\n",
    "\n",
    "\n",
    "# Make the API request\n",
    "response = requests.post(\n",
    "    \"http://localhost:30010/generate\",\n",
    "    json={\n",
    "        \"text\": \"Here is the information of the capital of France in the JSON format.\\n\",\n",
    "        \"sampling_params\": {\n",
    "            \"temperature\": 0,\n",
    "            \"max_new_tokens\": 64,\n",
    "            \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
    "        },\n",
    "    },\n",
    ")\n",
    "print_highlight(response.json())\n",
    "\n",
    "response_data = json.loads(response.json()[\"text\"])\n",
    "# Validate the response against the Pydantic model\n",
    "capital_info = CapitalInfo.model_validate(response_data)\n",
    "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**JSON Schema Directly**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "json_schema = json.dumps(\n",
    "    {\n",
    "        \"type\": \"object\",\n",
@@ -379,6 +473,13 @@
    "### JSON"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Using Pydantic**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -386,7 +487,49 @@
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "prompts = [\n",
    "    \"Give me the information of the capital of China in the JSON format.\",\n",
    "    \"Give me the information of the capital of France in the JSON format.\",\n",
    "    \"Give me the information of the capital of Ireland in the JSON format.\",\n",
    "]\n",
    "\n",
    "\n",
    "# Define the schema using Pydantic\n",
    "class CapitalInfo(BaseModel):\n",
    "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
    "    population: int = Field(..., description=\"Population of the capital city\")\n",
    "\n",
    "\n",
    "sampling_params = {\n",
    "    \"temperature\": 0.1,\n",
    "    \"top_p\": 0.95,\n",
    "    \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
    "}\n",
    "\n",
    "outputs = llm_xgrammar.generate(prompts, sampling_params)\n",
    "for prompt, output in zip(prompts, outputs):\n",
    "    print_highlight(\"===============================\")\n",
    "    print_highlight(f\"Prompt: {prompt}\")\n",
    "    # Validate the output against the Pydantic model\n",
    "    capital_info = CapitalInfo.model_validate_json(output[\"text\"])\n",
    "    print_highlight(f\"Validated output: {capital_info.model_dump_json()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**JSON Schema Directly**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Give me the information of the capital of China in the JSON format.\",\n",
    "    \"Give me the information of the capital of France in the JSON format.\",\n",
@@ -29,7 +29,7 @@ The core features include:
   backend/native_api.ipynb
   backend/offline_engine_api.ipynb
   backend/structured_outputs.ipynb
   backend/backend.md
   backend/server_arguments.md


.. toctree::

@@ -2,9 +2,9 @@

Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.

## Setting Up & Building from Source

### Fork and Clone the Repository

**Note**: SGLang does **not** accept PRs on the main repo. Please fork the repository under your GitHub account, then clone your fork locally.

@@ -13,7 +13,7 @@ git clone https://github.com/<your_user_name>/sglang.git
cd sglang
```

### Install Dependencies & Build

Refer to the [Install SGLang](https://sgl-project.github.io/start/install.html) documentation for more details on setting up the necessary dependencies.

@@ -32,7 +32,7 @@ cd sglang/python
pip install .
```

## Code Formatting with Pre-Commit

We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:

@@ -45,11 +45,11 @@ pre-commit run --all-files

- **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
- **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.

## Writing Documentation & Running Docs CI

Most documentation files are located under the `docs/` folder. We prefer **Jupyter Notebooks** over Markdown so that all examples can be executed and validated by our docs CI pipeline.

### Docs Workflow

Add or update your Jupyter notebooks in the appropriate subdirectories under `docs/`. If you add new files, remember to update `index.rst` (or relevant `.rst` files) accordingly.
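
Registering a new notebook is a one-line addition to the relevant toctree; a sketch of the fragment (`backend/my_feature.ipynb` is a hypothetical filename, the existing entries come from `index.rst`):

```rst
.. toctree::

   backend/structured_outputs.ipynb
   backend/my_feature.ipynb
```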

@@ -114,11 +114,11 @@ llm.shutdown()
```

## Running Unit Tests & Adding to CI

SGLang uses Python’s built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. You can run tests either individually or in suites.

### Test Backend Runtime

```bash
cd sglang/test/srt
@@ -133,7 +133,7 @@ python3 -m unittest test_srt_endpoint.TestSRTEndpoint.test_simple_decode
python3 run_suite.py --suite minimal
```

### Test Frontend Language

```bash
cd sglang/test/lang
@@ -149,13 +149,13 @@ python3 -m unittest test_openai_backend.TestOpenAIBackend.test_few_shot_qa
python3 run_suite.py --suite minimal
```

### Adding or Updating Tests in CI

- Create new test files under `test/srt` or `test/lang` depending on the type of test.
- Ensure they are referenced in the respective `run_suite.py` (e.g., `test/srt/run_suite.py` or `test/lang/run_suite.py`) so they’re picked up in CI.
- In CI, all tests run automatically. You may modify the workflows in [`.github/workflows/`](https://github.com/sgl-project/sglang/tree/main/.github/workflows) to add custom test groups or extra checks.

### Writing Elegant Test Cases

- Examine existing tests in [sglang/test](https://github.com/sgl-project/sglang/tree/main/test) for practical examples.
- Keep each test function focused on a single scenario or piece of functionality.
@@ -164,7 +164,7 @@ python3 run_suite.py --suite minimal
- Clean up resources to avoid side effects and preserve test independence.
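
The guidelines above can be sketched as a minimal test case (hypothetical names; only the standard `unittest` module is used):

```python
import unittest


class TestSamplingParams(unittest.TestCase):
    """One focused scenario per test method, with a descriptive name."""

    def setUp(self):
        # Shared fixture; rebuilt per test to preserve test independence
        self.params = {"temperature": 0.8, "top_p": 0.95}

    def test_defaults_are_in_valid_range(self):
        self.assertGreaterEqual(self.params["temperature"], 0.0)
        self.assertLessEqual(self.params["top_p"], 1.0)


if __name__ == "__main__":
    # exit=False so the process stays usable (e.g., inside a notebook)
    unittest.main(argv=["example"], exit=False)
```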

## Tips for Newcomers

If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow.