diff --git a/docs/backend/backend.md b/docs/backend/server_arguments.md
similarity index 62%
rename from docs/backend/backend.md
rename to docs/backend/server_arguments.md
index 4f014f21e..a4913b8af 100644
--- a/docs/backend/backend.md
+++ b/docs/backend/server_arguments.md
@@ -1,67 +1,5 @@
-# Backend: SGLang Runtime (SRT)
-The SGLang Runtime (SRT) is an efficient serving engine.
+# Server Arguments
-## Quick Start
-Launch a server
-```
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
-```
-
-Send a request
-```
-curl http://localhost:30000/generate \
-  -H "Content-Type: application/json" \
-  -d '{
-    "text": "Once upon a time,",
-    "sampling_params": {
-      "max_new_tokens": 16,
-      "temperature": 0
-    }
-  }'
-```
-
-Learn more about the argument specification, streaming, and multi-modal support [here](../references/sampling_params.md).
-
-## OpenAI Compatible API
-In addition, the server supports OpenAI-compatible APIs.
-
-```python
-import openai
-client = openai.Client(
-    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
-
-# Text completion
-response = client.completions.create(
-    model="default",
-    prompt="The capital of France is",
-    temperature=0,
-    max_tokens=32,
-)
-print(response)
-
-# Chat completion
-response = client.chat.completions.create(
-    model="default",
-    messages=[
-        {"role": "system", "content": "You are a helpful AI assistant"},
-        {"role": "user", "content": "List 3 countries and their capitals."},
-    ],
-    temperature=0,
-    max_tokens=64,
-)
-print(response)
-
-# Text embedding
-response = client.embeddings.create(
-    model="default",
-    input="How are you today",
-)
-print(response)
-```
-
-It supports streaming, vision, and almost all features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
-
-## Additional Server Arguments
-
 To enable multi-GPU tensor parallelism, add `--tp 2`.
 If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
@@ -94,35 +32,6 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
 ```
-## Engine Without HTTP Server
-
-We also provide an inference engine **without a HTTP server**. For example,
-
-```python
-import sglang as sgl
-
-def main():
-    prompts = [
-        "Hello, my name is",
-        "The president of the United States is",
-        "The capital of France is",
-        "The future of AI is",
-    ]
-    sampling_params = {"temperature": 0.8, "top_p": 0.95}
-    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
-
-    outputs = llm.generate(prompts, sampling_params)
-    for prompt, output in zip(prompts, outputs):
-        print("===============================")
-        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
-
-if __name__ == "__main__":
-    main()
-```
-
-This can be used for offline batch inference and building custom servers.
-You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
-
 ## Use Models From ModelScope
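Reviewer's note for the hunks above: the rename drops the quick-start curl example from this page, so for convenience here is the same `/generate` request sketched in Python. Building the payload is runnable anywhere; the actual POST (left commented out) assumes a server already listening on `localhost:30000`, e.g. one launched with the `--tp 2` command shown.

```python
# Sketch of the quick-start /generate request body; building it as a dict
# makes the sampling_params structure explicit.
import json

payload = {
    "text": "Once upon a time,",
    "sampling_params": {
        "max_new_tokens": 16,
        "temperature": 0,
    },
}

# Serialize exactly what curl's -d flag sends.
body = json.dumps(payload)
print(body)

# To actually send it (requires a running server):
# import requests
# resp = requests.post("http://localhost:30000/generate", json=payload)
# print(resp.json()["text"])
```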
diff --git a/docs/backend/structured_outputs.ipynb b/docs/backend/structured_outputs.ipynb
index a294819bc..f017ef863 100644
--- a/docs/backend/structured_outputs.ipynb
+++ b/docs/backend/structured_outputs.ipynb
@@ -16,16 +16,11 @@
     "SGLang supports two grammar backends:\n",
     "\n",
     "- [Outlines](https://github.com/dottxt-ai/outlines) (default): Supports JSON schema and regular expression constraints.\n",
-    "- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema and EBNF constraints.\n",
-    "  - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md)\n",
+    "- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema and EBNF constraints and currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).\n",
     "\n",
-    "Initialize the XGrammar backend using `--grammar-backend xgrammar` flag\n",
-    "```bash\n",
-    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
-    "--port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: outlines)\n",
-    "```\n",
+    "We suggest using XGrammar whenever possible for its better performance. For more details, see [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",
     "\n",
-    "We suggest using XGrammar whenever possible for its better performance. For more details, see [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar)."
+    "To use XGrammar, simply add `--grammar-backend xgrammar` when launching the server. If no backend is specified, Outlines will be used as the default."
    ]
   },
   {
@@ -35,13 +30,6 @@
     "## OpenAI Compatible API"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To use Xgrammar, simply add `--grammar-backend xgrammar` when launching the server. If no backend is specified, Outlines will be used as the default."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -68,7 +56,64 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### JSON"
+    "### JSON\n",
+    "\n",
+    "You can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Using Pydantic**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pydantic import BaseModel, Field\n",
+    "\n",
+    "\n",
+    "# Define the schema using Pydantic\n",
+    "class CapitalInfo(BaseModel):\n",
+    "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
+    "    population: int = Field(..., description=\"Population of the capital city\")\n",
+    "\n",
+    "\n",
+    "response = client.chat.completions.create(\n",
+    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+    "    messages=[\n",
+    "        {\n",
+    "            \"role\": \"user\",\n",
+    "            \"content\": \"Give me the information of the capital of France in the JSON format.\",\n",
+    "        },\n",
+    "    ],\n",
+    "    temperature=0,\n",
+    "    max_tokens=128,\n",
+    "    response_format={\n",
+    "        \"type\": \"json_schema\",\n",
+    "        \"json_schema\": {\n",
+    "            \"name\": \"foo\",\n",
+    "            # convert the pydantic model to json schema\n",
+    "            \"schema\": CapitalInfo.model_json_schema(),\n",
+    "        },\n",
+    "    },\n",
+    ")\n",
+    "\n",
+    "response_content = response.choices[0].message.content\n",
+    "# validate the JSON response by the pydantic model\n",
+    "capital_info = CapitalInfo.model_validate_json(response_content)\n",
+    "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**JSON Schema Directly**\n"
    ]
   },
   {
@@ -225,15 +270,64 @@
     "### JSON"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Using Pydantic**"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "import json\n",
     "import requests\n",
+    "import json\n",
+    "from pydantic import BaseModel, Field\n",
     "\n",
+    "\n",
+    "# Define the schema using Pydantic\n",
+    "class CapitalInfo(BaseModel):\n",
+    "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
+    "    population: int = Field(..., description=\"Population of the capital city\")\n",
+    "\n",
+    "\n",
+    "# Make API request\n",
+    "response = requests.post(\n",
+    "    \"http://localhost:30010/generate\",\n",
+    "    json={\n",
+    "        \"text\": \"Here is the information of the capital of France in the JSON format.\\n\",\n",
+    "        \"sampling_params\": {\n",
+    "            \"temperature\": 0,\n",
+    "            \"max_new_tokens\": 64,\n",
+    "            \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
+    "        },\n",
+    "    },\n",
+    ")\n",
+    "print_highlight(response.json())\n",
+    "\n",
+    "\n",
+    "response_data = json.loads(response.json()[\"text\"])\n",
+    "# validate the response by the pydantic model\n",
+    "capital_info = CapitalInfo.model_validate(response_data)\n",
+    "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**JSON Schema Directly**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "json_schema = json.dumps(\n",
     "    {\n",
     "        \"type\": \"object\",\n",
@@ -379,6 +473,13 @@
     "### JSON"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Using Pydantic**"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
@@ -386,7 +487,49 @@
    "outputs": [],
    "source": [
     "import json\n",
+    "from pydantic import BaseModel, Field\n",
     "\n",
+    "\n",
+    "prompts = [\n",
+    "    \"Give me the information of the capital of China in the JSON format.\",\n",
+    "    \"Give me the information of the capital of France in the JSON format.\",\n",
+    "    \"Give me the information of the capital of Ireland in the JSON format.\",\n",
+    "]\n",
+    "\n",
+    "\n",
+    "# Define the schema using Pydantic\n",
+    "class CapitalInfo(BaseModel):\n",
+    "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
+    "    population: int = Field(..., description=\"Population of the capital city\")\n",
+    "\n",
+    "\n",
+    "sampling_params = {\n",
+    "    \"temperature\": 0.1,\n",
+    "    \"top_p\": 0.95,\n",
+    "    \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
+    "}\n",
+    "\n",
+    "outputs = llm_xgrammar.generate(prompts, sampling_params)\n",
+    "for prompt, output in zip(prompts, outputs):\n",
+    "    print_highlight(\"===============================\")\n",
+    "    print_highlight(f\"Prompt: {prompt}\")\n",
+    "    # validate the output by the pydantic model\n",
+    "    capital_info = CapitalInfo.model_validate_json(output[\"text\"])\n",
+    "    print_highlight(f\"Validated output: {capital_info.model_dump_json()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**JSON Schema Directly**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "prompts = [\n",
     "    \"Give me the information of the capital of China in the JSON format.\",\n",
     "    \"Give me the information of the capital of France in the JSON format.\",\n",
diff --git a/docs/index.rst b/docs/index.rst
index 15fcf6ea8..80a53d1cb 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -29,7 +29,7 @@ The core features include:
    backend/native_api.ipynb
    backend/offline_engine_api.ipynb
    backend/structured_outputs.ipynb
-   backend/backend.md
+   backend/server_arguments.md
 
 .. toctree::
diff --git a/docs/references/contribution_guide.md b/docs/references/contribution_guide.md
index 182a73683..38afacc06 100644
--- a/docs/references/contribution_guide.md
+++ b/docs/references/contribution_guide.md
@@ -2,9 +2,9 @@
 Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.
 
-## 1. Setting Up & Building from Source
+## Setting Up & Building from Source
 
-### 1.1 Fork and Clone the Repository
+### Fork and Clone the Repository
 
 **Note**: SGLang does **not** accept PRs on the main repo. Please fork the repository under your GitHub account, then clone your fork locally.
 
@@ -13,7 +13,7 @@
 git clone https://github.com/<your_user_name>/sglang.git
 cd sglang
 ```
 
-### 1.2 Install Dependencies & Build
+### Install Dependencies & Build
 
 Refer to [Install SGLang](https://sgl-project.github.io/start/install.html) documentation for more details on setting up the necessary dependencies.
 
@@ -32,7 +32,7 @@
 cd sglang/python
 pip install .
 ```
 
-## 2. Code Formatting with Pre-Commit
+## Code Formatting with Pre-Commit
 
 We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:
 
@@ -45,11 +45,11 @@ pre-commit run --all-files
 - **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
 - **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.
 
-## 3. Writing Documentation & Running Docs CI
+## Writing Documentation & Running Docs CI
 
 Most documentation files are located under the `docs/` folder. We prefer **Jupyter Notebooks** over Markdown so that all examples can be executed and validated by our docs CI pipeline.
 
-### 3.1 Docs Workflow
+### Docs Workflow
 
 Add or update your Jupyter notebooks in the appropriate subdirectories under `docs/`. If you add new files, remember to update `index.rst` (or relevant `.rst` files) accordingly.
 
@@ -114,11 +114,11 @@
 llm.shutdown()
 ```
 
-## 4. Running Unit Tests & Adding to CI
+## Running Unit Tests & Adding to CI
 
 SGLang uses Python’s built-in [unittest](https://docs.python.org/3/library/unittest.html) framework. You can run tests either individually or in suites.
 
-### 4.1 Test Backend Runtime
+### Test Backend Runtime
 
 ```bash
 cd sglang/test/srt
@@ -133,7 +133,7 @@ python3 -m unittest test_srt_endpoint.TestSRTEndpoint.test_simple_decode
 python3 run_suite.py --suite minimal
 ```
 
-### 4.2 Test Frontend Language
+### Test Frontend Language
 
 ```bash
 cd sglang/test/lang
@@ -149,13 +149,13 @@ python3 -m unittest test_openai_backend.TestOpenAIBackend.test_few_shot_qa
 python3 run_suite.py --suite minimal
 ```
 
-### 4.3 Adding or Updating Tests in CI
+### Adding or Updating Tests in CI
 
 - Create new test files under `test/srt` or `test/lang` depending on the type of test.
 - Ensure they are referenced in the respective `run_suite.py` (e.g., `test/srt/run_suite.py` or `test/lang/run_suite.py`) so they’re picked up in CI.
 - In CI, all tests run automatically. You may modify the workflows in [`.github/workflows/`](https://github.com/sgl-project/sglang/tree/main/.github/workflows) to add custom test groups or extra checks.
 
-### 4.4 Writing Elegant Test Cases
+### Writing Elegant Test Cases
 
 - Examine existing tests in [sglang/test](https://github.com/sgl-project/sglang/tree/main/test) for practical examples.
 - Keep each test function focused on a single scenario or piece of functionality.
@@ -164,7 +164,7 @@ python3 run_suite.py --suite minimal
 - Clean up resources to avoid side effects and preserve test independence.
 
-## 5. Tips for Newcomers
+## Tips for Newcomers
 
 If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow.
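Reviewer's note on the notebook changes above: all three added "Using Pydantic" cells repeat one pattern, deriving the constraint schema from a Pydantic model and then validating the constrained output with that same model. The round trip can be checked offline without a server; in this minimal sketch the response string is a hand-written stand-in, not real model output.

```python
# Offline sketch of the Pydantic round trip added in structured_outputs.ipynb:
# model -> JSON schema string -> validated response. No SGLang server involved.
import json

from pydantic import BaseModel, Field


class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


# This is the string the notebook passes as sampling_params["json_schema"].
schema_str = json.dumps(CapitalInfo.model_json_schema())
print(schema_str)

# Stand-in for a constrained generation result (hypothetical values).
stand_in_response = '{"name": "Paris", "population": 2102650}'
capital_info = CapitalInfo.model_validate_json(stand_in_response)
print(capital_info.model_dump_json())
```

In the chat-completions variant, the same `CapitalInfo.model_json_schema()` dict is passed under `response_format` instead of `sampling_params`.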