{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenAI APIs - Completions\n",
"\n",
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n",
"\n",
"This tutorial covers the following popular APIs:\n",
"\n",
"- `chat/completions`\n",
"- `completions`\n",
"\n",
"Check out other tutorials to learn about [vision APIs](openai_api_vision.ipynb) for vision-language models and [embedding APIs](openai_api_embeddings.ipynb) for embedding models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server\n",
"\n",
"Launch the server in your terminal and wait for it to initialize."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"server_process, port = launch_server_cmd(\n",
"    \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chat Completions\n",
"\n",
"### Usage\n",
"\n",
"The server implements the OpenAI-compatible Chat Completions API.\n",
"It automatically applies the chat template specified in the Hugging Face tokenizer, if one is available.\n",
"You can also specify a custom chat template with `--chat-template` when launching the server."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    messages=[\n",
"        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
"    ],\n",
"    temperature=0,\n",
"    max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model Thinking/Reasoning Support\n",
"\n",
"Some models support internal reasoning or thinking processes that can be exposed in the API response. SGLang provides unified support for various reasoning models through the `chat_template_kwargs` parameter and compatible reasoning parsers.\n",
"\n",
"#### Supported Models and Configuration\n",
"\n",
"| Model Family | Chat Template Parameter | Reasoning Parser | Notes |\n",
"|--------------|------------------------|------------------|--------|\n",
"| DeepSeek-R1 (R1, R1-0528, R1-Distill) | `enable_thinking` | `--reasoning-parser deepseek-r1` | Standard reasoning models |\n",
"| DeepSeek-V3.1 | `thinking` | `--reasoning-parser deepseek-v3` | Hybrid model (thinking/non-thinking modes) |\n",
"| Qwen3 (standard) | `enable_thinking` | `--reasoning-parser qwen3` | Hybrid model (thinking/non-thinking modes) |\n",
"| Qwen3-Thinking | N/A (always enabled) | `--reasoning-parser qwen3-thinking` | Always generates reasoning |\n",
"| Kimi | N/A (always enabled) | `--reasoning-parser kimi` | Kimi thinking models |\n",
"| GPT-OSS | N/A (always enabled) | `--reasoning-parser gpt-oss` | GPT-OSS thinking models |\n",
"\n",
"#### Basic Usage\n",
"\n",
"To enable reasoning output, you need to:\n",
"1. Launch the server with the appropriate reasoning parser.\n",
"2. Set the model-specific parameter in `chat_template_kwargs`.\n",
"3. Optionally set `separate_reasoning: False` if you do not want the reasoning content returned as a separate field (it defaults to `True`).\n",
"\n",
"**Note for Qwen3-Thinking models:** These models always generate thinking content and do not support the `enable_thinking` parameter. Use `--reasoning-parser qwen3-thinking` or `--reasoning-parser qwen3` to parse the thinking content.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example: Qwen3 Models\n",
"\n",
"```python\n",
"# Launch server:\n",
"# python3 -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3\n",
"\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI(\n",
"    api_key=\"EMPTY\",\n",
"    base_url=f\"http://127.0.0.1:{port}/v1\",\n",
")\n",
"\n",
"model = \"Qwen/Qwen3-32B\"\n",
"messages = [{\"role\": \"user\", \"content\": \"9.11 and 9.8, which is greater?\"}]\n",
"\n",
"response = client.chat.completions.create(\n",
"    model=model,\n",
"    messages=messages,\n",
"    extra_body={\n",
"        \"chat_template_kwargs\": {\"enable_thinking\": True},\n",
"        \"separate_reasoning\": True,\n",
"    },\n",
")\n",
"\n",
"print(\"Reasoning:\", response.choices[0].message.reasoning_content)\n",
"print(\"Answer:\", response.choices[0].message.content)\n",
"```\n",
"\n",
"**Output:**\n",
"```\n",
"Reasoning: Okay, so I need to figure out which number is greater between 9.11 and 9.8...\n",
"Answer: 9.8 is greater than 9.11.\n",
"```\n",
"\n",
"**Note:** Setting `\"enable_thinking\": False` (or omitting it) will result in `reasoning_content` being `None`. Qwen3-Thinking models always generate reasoning content and don't support the `enable_thinking` parameter.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example: DeepSeek-V3.1 Models\n",
"\n",
"DeepSeek-V3.1 models support thinking mode through the `thinking` parameter:\n",
"\n",
"```python\n",
"# Launch server:\n",
"# python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.1 --reasoning-parser deepseek-v3\n",
"\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI(\n",
"    api_key=\"EMPTY\",\n",
"    base_url=f\"http://127.0.0.1:{port}/v1\",\n",
")\n",
"\n",
"model = \"deepseek-ai/DeepSeek-V3.1\"\n",
"messages = [{\"role\": \"user\", \"content\": \"What is 2^8?\"}]\n",
"\n",
"response = client.chat.completions.create(\n",
"    model=model,\n",
"    messages=messages,\n",
"    extra_body={\n",
"        \"chat_template_kwargs\": {\"thinking\": True},\n",
"        \"separate_reasoning\": True,\n",
"    },\n",
")\n",
"\n",
"print(\"Reasoning:\", response.choices[0].message.reasoning_content)\n",
"print(\"Answer:\", response.choices[0].message.content)\n",
"```\n",
"\n",
"**Output:**\n",
"```\n",
"Reasoning: I need to calculate 2^8. Let me work through this step by step:\n",
"2^1 = 2\n",
"2^2 = 4\n",
"2^3 = 8\n",
"2^4 = 16\n",
"2^5 = 32\n",
"2^6 = 64\n",
"2^7 = 128\n",
"2^8 = 256\n",
"Answer: 2^8 equals 256.\n",
"```\n",
"\n",
"**Note:** DeepSeek-V3.1 models use the `thinking` parameter (not `enable_thinking`) to control reasoning output.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameters\n",
"\n",
"The chat completions API accepts the same parameters as the OpenAI Chat Completions API. Refer to the [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
"\n",
"SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.chat.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    messages=[\n",
"        {\n",
"            \"role\": \"system\",\n",
"            \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
"        },\n",
"        {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
"        {\n",
"            \"role\": \"assistant\",\n",
"            \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
"        },\n",
"        {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
"    ],\n",
"    temperature=0.3,  # Lower temperature for more focused responses\n",
"    max_tokens=128,  # Reasonable length for a concise response\n",
"    top_p=0.95,  # Slightly higher for better fluency\n",
"    presence_penalty=0.2,  # Mild penalty to avoid repetition\n",
"    frequency_penalty=0.2,  # Mild penalty for more natural language\n",
"    n=1,  # Single response is usually more stable\n",
"    seed=42,  # Keep for reproducibility\n",
")\n",
"\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Streaming mode is also supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stream = client.chat.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
"    stream=True,\n",
")\n",
"for chunk in stream:\n",
"    if chunk.choices[0].delta.content is not None:\n",
"        print(chunk.choices[0].delta.content, end=\"\")"
]
},
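{
"cell_type": "markdown",
"metadata": {},
"source": [
"When streaming, you may also want the token usage of the request. The next cell is a minimal sketch assuming the server honors OpenAI's `stream_options` with `include_usage`, which appends a final chunk that carries `usage` and an empty `choices` list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: request a final usage chunk while streaming.\n",
"# Assumes OpenAI-style `stream_options={\"include_usage\": True}` is\n",
"# supported; the last chunk then has empty `choices` and a populated\n",
"# `usage` field.\n",
"stream = client.chat.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
"    stream=True,\n",
"    stream_options={\"include_usage\": True},\n",
")\n",
"for chunk in stream:\n",
"    if chunk.choices and chunk.choices[0].delta.content is not None:\n",
"        print(chunk.choices[0].delta.content, end=\"\")\n",
"    if chunk.usage is not None:\n",
"        print(f\"\\n[usage] prompt={chunk.usage.prompt_tokens}, completion={chunk.usage.completion_tokens}\")"
]
},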
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Completions\n",
"\n",
"### Usage\n",
"\n",
"The Completions API is similar to the Chat Completions API, but it takes a plain text `prompt` instead of `messages` and does not apply a chat template."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    prompt=\"List 3 countries and their capitals.\",\n",
"    temperature=0,\n",
"    max_tokens=64,\n",
"    n=1,\n",
"    stop=None,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameters\n",
"\n",
"The completions API accepts the same parameters as the OpenAI Completions API. Refer to the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n",
"\n",
"Here is an example of a detailed completions request:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = client.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    prompt=\"Write a short story about a space explorer.\",\n",
"    temperature=0.7,  # Moderate temperature for creative writing\n",
"    max_tokens=150,  # Longer response for a story\n",
"    top_p=0.9,  # Balanced diversity in word choice\n",
"    stop=[\"\\n\\n\", \"THE END\"],  # Multiple stop sequences\n",
"    presence_penalty=0.3,  # Encourage novel elements\n",
"    frequency_penalty=0.3,  # Reduce repetitive phrases\n",
"    n=1,  # Generate one completion\n",
"    seed=123,  # For reproducible results\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
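{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Completions API also exposes classic completion-style options such as token log probabilities. The next cell is a minimal sketch assuming the server implements OpenAI-style `logprobs` for completions, i.e. an integer count of top alternatives per generated token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: token log probabilities with the Completions API.\n",
"# Assumes OpenAI-style `logprobs` (an integer count of top alternatives\n",
"# per generated token) is supported for completions.\n",
"response = client.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    prompt=\"The capital of France is\",\n",
"    temperature=0,\n",
"    max_tokens=5,\n",
"    logprobs=2,  # return the top-2 alternatives for each generated token\n",
")\n",
"\n",
"print_highlight(f\"Text: {response.choices[0].text}\")\n",
"print_highlight(f\"Logprobs: {response.choices[0].logprobs}\")"
]
},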
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Structured Outputs (JSON, Regex, EBNF)\n",
"\n",
"For the OpenAI-compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n"
]
},
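{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick preview, the next cell is a minimal sketch that constrains a chat completion to a JSON schema through the OpenAI-style `response_format` parameter. The `json_schema` wrapper and schema shape follow the OpenAI convention and are assumptions here; see the linked tutorial for the authoritative usage, including the regex and EBNF variants."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: constrain the output to a JSON schema via OpenAI-style\n",
"# `response_format`. See the Structured Outputs tutorial for the\n",
"# authoritative usage.\n",
"json_schema = {\n",
"    \"type\": \"object\",\n",
"    \"properties\": {\n",
"        \"name\": {\"type\": \"string\"},\n",
"        \"population\": {\"type\": \"integer\"},\n",
"    },\n",
"    \"required\": [\"name\", \"population\"],\n",
"}\n",
"\n",
"response = client.chat.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    messages=[\n",
"        {\"role\": \"user\", \"content\": \"Give information about the capital of France in JSON.\"},\n",
"    ],\n",
"    temperature=0,\n",
"    max_tokens=64,\n",
"    response_format={\n",
"        \"type\": \"json_schema\",\n",
"        \"json_schema\": {\"name\": \"capital_info\", \"schema\": json_schema},\n",
"    },\n",
")\n",
"\n",
"print_highlight(response.choices[0].message.content)"
]
},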
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}