{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick Start: Sending Requests\n",
"This notebook provides a quick-start guide to use SGLang in chat completions after installation.\n",
"\n",
"- For Vision Language Models, see [OpenAI APIs - Vision](../backend/openai_api_vision.ipynb).\n",
"- For Embedding Models, see [OpenAI APIs - Embedding](../backend/openai_api_embeddings.ipynb) and [Encode (embedding model)](../backend/native_api.html#Encode-(embedding-model)).\n",
"- For Reward Models, see [Classify (reward model)](../backend/native_api.html#Classify-(reward-model))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server\n",
"\n",
"This code block is equivalent to executing \n",
"\n",
"```bash\n",
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
"--port 30000 --host 0.0.0.0\n",
"```\n",
"\n",
"in your terminal and wait for the server to be ready. Once the server is running, you can send test requests using curl or requests. The server implements the [OpenAI-compatible APIs](https://platform.openai.com/docs/api-reference/chat)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:48:52.032229Z",
"iopub.status.busy": "2024-11-07T18:48:52.032105Z",
"iopub.status.idle": "2024-11-07T18:49:20.226042Z",
"shell.execute_reply": "2024-11-07T18:49:20.225562Z"
}
},
"outputs": [],
"source": [
"from sglang.utils import (\n",
" execute_shell_command,\n",
" wait_for_server,\n",
" terminate_process,\n",
" print_highlight,\n",
")\n",
"\n",
"server_process = execute_shell_command(\n",
" \"\"\"\n",
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
"--port 30000 --host 0.0.0.0\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30000\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using cURL\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:49:20.228006Z",
"iopub.status.busy": "2024-11-07T18:49:20.227572Z",
"iopub.status.idle": "2024-11-07T18:49:20.469885Z",
"shell.execute_reply": "2024-11-07T18:49:20.469518Z"
}
},
"outputs": [],
"source": [
"import subprocess, json\n",
"\n",
"curl_command = \"\"\"\n",
"curl -s http://localhost:30000/v1/chat/completions \\\n",
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]}'\n",
"\"\"\"\n",
"\n",
"response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
"print_highlight(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Python Requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:49:20.471956Z",
"iopub.status.busy": "2024-11-07T18:49:20.471811Z",
"iopub.status.idle": "2024-11-07T18:49:20.667997Z",
"shell.execute_reply": "2024-11-07T18:49:20.667630Z"
}
},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = \"http://localhost:30000/v1/chat/completions\"\n",
"\n",
"data = {\n",
" \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Python Client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:49:20.669977Z",
"iopub.status.busy": "2024-11-07T18:49:20.669826Z",
"iopub.status.idle": "2024-11-07T18:49:22.004855Z",
"shell.execute_reply": "2024-11-07T18:49:22.004472Z"
}
},
"outputs": [],
"source": [
"import openai\n",
"\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"print_highlight(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Streaming"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:49:22.006983Z",
"iopub.status.busy": "2024-11-07T18:49:22.006858Z",
"iopub.status.idle": "2024-11-07T18:49:23.029098Z",
"shell.execute_reply": "2024-11-07T18:49:23.028697Z"
}
},
"outputs": [],
"source": [
"import openai\n",
"\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
"\n",
"# Use stream=True for streaming responses\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
" stream=True,\n",
")\n",
"\n",
"# Handle the streaming output\n",
"for chunk in response:\n",
" if chunk.choices[0].delta.content:\n",
" print(chunk.choices[0].delta.content, end=\"\", flush=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Native Generation APIs\n",
"\n",
"You can also use the native `/generate` endpoint with requests, which provides more flexiblity. An API reference is available at [Sampling Parameters](../references/sampling_params.md)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:49:23.031712Z",
"iopub.status.busy": "2024-11-07T18:49:23.031571Z",
"iopub.status.idle": "2024-11-07T18:49:23.787752Z",
"shell.execute_reply": "2024-11-07T18:49:23.787368Z"
}
},
"outputs": [],
"source": [
"import requests\n",
"\n",
"response = requests.post(\n",
" \"http://localhost:30000/generate\",\n",
" json={\n",
" \"text\": \"The capital of France is\",\n",
" \"sampling_params\": {\n",
" \"temperature\": 0,\n",
" \"max_new_tokens\": 32,\n",
" },\n",
" },\n",
")\n",
"\n",
"print_highlight(response.json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Streaming"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:49:23.789840Z",
"iopub.status.busy": "2024-11-07T18:49:23.789702Z",
"iopub.status.idle": "2024-11-07T18:49:24.545631Z",
"shell.execute_reply": "2024-11-07T18:49:24.545241Z"
}
},
"outputs": [],
"source": [
"import requests, json\n",
"\n",
"response = requests.post(\n",
" \"http://localhost:30000/generate\",\n",
" json={\n",
" \"text\": \"The capital of France is\",\n",
" \"sampling_params\": {\n",
" \"temperature\": 0,\n",
" \"max_new_tokens\": 32,\n",
" },\n",
" \"stream\": True,\n",
" },\n",
" stream=True,\n",
")\n",
"\n",
"prev = 0\n",
"for chunk in response.iter_lines(decode_unicode=False):\n",
" chunk = chunk.decode(\"utf-8\")\n",
" if chunk and chunk.startswith(\"data:\"):\n",
" if chunk == \"data: [DONE]\":\n",
" break\n",
" data = json.loads(chunk[5:].strip(\"\\n\"))\n",
" output = data[\"text\"]\n",
" print(output[prev:], end=\"\", flush=True)\n",
" prev = len(output)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-07T18:49:24.547641Z",
"iopub.status.busy": "2024-11-07T18:49:24.547497Z",
"iopub.status.idle": "2024-11-07T18:49:25.888864Z",
"shell.execute_reply": "2024-11-07T18:49:25.888114Z"
}
},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}