{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Offline Engine API\n",
    "\n",
    "SGLang provides a direct inference engine without the need for an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. Here are two general use cases:\n",
    "\n",
    "- Offline Batch Inference\n",
    "- Custom Server on Top of the Engine\n",
    "\n",
    "This document focuses on offline batch inference, demonstrating four inference modes:\n",
    "\n",
    "- Non-streaming synchronous generation\n",
    "- Streaming synchronous generation\n",
    "- Non-streaming asynchronous generation\n",
    "- Streaming asynchronous generation\n",
    "\n",
    "Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Offline Batch Inference\n",
    "\n",
    "The SGLang offline engine supports batch inference with efficient scheduling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-07T18:46:04.789536Z",
     "iopub.status.busy": "2024-11-07T18:46:04.789418Z",
     "iopub.status.idle": "2024-11-07T18:46:27.038169Z",
     "shell.execute_reply": "2024-11-07T18:46:27.037540Z"
    }
   },
   "outputs": [],
   "source": [
    "# Launch the offline engine\n",
    "import asyncio\n",
    "\n",
    "import sglang as sgl\n",
    "from sglang.utils import print_highlight\n",
    "\n",
    "llm = sgl.Engine(model_path=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-07T18:46:27.040005Z",
     "iopub.status.busy": "2024-11-07T18:46:27.039872Z",
     "iopub.status.idle": "2024-11-07T18:46:30.203840Z",
     "shell.execute_reply": "2024-11-07T18:46:30.203368Z"
    }
   },
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The president of the United States is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "outputs = llm.generate(prompts, sampling_params)\n",
    "for prompt, output in zip(prompts, outputs):\n",
    "    print_highlight(\"===============================\")\n",
    "    print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-07T18:46:30.205880Z",
     "iopub.status.busy": "2024-11-07T18:46:30.205719Z",
     "iopub.status.idle": "2024-11-07T18:46:39.256561Z",
     "shell.execute_reply": "2024-11-07T18:46:39.255880Z"
    }
   },
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print_highlight(\"\\n=== Testing synchronous streaming generation ===\")\n",
    "\n",
    "for prompt in prompts:\n",
    "    print_highlight(f\"\\nPrompt: {prompt}\")\n",
    "    print(\"Generated text: \", end=\"\", flush=True)\n",
    "\n",
    "    for chunk in llm.generate(prompt, sampling_params, stream=True):\n",
    "        print(chunk[\"text\"], end=\"\", flush=True)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-07T18:46:39.259464Z",
     "iopub.status.busy": "2024-11-07T18:46:39.259309Z",
     "iopub.status.idle": "2024-11-07T18:46:42.384955Z",
     "shell.execute_reply": "2024-11-07T18:46:42.384378Z"
    }
   },
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print_highlight(\"\\n=== Testing asynchronous batch generation ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    outputs = await llm.async_generate(prompts, sampling_params)\n",
    "\n",
    "    for prompt, output in zip(prompts, outputs):\n",
    "        print_highlight(f\"\\nPrompt: {prompt}\")\n",
    "        print_highlight(f\"Generated text: {output['text']}\")\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-07T18:46:42.387431Z",
     "iopub.status.busy": "2024-11-07T18:46:42.387279Z",
     "iopub.status.idle": "2024-11-07T18:46:51.448572Z",
     "shell.execute_reply": "2024-11-07T18:46:51.447781Z"
    }
   },
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print_highlight(\"\\n=== Testing asynchronous streaming generation ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    for prompt in prompts:\n",
    "        print_highlight(f\"\\nPrompt: {prompt}\")\n",
    "        print(\"Generated text: \", end=\"\", flush=True)\n",
    "\n",
    "        generator = await llm.async_generate(prompt, sampling_params, stream=True)\n",
    "        async for chunk in generator:\n",
    "            print(chunk[\"text\"], end=\"\", flush=True)\n",
    "        print()\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-07T18:46:51.451177Z",
     "iopub.status.busy": "2024-11-07T18:46:51.450952Z",
     "iopub.status.idle": "2024-11-07T18:46:51.497530Z",
     "shell.execute_reply": "2024-11-07T18:46:51.496850Z"
    }
   },
   "outputs": [],
   "source": [
    "llm.shutdown()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "AlphaMeemory",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}