Add openAI compatible API (#1810)

Co-authored-by: Chayenne <zhaochenyang@g.ucla.edu>
2024-10-27 10:51:42 -07:00
parent eaade87a42
commit 51c81e339b
7 changed files with 800 additions and 56 deletions
--- a/.github/workflows/deploy-docs.yml
+++ b/.github/workflows/deploy-docs.yml
@@ -10,12 +10,17 @@ on:
  workflow_dispatch:
 jobs:
-  execute-notebooks:
+  execute-and-deploy:
    runs-on: 1-gpu-runner
    if: github.repository == 'sgl-project/sglang'
    defaults:
      run:
        working-directory: docs
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
        with:
          path: .
      - name: Set up Python
        uses: actions/setup-python@v4
@@ -25,7 +30,9 @@ jobs:
      - name: Install dependencies
        run: |
          bash scripts/ci_install_dependency.sh
-          pip install -r docs/requirements.txt
+          pip install -r requirements.txt
          apt-get update
          apt-get install -y pandoc
      - name: Setup Jupyter Kernel
        run: |
@@ -33,7 +40,6 @@ jobs:
      - name: Execute notebooks
        run: |
          cd docs
          for nb in *.ipynb; do
            if [ -f "$nb" ]; then
              echo "Executing $nb"
@@ -43,36 +49,15 @@ jobs:
            fi
          done
  build-and-deploy:
    needs: execute-notebooks
    if: github.repository == 'sgl-project/sglang'
    runs-on: 1-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          bash scripts/ci_install_dependency.sh
          pip install -r docs/requirements.txt
          apt-get update
          apt-get install -y pandoc
      - name: Build documentation
        run: |
          cd docs
          make html
      - name: Push to sgl-project.github.io
        env:
          GITHUB_TOKEN: ${{ secrets.PAT_TOKEN }}
        run: |
-          cd docs/_build/html
+          cd _build/html
          git clone https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git ../sgl-project.github.io
          cp -r * ../sgl-project.github.io
          cd ../sgl-project.github.io
--- a/.github/workflows/execute-notebook.yml
+++ b/.github/workflows/execute-notebook.yml
@@ -1,12 +1,24 @@
 name: Execute Notebooks
 on:
  pull_request:
  push:
-    branches:
+    branches: [ main ]
-      - main
+    paths:
      - "python/sglang/**"
      - "docs/**"
  pull_request:
    branches: [ main ]
    paths:
      - "python/sglang/**"
      - "docs/**"
  workflow_dispatch:
 concurrency:
  group: execute-notebook-${{ github.ref }}
  cancel-in-progress: true
 jobs:
  run-all-notebooks:
    runs-on: 1-gpu-runner
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -10,6 +10,9 @@ repos:
    rev: 24.10.0
    hooks:
      - id: black
        additional_dependencies: ['.[jupyter]']
        types: [python, jupyter]
        types_or: [python, jupyter]
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
--- a/docs/embedding_model.ipynb
+++ b/docs/embedding_model.ipynb
@@ -4,19 +4,32 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# Embedding Model"
+    "# Embedding Model\n",
    "\n",
    "SGLang supports embedding models in the same way as completion models. Here are some example models:\n",
    "\n",
    "- [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct)\n",
    "- [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Launch A Server"
+    "## Launch A Server\n",
    "\n",
    "The following code is equivalent to running this in the shell:\n",
    "```bash\n",
    "python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
    "    --port 30010 --host 0.0.0.0 --is-embedding --log-level error\n",
    "```\n",
    "\n",
    "Remember to add `--is-embedding` to the command."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
@@ -28,14 +41,14 @@
    }
   ],
   "source": [
    "# Equivalent to running this in the shell:\n",
    "# python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --port 30010 --host 0.0.0.0 --is-embedding --log-level error\n",
    "from sglang.utils import execute_shell_command, wait_for_server, terminate_process\n",
    "\n",
-    "embedding_process = execute_shell_command(\"\"\"\n",
+    "embedding_process = execute_shell_command(\n",
    "    \"\"\"\n",
    "python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
    "    --port 30010 --host 0.0.0.0 --is-embedding --log-level error\n",
-    "\"\"\")\n",
+    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(\"http://localhost:30010\")\n",
    "\n",
@@ -51,25 +64,32 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "[0.0083160400390625, 0.0006804466247558594, -0.00809478759765625, -0.0006995201110839844, 0.0143890380859375, -0.0090179443359375, 0.01238250732421875, 0.00209808349609375, 0.0062103271484375, -0.003047943115234375]\n"
+      "Text embedding (first 10): [0.0083160400390625, 0.0006804466247558594, -0.00809478759765625, -0.0006995201110839844, 0.0143890380859375, -0.0090179443359375, 0.01238250732421875, 0.00209808349609375, 0.0062103271484375, -0.003047943115234375]\n"
     ]
    }
   ],
   "source": [
-    "# Get the first 10 elements of the embedding\n",
+    "import subprocess, json\n",
    "\n",
-    "! curl -s http://localhost:30010/v1/embeddings \\\n",
+    "text = \"Once upon a time\"\n",
    "\n",
    "curl_text = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
    "  -H \"Content-Type: application/json\" \\\n",
    "  -H \"Authorization: Bearer None\" \\\n",
-    "  -d '{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": \"Once upon a time\"}' \\\n",
+    "  -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": \"{text}\"}}'\"\"\"\n",
-    "  | python3 -c \"import sys, json; print(json.load(sys.stdin)['data'][0]['embedding'][:10])\""
+    "\n",
    "text_embedding = json.loads(subprocess.check_output(curl_text, shell=True))[\"data\"][0][\n",
    "    \"embedding\"\n",
    "]\n",
    "\n",
    "print(f\"Text embedding (first 10): {text_embedding[:10]}\")"
   ]
  },
  {
@@ -81,37 +101,79 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "[0.00603485107421875, -0.0190582275390625, -0.01273345947265625, 0.01552581787109375, 0.0066680908203125, -0.0135955810546875, 0.01131439208984375, 0.0013713836669921875, -0.0089874267578125, 0.021759033203125]\n"
+      "Text embedding (first 10): [0.00829315185546875, 0.0007004737854003906, -0.00809478759765625, -0.0006799697875976562, 0.01438140869140625, -0.00897979736328125, 0.0123748779296875, 0.0020923614501953125, 0.006195068359375, -0.0030498504638671875]\n"
     ]
    }
   ],
   "source": [
    "import openai\n",
    "\n",
-    "client = openai.Client(\n",
+    "client = openai.Client(base_url=\"http://127.0.0.1:30010/v1\", api_key=\"None\")\n",
    "    base_url=\"http://127.0.0.1:30010/v1\", api_key=\"None\"\n",
    ")\n",
    "\n",
    "# Text embedding example\n",
    "response = client.embeddings.create(\n",
    "    model=\"Alibaba-NLP/gte-Qwen2-7B-instruct\",\n",
-    "    input=\"How are you today\",\n",
+    "    input=text,\n",
    ")\n",
    "\n",
    "embedding = response.data[0].embedding[:10]\n",
-    "print(embedding)"
+    "print(f\"Text embedding (first 10): {embedding}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Input IDs\n",
    "\n",
    "SGLang also supports `input_ids` as input to get the embedding."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input IDs embedding (first 10): [0.00829315185546875, 0.0007004737854003906, -0.00809478759765625, -0.0006799697875976562, 0.01438140869140625, -0.00897979736328125, 0.0123748779296875, 0.0020923614501953125, 0.006195068359375, -0.0030498504638671875]\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "import os\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"Alibaba-NLP/gte-Qwen2-7B-instruct\")\n",
    "input_ids = tokenizer.encode(text)\n",
    "\n",
    "curl_ids = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
    "  -H \"Content-Type: application/json\" \\\n",
    "  -H \"Authorization: Bearer None\" \\\n",
    "  -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": {json.dumps(input_ids)}}}'\"\"\"\n",
    "\n",
    "input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))[\"data\"][\n",
    "    0\n",
    "][\"embedding\"]\n",
    "\n",
    "print(f\"Input IDs embedding (first 10): {input_ids_embedding[:10]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -16,12 +16,14 @@ The core features include:
   :caption: Getting Started
   install.md
   send_request.ipynb
 .. toctree::
   :maxdepth: 1
   :caption: Backend Tutorial
   openai_api.ipynb
   backend.md
@@ -43,3 +45,4 @@ The core features include:
   choices_methods.md
   benchmark_and_profiling.md
   troubleshooting.md
   embedding_model.ipynb
--- a/docs/openai_api.ipynb
+++ b/docs/openai_api.ipynb
@@ -0,0 +1,676 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# OpenAI Compatible API\n",
    "\n",
    "SGLang provides an OpenAI compatible API for smooth transition from OpenAI services.\n",
    "\n",
    "- `chat/completions`\n",
    "- `completions`\n",
    "- `batches`\n",
    "- `embeddings`(refer to [embedding_model.ipynb](embedding_model.ipynb))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chat Completions\n",
    "\n",
    "### Usage\n",
    "\n",
    "Similar to [send_request.ipynb](send_request.ipynb), we can send a chat completion request to SGLang server with OpenAI API format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Server is ready. Proceeding with the next steps.\n"
     ]
    }
   ],
   "source": [
    "from sglang.utils import execute_shell_command, wait_for_server, terminate_process\n",
    "\n",
    "server_process = execute_shell_command(\n",
    "    \"\"\"\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "--port 30000 --host 0.0.0.0 --log-level warning\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(\"http://localhost:30000\")\n",
    "print(\"Server is ready. Proceeding with the next steps.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ChatCompletion(id='e854540ec7914b2d8c712f16fd9ed2ca', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\\n\\n1. **Country:** Japan\\n**Capital:** Tokyo\\n\\n2. **Country:** Australia\\n**Capital:** Canberra\\n\\n3. **Country:** Brazil\\n**Capital:** Brasília', refusal=None, role='assistant', function_call=None, tool_calls=None), matched_stop=128009)], created=1730012326, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, prompt_tokens_details=None))\n"
     ]
    }
   ],
   "source": [
    "import openai\n",
    "\n",
    "# Always assign an api_key, even if not specified during server initialization.\n",
    "# Setting an API key during server initialization is strongly recommended.\n",
    "\n",
    "client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "# Chat completion example\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    ")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters\n",
    "\n",
    "The chat completions API accepts the following parameters (refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details):\n",
    "\n",
    "- `messages`: List of messages in the conversation, each containing `role` and `content`\n",
    "- `model`: The model identifier to use for completion\n",
    "- `max_tokens`: Maximum number of tokens to generate in the response\n",
    "- `temperature`: Controls randomness (0-2). Lower values make output more focused and deterministic\n",
    "- `top_p`: Alternative to temperature. Controls diversity via nucleus sampling\n",
    "- `n`: Number of chat completion choices to generate\n",
    "- `stream`: If true, partial message deltas will be sent as they become available\n",
    "- `stop`: Sequences where the API will stop generating further tokens\n",
    "- `presence_penalty`: Penalizes new tokens based on their presence in the text so far (-2.0 to 2.0)\n",
    "- `frequency_penalty`: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0)\n",
    "- `logit_bias`: Modify the likelihood of specified tokens appearing in the completion\n",
    "- `logprobs`: Include log probabilities of tokens in the response\n",
    "- `top_logprobs`: Number of most likely tokens to return probabilities for\n",
    "- `seed`: Random seed for deterministic results\n",
    "- `response_format`: Specify the format of the response (e.g., JSON)\n",
    "- `stream_options`: Additional options for streaming responses\n",
    "- `user`: A unique identifier representing your end-user\n",
    "\n",
    "Here is an example of a detailed chat completion request:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ancient Rome's major achievements include:"
     ]
    }
   ],
   "source": [
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
    "        {\n",
    "            \"role\": \"assistant\",\n",
    "            \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
    "    ],\n",
    "    temperature=0.3,  # Lower temperature for more focused responses\n",
    "    max_tokens=100,  # Reasonable length for a concise response\n",
    "    top_p=0.95,  # Slightly higher for better fluency\n",
    "    stop=[\"\\n\\n\"],  # Simple stop sequence\n",
    "    presence_penalty=0.2,  # Mild penalty to avoid repetition\n",
    "    frequency_penalty=0.2,  # Mild penalty for more natural language\n",
    "    n=1,  # Single response is usually more stable\n",
    "    seed=42,  # Keep for reproducibility\n",
    "    stream=True,  # Keep streaming for real-time output\n",
    ")\n",
    "\n",
    "for chunk in response:\n",
    "    print(chunk.choices[0].delta.content or \"\", end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Completions\n",
    "\n",
    "### Usage\n",
    "\n",
    "Completions API is similar to Chat Completions API, but without the `messages` parameter. Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Completion(id='a6e07198f4b445baa0fb08a2178ceb59', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. 2. 3.\\n1.  United States - Washington D.C. 2.  Japan - Tokyo 3.  Australia - Canberra\\nList 3 countries and their capitals. 1. 2. 3.\\n1.  China - Beijing 2.  Brazil - Bras', matched_stop=None)], created=1730012328, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=9, total_tokens=73, prompt_tokens_details=None))\n"
     ]
    }
   ],
   "source": [
    "response = client.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    prompt=\"List 3 countries and their capitals.\",\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    "    n=1,\n",
    "    stop=None,\n",
    ")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters\n",
    "\n",
    "The completions API accepts the following parameters:\n",
    "\n",
    "- `model`: The model identifier to use for completion\n",
    "- `prompt`: Input text to generate completions for. Can be a string, array of strings, or token arrays\n",
    "- `best_of`: Number of completions to generate server-side and return the best one\n",
    "- `echo`: If true, the prompt will be included in the response\n",
    "- `frequency_penalty`: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0)\n",
    "- `logit_bias`: Modify the likelihood of specified tokens appearing in the completion\n",
    "- `logprobs`: Include log probabilities of tokens in the response\n",
    "- `max_tokens`: Maximum number of tokens to generate in the response (default: 16)\n",
    "- `n`: Number of completion choices to generate\n",
    "- `presence_penalty`: Penalizes new tokens based on their presence in the text so far (-2.0 to 2.0)\n",
    "- `seed`: Random seed for deterministic results\n",
    "- `stop`: Sequences where the API will stop generating further tokens\n",
    "- `stream`: If true, partial completion deltas will be sent as they become available\n",
    "- `stream_options`: Additional options for streaming responses\n",
    "- `suffix`: Text to append to the completion\n",
    "- `temperature`: Controls randomness (0-2). Lower values make output more focused and deterministic\n",
    "- `top_p`: Alternative to temperature. Controls diversity via nucleus sampling\n",
    "- `user`: A unique identifier representing your end-user\n",
    "\n",
    "Here is an example of a detailed completions request:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Space explorer, Captain Orion Blackwood, had been traveling through the galaxy for 12 years, searching for a new home for humanity. His ship, the Aurora, had been his home for so long that he barely remembered what it was like to walk on solid ground.\n",
      "As he navigated through the dense asteroid field, the ship's computer, S.A.R.A. (Self-Aware Reasoning Algorithm), alerted him to a strange reading on one of the asteroids. Captain Blackwood's curiosity was piqued, and he decided to investigate further.\n",
      "\"Captain, I'm detecting unusual energy signatures emanating from the asteroid,\" S.A.R.A. said. \"It's unlike anything I've seen before.\"\n",
      "Captain Blackwood's eyes narrowed as"
     ]
    }
   ],
   "source": [
    "response = client.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    prompt=\"Write a short story about a space explorer.\",\n",
    "    temperature=0.7,  # Moderate temperature for creative writing\n",
    "    max_tokens=150,  # Longer response for a story\n",
    "    top_p=0.9,  # Balanced diversity in word choice\n",
    "    stop=[\"\\n\\n\", \"THE END\"],  # Multiple stop sequences\n",
    "    presence_penalty=0.3,  # Encourage novel elements\n",
    "    frequency_penalty=0.3,  # Reduce repetitive phrases\n",
    "    n=1,  # Generate one completion\n",
    "    seed=123,  # For reproducible results\n",
    "    stream=True,  # Stream the response\n",
    ")\n",
    "\n",
    "for chunk in response:\n",
    "    print(chunk.choices[0].text or \"\", end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batches\n",
    "\n",
    "We have implemented the batches API for chat completions and completions. You can upload your requests in `jsonl` files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).\n",
    "\n",
    "The batches APIs are:\n",
    "\n",
    "- `batches`\n",
    "- `batches/{batch_id}/cancel`\n",
    "- `batches/{batch_id}`\n",
    "\n",
    "Here is an example of a batch job for chat completions, completions are similar.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch job created with ID: batch_03d7f74f-dffe-4c26-b5e7-bb9fb5cb89ff\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "requests = [\n",
    "    {\n",
    "        \"custom_id\": \"request-1\",\n",
    "        \"method\": \"POST\",\n",
    "        \"url\": \"/chat/completions\",\n",
    "        \"body\": {\n",
    "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "            \"messages\": [\n",
    "                {\"role\": \"user\", \"content\": \"Tell me a joke about programming\"}\n",
    "            ],\n",
    "            \"max_tokens\": 50,\n",
    "        },\n",
    "    },\n",
    "    {\n",
    "        \"custom_id\": \"request-2\",\n",
    "        \"method\": \"POST\",\n",
    "        \"url\": \"/chat/completions\",\n",
    "        \"body\": {\n",
    "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "            \"messages\": [{\"role\": \"user\", \"content\": \"What is Python?\"}],\n",
    "            \"max_tokens\": 50,\n",
    "        },\n",
    "    },\n",
    "]\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    file_response = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_response = client.batches.create(\n",
    "    input_file_id=file_response.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print(f\"Batch job created with ID: {batch_response.id}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch job status: validating...trying again in 3 seconds...\n",
      "Batch job completed successfully!\n",
      "Request counts: BatchRequestCounts(completed=2, failed=0, total=2)\n",
      "\n",
      "Request request-1:\n",
      "Response: {'status_code': 200, 'request_id': 'request-1', 'body': {'id': 'request-1', 'object': 'chat.completion', 'created': 1730012333, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': 'Why do programmers prefer dark mode?\\n\\nBecause light attracts bugs.'}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}, 'usage': {'prompt_tokens': 41, 'completion_tokens': 13, 'total_tokens': 54}, 'system_fingerprint': None}}\n",
      "\n",
      "Request request-2:\n",
      "Response: {'status_code': 200, 'request_id': 'request-2', 'body': {'id': 'request-2', 'object': 'chat.completion', 'created': 1730012333, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': '**What is Python?**\\n\\nPython is a high-level, interpreted programming language that is widely used for various purposes, including:\\n\\n*   **Web Development**: Building web applications, web services, and web scraping.\\n*   **Data Science**: Data analysis'}, 'logprobs': None, 'finish_reason': 'length', 'matched_stop': None}, 'usage': {'prompt_tokens': 39, 'completion_tokens': 50, 'total_tokens': 89}, 'system_fingerprint': None}}\n",
      "\n",
      "Cleaning up files...\n"
     ]
    }
   ],
   "source": [
    "while batch_response.status not in [\"completed\", \"failed\", \"cancelled\"]:\n",
    "    time.sleep(3)\n",
    "    print(f\"Batch job status: {batch_response.status}...trying again in 3 seconds...\")\n",
    "    batch_response = client.batches.retrieve(batch_response.id)\n",
    "\n",
    "if batch_response.status == \"completed\":\n",
    "    print(\"Batch job completed successfully!\")\n",
    "    print(f\"Request counts: {batch_response.request_counts}\")\n",
    "\n",
    "    result_file_id = batch_response.output_file_id\n",
    "    file_response = client.files.content(result_file_id)\n",
    "    result_content = file_response.read().decode(\"utf-8\")\n",
    "\n",
    "    results = [\n",
    "        json.loads(line) for line in result_content.split(\"\\n\") if line.strip() != \"\"\n",
    "    ]\n",
    "\n",
    "    for result in results:\n",
    "        print(f\"\\nRequest {result['custom_id']}:\")\n",
    "        print(f\"Response: {result['response']}\")\n",
    "\n",
    "    print(\"\\nCleaning up files...\")\n",
    "    # Only delete the result file ID since file_response is just content\n",
    "    client.files.delete(result_file_id)\n",
    "else:\n",
    "    print(f\"Batch job failed with status: {batch_response.status}\")\n",
    "    if hasattr(batch_response, \"errors\"):\n",
    "        print(f\"Errors: {batch_response.errors}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It takes a while to complete the batch job. You can use these two APIs to retrieve the batch job status or cancel the batch job.\n",
    "\n",
    "1. `batches/{batch_id}`: Retrieve the batch job status.\n",
    "2. `batches/{batch_id}/cancel`: Cancel the batch job.\n",
    "\n",
    "Here is an example to check the batch job status."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Created batch job with ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d\n",
      "Initial status: validating\n",
      "Batch job details (check 1/5):\n",
      "ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d\n",
      "Status: in_progress\n",
      "Created at: 1730012334\n",
      "Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b\n",
      "Output file ID: None\n",
      "Request counts:\n",
      "Total: 0\n",
      "Completed: 0\n",
      "Failed: 0\n",
      "Batch job details (check 2/5):\n",
      "ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d\n",
      "Status: in_progress\n",
      "Created at: 1730012334\n",
      "Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b\n",
      "Output file ID: None\n",
      "Request counts:\n",
      "Total: 0\n",
      "Completed: 0\n",
      "Failed: 0\n",
      "Batch job details (check 3/5):\n",
      "ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d\n",
      "Status: in_progress\n",
      "Created at: 1730012334\n",
      "Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b\n",
      "Output file ID: None\n",
      "Request counts:\n",
      "Total: 0\n",
      "Completed: 0\n",
      "Failed: 0\n",
      "Batch job details (check 4/5):\n",
      "ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d\n",
      "Status: completed\n",
      "Created at: 1730012334\n",
      "Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b\n",
      "Output file ID: backend_result_file-d32f441d-e737-4da3-b07a-c39349425b3a\n",
      "Request counts:\n",
      "Total: 100\n",
      "Completed: 100\n",
      "Failed: 0\n",
      "Batch job details (check 5/5):\n",
      "ID: batch_6b9625ac-9ebc-4c4f-bfd5-f84f88b0100d\n",
      "Status: completed\n",
      "Created at: 1730012334\n",
      "Input file ID: backend_input_file-8203d42a-109c-4573-9663-13b5d9cb6a2b\n",
      "Output file ID: backend_result_file-d32f441d-e737-4da3-b07a-c39349425b3a\n",
      "Request counts:\n",
      "Total: 100\n",
      "Completed: 100\n",
      "Failed: 0\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "requests = []\n",
    "for i in range(100):\n",
    "    requests.append(\n",
    "        {\n",
    "            \"custom_id\": f\"request-{i}\",\n",
    "            \"method\": \"POST\",\n",
    "            \"url\": \"/chat/completions\",\n",
    "            \"body\": {\n",
    "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "                \"messages\": [\n",
    "                    {\n",
    "                        \"role\": \"system\",\n",
    "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n",
    "                    },\n",
    "                    {\n",
    "                        \"role\": \"user\",\n",
    "                        \"content\": \"Write a detailed story about topic. Make it very long.\",\n",
    "                    },\n",
    "                ],\n",
    "                \"max_tokens\": 500,\n",
    "            },\n",
    "        }\n",
    "    )\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_job = client.batches.create(\n",
    "    input_file_id=uploaded_file.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print(f\"Created batch job with ID: {batch_job.id}\")\n",
    "print(f\"Initial status: {batch_job.status}\")\n",
    "\n",
    "time.sleep(10)\n",
    "\n",
    "max_checks = 5\n",
    "for i in range(max_checks):\n",
    "    batch_details = client.batches.retrieve(batch_id=batch_job.id)\n",
    "    print(f\"Batch job details (check {i+1}/{max_checks}):\")\n",
    "    print(f\"ID: {batch_details.id}\")\n",
    "    print(f\"Status: {batch_details.status}\")\n",
    "    print(f\"Created at: {batch_details.created_at}\")\n",
    "    print(f\"Input file ID: {batch_details.input_file_id}\")\n",
    "    print(f\"Output file ID: {batch_details.output_file_id}\")\n",
    "\n",
    "    print(\"Request counts:\")\n",
    "    print(f\"Total: {batch_details.request_counts.total}\")\n",
    "    print(f\"Completed: {batch_details.request_counts.completed}\")\n",
    "    print(f\"Failed: {batch_details.request_counts.failed}\")\n",
    "\n",
    "    time.sleep(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is an example to cancel a batch job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Created batch job with ID: batch_3d2dd881-ad84-465a-85ee-6d5991794e5e\n",
      "Initial status: validating\n",
      "Cancellation initiated. Status: cancelling\n",
      "Current status: cancelled\n",
      "Batch job successfully cancelled\n",
      "Successfully cleaned up input file\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "requests = []\n",
    "for i in range(500):\n",
    "    requests.append(\n",
    "        {\n",
    "            \"custom_id\": f\"request-{i}\",\n",
    "            \"method\": \"POST\",\n",
    "            \"url\": \"/chat/completions\",\n",
    "            \"body\": {\n",
    "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "                \"messages\": [\n",
    "                    {\n",
    "                        \"role\": \"system\",\n",
    "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n",
    "                    },\n",
    "                    {\n",
    "                        \"role\": \"user\",\n",
    "                        \"content\": \"Write a detailed story about topic. Make it very long.\",\n",
    "                    },\n",
    "                ],\n",
    "                \"max_tokens\": 500,\n",
    "            },\n",
    "        }\n",
    "    )\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_job = client.batches.create(\n",
    "    input_file_id=uploaded_file.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print(f\"Created batch job with ID: {batch_job.id}\")\n",
    "print(f\"Initial status: {batch_job.status}\")\n",
    "\n",
    "time.sleep(10)\n",
    "\n",
    "try:\n",
    "    cancelled_job = client.batches.cancel(batch_id=batch_job.id)\n",
    "    print(f\"Cancellation initiated. Status: {cancelled_job.status}\")\n",
    "    assert cancelled_job.status == \"cancelling\"\n",
    "\n",
    "    # Monitor the cancellation process\n",
    "    while cancelled_job.status not in [\"failed\", \"cancelled\"]:\n",
    "        time.sleep(3)\n",
    "        cancelled_job = client.batches.retrieve(batch_job.id)\n",
    "        print(f\"Current status: {cancelled_job.status}\")\n",
    "\n",
    "    # Verify final status\n",
    "    assert cancelled_job.status == \"cancelled\"\n",
    "    print(\"Batch job successfully cancelled\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"Error during cancellation: {e}\")\n",
    "    raise e\n",
    "\n",
    "finally:\n",
    "    try:\n",
    "        del_response = client.files.delete(uploaded_file.id)\n",
    "        if del_response.deleted:\n",
    "            print(\"Successfully cleaned up input file\")\n",
    "    except Exception as e:\n",
    "        print(f\"Error cleaning up: {e}\")\n",
    "        raise e"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "terminate_process(server_process)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "AlphaMeemory",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/docs/send_request.ipynb
+++ b/docs/send_request.ipynb
@@ -4,7 +4,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# Quick Start"
+    "# Quick Start: Launch A Server and Send Requests\n",
    "\n",
    "This section provides a quick start guide to using SGLang after installation."
   ]
  },
  {
@@ -13,12 +15,13 @@
   "source": [
    "## Launch a server\n",
    "\n",
-    "This code uses `subprocess.Popen` to start an SGLang server process, equivalent to executing \n",
+    "This code block is equivalent to executing \n",
    "\n",
    "```bash\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "--port 30000 --host 0.0.0.0 --log-level warning\n",
    "```\n",
    "\n",
    "in your command line and wait for the server to be ready."
   ]
  },
@@ -39,10 +42,12 @@
    "from sglang.utils import execute_shell_command, wait_for_server, terminate_process\n",
    "\n",
    "\n",
-    "server_process = execute_shell_command(\"\"\"\n",
+    "server_process = execute_shell_command(\n",
    "    \"\"\"\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "--port 30000 --host 0.0.0.0 --log-level warning\n",
-    "\"\"\")\n",
+    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(\"http://localhost:30000\")\n",
    "print(\"Server is ready. Proceeding with the next steps.\")"
@@ -105,9 +110,7 @@
    "# Always assign an api_key, even if not specified during server initialization.\n",
    "# Setting an API key during server initialization is strongly recommended.\n",
    "\n",
-    "client = openai.Client(\n",
+    "client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "    base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\"\n",
    ")\n",
    "\n",
    "# Chat completion example\n",
    "\n",