{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "# SGLang Native APIs\n",
    "\n",
    "Apart from the OpenAI-compatible APIs, the SGLang Runtime also provides its own native server APIs. We introduce the following APIs:\n",
    "\n",
    "- `/generate` (text generation model)\n",
    "- `/get_model_info`\n",
    "- `/get_server_info`\n",
    "- `/health`\n",
    "- `/health_generate`\n",
    "- `/flush_cache`\n",
    "- `/update_weights_from_disk`\n",
    "- `/encode` (embedding model)\n",
    "- `/v1/rerank` (cross-encoder rerank model)\n",
    "- `/classify` (reward model)\n",
    "- `/start_expert_distribution_record`\n",
    "- `/stop_expert_distribution_record`\n",
    "- `/dump_expert_distribution_record`\n",
    "- `/tokenize`\n",
    "- `/detokenize`\n",
    "\n",
    "A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py).\n",
    "\n",
    "We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server"
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sglang.test.doc_patch import launch_server_cmd\n",
    "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
    "\n",
    "server_process, port = launch_server_cmd(\n",
    "    \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
    ")\n",
    "\n",
    "wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Generate (text generation model)\n",
    "Generate completions. This is similar to `/v1/completions` in the OpenAI API. Detailed parameters can be found in the [sampling parameters](sampling_params.md)."
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "url = f\"http://localhost:{port}/generate\"\n",
    "data = {\"text\": \"What is the capital of France?\"}\n",
    "\n",
    "response = requests.post(url, json=data)\n",
    "print_highlight(response.json())"
]
},
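  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The request body can also carry a `sampling_params` object. Below is a minimal sketch assuming two common fields from the [sampling parameters](sampling_params.md) documentation, `temperature` and `max_new_tokens`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a minimal sketch of /generate with explicit sampling parameters;\n",
    "# see the sampling parameters documentation for the full list of fields\n",
    "data = {\n",
    "    \"text\": \"What is the capital of France?\",\n",
    "    \"sampling_params\": {\"temperature\": 0.7, \"max_new_tokens\": 64},\n",
    "}\n",
    "\n",
    "response = requests.post(url, json=data)\n",
    "print_highlight(response.json())"
   ]
  },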
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get Model Info\n",
"\n",
    "Get the information of the model.\n",
    "\n",
    "- `model_path`: The path/name of the model.\n",
    "- `is_generation`: Whether the model is used as a generation model or an embedding model.\n",
    "- `tokenizer_path`: The path/name of the tokenizer.\n",
    "- `preferred_sampling_params`: The default sampling params specified via `--preferred-sampling-params`. `None` is returned in this example as we did not explicitly configure it in the server args.\n",
    "- `weight_version`: This field contains the version of the model weights. This is often used to track changes or updates to the model's trained parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = f\"http://localhost:{port}/get_model_info\"\n",
    "\n",
    "response = requests.get(url)\n",
    "response_json = response.json()\n",
    "print_highlight(response_json)\n",
    "assert response_json[\"model_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
    "assert response_json[\"is_generation\"] is True\n",
    "assert response_json[\"tokenizer_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
    "assert response_json[\"preferred_sampling_params\"] is None\n",
    "assert response_json.keys() == {\n",
    "    \"model_path\",\n",
    "    \"is_generation\",\n",
    "    \"tokenizer_path\",\n",
    "    \"preferred_sampling_params\",\n",
    "    \"weight_version\",\n",
    "}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Get Server Info\n",
    "Get the server information, including CLI arguments, token limits, and memory pool sizes.\n",
    "- Note: `get_server_info` merges the following deprecated endpoints:\n",
    "  - `get_server_args`\n",
    "  - `get_memory_pool_size`\n",
    "  - `get_max_total_num_tokens`"
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = f\"http://localhost:{port}/get_server_info\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.text)"
]
},
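  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The response is one large JSON object. As a sketch, you can pull individual fields out of it; the key names below are assumptions based on the merged endpoints, so check the full payload above if a field comes back as `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "server_info = response.json()\n",
    "\n",
    "# NOTE: these key names are assumptions based on the merged endpoints;\n",
    "# inspect the full payload above if a field is missing\n",
    "print_highlight(f\"max_total_num_tokens: {server_info.get('max_total_num_tokens')}\")\n",
    "print_highlight(f\"model_path: {server_info.get('model_path')}\")"
   ]
  },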
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Health Check\n",
    "- `/health`: Check the health of the server.\n",
    "- `/health_generate`: Check the health of the server by generating one token."
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = f\"http://localhost:{port}/health_generate\"\n",
    "\n",
    "response = requests.get(url)\n",
"print_highlight(response.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = f\"http://localhost:{port}/health\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.text)"
]
},
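  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `/health` is cheap to call, it works well as a readiness probe. Below is a small polling sketch; `wait_until_healthy` is our own helper for this example, not part of SGLang."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "\n",
    "def wait_until_healthy(base_url, timeout_s=30, interval_s=1):\n",
    "    # poll /health until it returns HTTP 200 or the timeout expires\n",
    "    deadline = time.time() + timeout_s\n",
    "    while time.time() < deadline:\n",
    "        try:\n",
    "            if requests.get(f\"{base_url}/health\", timeout=5).status_code == 200:\n",
    "                return True\n",
    "        except requests.exceptions.RequestException:\n",
    "            pass  # server not reachable yet; retry\n",
    "        time.sleep(interval_s)\n",
    "    return False\n",
    "\n",
    "\n",
    "is_healthy = wait_until_healthy(f\"http://localhost:{port}\")\n",
    "print_highlight(f\"Server healthy: {is_healthy}\")"
   ]
  },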
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Flush Cache\n",
    "\n",
    "Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights_from_disk` API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "url = f\"http://localhost:{port}/flush_cache\"\n",
    "\n",
    "response = requests.post(url)\n",
"print_highlight(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Update Weights From Disk\n",
    "\n",
    "Update model weights from disk without restarting the server. This is only applicable for models with the same architecture and parameter size.\n",
    "\n",
    "SGLang supports the `update_weights_from_disk` API for continuous evaluation during training: save a checkpoint to disk and update the weights from disk.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# successful update with same architecture and size\n",
    "\n",
    "url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
    "data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct\"}\n",
    "\n",
    "response = requests.post(url, json=data)\n",
    "print_highlight(response.text)\n",
    "assert response.json()[\"success\"] is True\n",
    "assert response.json()[\"message\"] == \"Succeeded to update model weights.\""
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# failed update with different parameter size or wrong name\n",
    "\n",
    "url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
    "data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct-wrong\"}\n",
    "\n",
    "response = requests.post(url, json=data)\n",
    "response_json = response.json()\n",
    "print_highlight(response_json)\n",
    "assert response_json[\"success\"] is False\n",
    "assert response_json[\"message\"] == (\n",
    "    \"Failed to get weights iterator: \"\n",
    "    \"qwen/qwen2.5-0.5b-instruct-wrong\"\n",
    "    \" (repository not found).\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Encode (embedding model)\n",
    "\n",
    "Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.ipynb) and will raise an error for generation models.\n",
    "Therefore, we launch a new server to serve an embedding model."
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_process, port = launch_server_cmd(\n",
    "    \"\"\"\n",
    "python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
    "    --host 0.0.0.0 --is-embedding --log-level warning\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# successful encode for embedding model\n",
    "\n",
    "url = f\"http://localhost:{port}/encode\"\n",
    "data = {\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"text\": \"Once upon a time\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"response_json = response.json()\n",
"print_highlight(f\"Text embedding (first 10): {response_json['embedding'][:10]}\")"
]
},
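  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Embeddings are usually compared with cosine similarity. The sketch below encodes a second text through the same endpoint and scores it against the first embedding; `numpy` is an extra dependency used only for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "data = {\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"text\": \"A long time ago\"}\n",
    "second_embedding = np.array(requests.post(url, json=data).json()[\"embedding\"])\n",
    "first_embedding = np.array(response_json[\"embedding\"])\n",
    "\n",
    "# cosine similarity is the dot product of the two L2-normalized vectors\n",
    "cosine = np.dot(first_embedding, second_embedding) / (\n",
    "    np.linalg.norm(first_embedding) * np.linalg.norm(second_embedding)\n",
    ")\n",
    "print_highlight(f\"Cosine similarity: {cosine:.4f}\")"
   ]
  },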
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "terminate_process(embedding_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## v1/rerank (cross encoder rerank model)\n",
"Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross encoder model like [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) with `attention-backend` `triton` and `torch_native`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reranker_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n",
    "    --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# compute rerank scores for query and documents\n",
"\n",
"url = f\"http://localhost:{port}/v1/rerank\"\n",
"data = {\n",
" \"model\": \"BAAI/bge-reranker-v2-m3\",\n",
" \"query\": \"what is panda?\",\n",
" \"documents\": [\n",
" \"hi\",\n",
" \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\",\n",
" ],\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"response_json = response.json()\n",
"for item in response_json:\n",
" print_highlight(f\"Score: {item['score']:.2f} - Document: '{item['document']}'\")"
]
},
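  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Downstream you typically want the documents ordered by relevance. A minimal sketch, assuming each item carries the `score` and `document` fields shown above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# order the documents from most to least relevant to the query\n",
    "ranked = sorted(response_json, key=lambda item: item[\"score\"], reverse=True)\n",
    "best = ranked[0]\n",
    "print_highlight(f\"Most relevant: '{best['document']}' (score {best['score']:.2f})\")"
   ]
  },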
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(reranker_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Classify (reward model)\n",
    "\n",
    "SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations."
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note that SGLang now treats embedding models and reward models as the same type of models.\n",
    "# This will be updated in the future.\n",
    "\n",
    "reward_process, port = launch_server_cmd(\n",
    "    \"\"\"\n",
    "python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
   "metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"PROMPT = (\n",
" \"What is the range of the numeric output of a sigmoid node in a neural network?\"\n",
")\n",
"\n",
"RESPONSE1 = \"The output of a sigmoid node is bounded between -1 and 1.\"\n",
"RESPONSE2 = \"The output of a sigmoid node is bounded between 0 and 1.\"\n",
"\n",
"CONVS = [\n",
" [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE1}],\n",
" [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE2}],\n",
"]\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(\"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\")\n",
"prompts = tokenizer.apply_chat_template(CONVS, tokenize=False)\n",
"\n",
    "url = f\"http://localhost:{port}/classify\"\n",
    "data = {\"model\": \"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\", \"text\": prompts}\n",
"\n",
"responses = requests.post(url, json=data).json()\n",
"for response in responses:\n",
" print_highlight(f\"reward: {response['embedding'][0]}\")"
]
},
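  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common use of pairwise rewards is picking the better of two candidate responses. A minimal sketch on top of the scores returned above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pick the response that the reward model scores higher\n",
    "rewards = [response[\"embedding\"][0] for response in responses]\n",
    "best_index = rewards.index(max(rewards))\n",
    "print_highlight(\n",
    "    f\"Preferred response: RESPONSE{best_index + 1} (reward {rewards[best_index]:.3f})\"\n",
    ")"
   ]
  },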
{
"cell_type": "code",
   "execution_count": null,
   "metadata": {},
"outputs": [],
"source": [
    "terminate_process(reward_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Capture expert selection distribution in MoE models\n",
"\n",
2025-04-08 21:09:26 +02:00
"SGLang Runtime supports recording the number of times an expert is selected in a MoE model run for each expert in the model. This is useful when analyzing the throughput of the model and plan for optimization.\n",
"\n",
"*Note: We only print out the first 10 lines of the csv below for better readability. Please adjust accordingly if you want to analyze the results more deeply.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"expert_record_server_process, port = launch_server_cmd(\n",
    "    \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = requests.post(f\"http://localhost:{port}/start_expert_distribution_record\")\n",
"print_highlight(response)\n",
"\n",
"url = f\"http://localhost:{port}/generate\"\n",
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.json())\n",
"\n",
"response = requests.post(f\"http://localhost:{port}/stop_expert_distribution_record\")\n",
"print_highlight(response)\n",
"\n",
"response = requests.post(f\"http://localhost:{port}/dump_expert_distribution_record\")\n",
    "print_highlight(response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(expert_record_server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenize/Detokenize Example (Round Trip)\n",
"\n",
"This example demonstrates how to use the /tokenize and /detokenize endpoints together. We first tokenize a string, then detokenize the resulting IDs to reconstruct the original text. This workflow is useful when you need to handle tokenization externally but still leverage the server for detokenization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokenizer_free_server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from sglang.utils import print_highlight\n",
"\n",
"base_url = f\"http://localhost:{port}\"\n",
"tokenize_url = f\"{base_url}/tokenize\"\n",
"detokenize_url = f\"{base_url}/detokenize\"\n",
"\n",
"model_name = \"qwen/qwen2.5-0.5b-instruct\"\n",
"input_text = \"SGLang provides efficient tokenization endpoints.\"\n",
"print_highlight(f\"Original Input Text:\\n'{input_text}'\")\n",
"\n",
"# --- tokenize the input text ---\n",
"tokenize_payload = {\n",
" \"model\": model_name,\n",
" \"prompt\": input_text,\n",
" \"add_special_tokens\": False,\n",
"}\n",
"try:\n",
" tokenize_response = requests.post(tokenize_url, json=tokenize_payload)\n",
" tokenize_response.raise_for_status()\n",
" tokenization_result = tokenize_response.json()\n",
" token_ids = tokenization_result.get(\"tokens\")\n",
"\n",
" if not token_ids:\n",
" raise ValueError(\"Tokenization returned empty tokens.\")\n",
"\n",
" print_highlight(f\"\\nTokenized Output (IDs):\\n{token_ids}\")\n",
" print_highlight(f\"Token Count: {tokenization_result.get('count')}\")\n",
" print_highlight(f\"Max Model Length: {tokenization_result.get('max_model_len')}\")\n",
"\n",
" # --- detokenize the obtained token IDs ---\n",
" detokenize_payload = {\n",
" \"model\": model_name,\n",
" \"tokens\": token_ids,\n",
" \"skip_special_tokens\": True,\n",
" }\n",
"\n",
" detokenize_response = requests.post(detokenize_url, json=detokenize_payload)\n",
" detokenize_response.raise_for_status()\n",
" detokenization_result = detokenize_response.json()\n",
" reconstructed_text = detokenization_result.get(\"text\")\n",
"\n",
" print_highlight(f\"\\nDetokenized Output (Text):\\n'{reconstructed_text}'\")\n",
"\n",
" if input_text == reconstructed_text:\n",
" print_highlight(\n",
" \"\\nRound Trip Successful: Original and reconstructed text match.\"\n",
" )\n",
" else:\n",
" print_highlight(\n",
" \"\\nRound Trip Mismatch: Original and reconstructed text differ.\"\n",
" )\n",
"\n",
"except requests.exceptions.RequestException as e:\n",
" print_highlight(f\"\\nHTTP Request Error: {e}\")\n",
"except Exception as e:\n",
" print_highlight(f\"\\nAn error occurred: {e}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(tokenizer_free_server_process)"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
}
},
"nbformat": 4,
 "nbformat_minor": 4
}