[FEATURE] Add OpenAI-Compatible LoRA Adapter Selection (#11570)
@@ -59,6 +59,17 @@
"### Serving Single Adapter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** SGLang supports LoRA adapters through two APIs:\n",
"\n",
"1. **OpenAI-Compatible API** (`/v1/chat/completions`, `/v1/completions`): Use the `model:adapter-name` syntax. See [OpenAI API with LoRA](../basic_usage/openai_api_completions.ipynb#Using-LoRA-Adapters) for examples.\n",
"\n",
"2. **Native API** (`/generate`): Pass `lora_path` in the request body (shown below)."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -379,6 +390,15 @@
"print(f\"Output from lora1 (updated): \\n{response.json()[1]['text']}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### OpenAI-compatible API usage\n",
"\n",
"You can use LoRA adapters via the OpenAI-compatible APIs by specifying the adapter in the `model` field using the `base-model:adapter-name` syntax (for example, `qwen/qwen2.5-0.5b-instruct:adapter_a`). For more details and examples, see the “Using LoRA Adapters” section in the OpenAI API documentation: [openai_api_completions.ipynb](../basic_usage/openai_api_completions.ipynb).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -361,6 +361,50 @@
"For the OpenAI-compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using LoRA Adapters\n",
"\n",
"SGLang supports LoRA (Low-Rank Adaptation) adapters with OpenAI-compatible APIs. You can specify which adapter to use directly in the `model` parameter using the `base-model:adapter-name` syntax.\n",
"\n",
"**Server Setup:**\n",
"```bash\n",
"python -m sglang.launch_server \\\n",
"    --model-path qwen/qwen2.5-0.5b-instruct \\\n",
"    --enable-lora \\\n",
"    --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b\n",
"```\n",
"\n",
"For more details on LoRA serving configuration, see the [LoRA documentation](../advanced_features/lora.ipynb).\n",
"\n",
"**API Call:**\n",
"\n",
"(Recommended) Use the `model:adapter` syntax to specify which adapter to use:\n",
"```python\n",
"response = client.chat.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct:adapter_a\",  # ← base-model:adapter-name\n",
"    messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n",
"    max_tokens=50,\n",
")\n",
"```\n",
"\n",
"**Backward Compatible: Using `extra_body`**\n",
"\n",
"The old `extra_body` method is still supported for backward compatibility:\n",
"```python\n",
"# Backward compatible method\n",
"response = client.chat.completions.create(\n",
"    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
"    messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n",
"    extra_body={\"lora_path\": \"adapter_a\"},  # ← old method\n",
"    max_tokens=50,\n",
")\n",
"```\n",
"**Note:** When both `model:adapter` and `extra_body[\"lora_path\"]` are specified, the `model:adapter` syntax takes precedence."
]
},
{
"cell_type": "code",
"execution_count": null,
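The precedence rule stated in the note above (`model:adapter` wins over `extra_body["lora_path"]`) can be illustrated with a small resolver. This is a hypothetical sketch of the selection logic only, not SGLang's actual server code; the function name `resolve_adapter` is invented for illustration.

```python
# Hypothetical sketch of the adapter-selection precedence described above;
# NOT SGLang's actual implementation, only an illustration of the rule.
from typing import Optional, Tuple


def resolve_adapter(model: str, extra_body: Optional[dict] = None) -> Tuple[str, Optional[str]]:
    """Return (base_model, adapter_name) for an OpenAI-style request."""
    base, sep, adapter = model.partition(":")
    if sep:
        # "base-model:adapter-name" in the model field takes precedence
        return base, adapter
    if extra_body and extra_body.get("lora_path"):
        # backward-compatible fallback via extra_body
        return model, extra_body["lora_path"]
    return model, None
```

With both forms supplied, e.g. `resolve_adapter("qwen/qwen2.5-0.5b-instruct:adapter_a", {"lora_path": "adapter_b"})`, the adapter named in the `model` field is the one selected.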