Support GPU pinning for LoRA (#8697)
This commit is contained in:
@@ -381,6 +381,78 @@
"print(f\"Output from lora1 (updated): \\n{response.json()[1]['text']}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### LoRA GPU Pinning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another advanced option is to specify adapters as `pinned` during loading. When an adapter is pinned, it is permanently assigned to one of the available GPU pool slots (as configured by `--max-loras-per-batch`) and will not be evicted from GPU memory during runtime. Instead, it remains resident until it is explicitly unloaded.\n",
|
||||
"\n",
|
||||
"This can improve performance in scenarios where the same adapter is frequently used across requests, by avoiding repeated memory transfers and reinitialization overhead. However, since GPU pool slots are limited, pinning adapters reduces the flexibility of the system to dynamically load other adapters on demand. If too many adapters are pinned, it may lead to degraded performance, or in the most extreme case (`Number of pinned adapters == max-loras-per-batch`), halt all unpinned requests. Therefore, currently SGLang limits maximal number of pinned adapters to `max-loras-per-batch - 1` to prevent unexpected starvations. \n",
|
||||
"\n",
|
||||
"In the example below, we unload `lora1` and reload it as a `pinned` adapter:"
|
||||
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = requests.post(\n",
"    url + \"/unload_lora_adapter\",\n",
"    json={\n",
"        \"lora_name\": \"lora1\",\n",
"    },\n",
")\n",
"\n",
"response = requests.post(\n",
"    url + \"/load_lora_adapter\",\n",
"    json={\n",
"        \"lora_name\": \"lora1\",\n",
"        \"lora_path\": lora1,\n",
"        \"pinned\": True,  # Pin the adapter to GPU\n",
"    },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Verify that the result is identical to before:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://127.0.0.1:{port}\"\n",
"json_data = {\n",
"    \"text\": [\n",
"        \"List 3 countries and their capitals.\",\n",
"        \"List 3 countries and their capitals.\",\n",
"    ],\n",
"    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
"    # The first input uses lora0, and the second input uses lora1\n",
"    \"lora_path\": [\"lora0\", \"lora1\"],\n",
"}\n",
"response = requests.post(\n",
"    url + \"/generate\",\n",
"    json=json_data,\n",
")\n",
"print(f\"Output from lora0: \\n{response.json()[0]['text']}\\n\")\n",
"print(f\"Output from lora1 (pinned): \\n{response.json()[1]['text']}\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
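The pinning constraint described in the notebook (at most `max-loras-per-batch - 1` pinned adapters, so at least one slot always remains available for dynamic loading) can be sketched as a small slot-pool model. This is a toy illustration under assumed semantics, not SGLang's actual scheduler; the class and method names here are hypothetical.

```python
class LoraSlotPool:
    """Toy model of a GPU adapter slot pool with pinning.

    Illustrative sketch only; not SGLang's real implementation.
    """

    def __init__(self, max_loras_per_batch: int):
        self.capacity = max_loras_per_batch
        # Reserve at least one slot for unpinned adapters so requests
        # that need dynamic loading are never starved.
        self.max_pinned = max_loras_per_batch - 1
        self.pinned = set()
        self.unpinned = []  # eviction order: oldest first

    def load(self, name: str, pinned: bool = False) -> None:
        if pinned:
            if len(self.pinned) >= self.max_pinned:
                raise RuntimeError(
                    f"cannot pin {name!r}: at most {self.max_pinned} "
                    "pinned adapters allowed"
                )
            self.pinned.add(name)
        else:
            # Evict the oldest unpinned adapter if the pool is full;
            # pinned adapters are never eviction candidates.
            if len(self.pinned) + len(self.unpinned) >= self.capacity:
                self.unpinned.pop(0)
            self.unpinned.append(name)

    def unload(self, name: str) -> None:
        self.pinned.discard(name)
        if name in self.unpinned:
            self.unpinned.remove(name)
```

With `max_loras_per_batch = 2`, one adapter can be pinned; a second pin attempt is rejected, while unpinned adapters keep cycling through the remaining slot.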