Feat: support cuda graph for LoRA (#4115)

Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
2025-04-29 02:30:44 -04:00
parent 2c3ea29476
commit 8c0cfca87d
13 changed files with 366 additions and 55 deletions
--- a/docs/backend/lora.ipynb
+++ b/docs/backend/lora.ipynb
@@ -77,7 +77,7 @@
    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
    "    --max-loras-per-batch 1 --lora-backend triton \\\n",
-    "    --disable-cuda-graph --disable-radix-cache\n",
+    "    --disable-radix-cache\n",
    "\"\"\"\n",
    ")\n",
    "\n",
@@ -136,7 +136,7 @@
    "    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
    "    lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
    "    --max-loras-per-batch 2 --lora-backend triton \\\n",
-    "    --disable-cuda-graph --disable-radix-cache\n",
+    "    --disable-radix-cache\n",
    "\"\"\"\n",
    ")\n",
    "\n",
@@ -182,7 +182,7 @@
   "source": [
    "## Future Works\n",
    "\n",
-    "The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently Cuda graph and radix attention are not incompatible with LoRA and must be manually disabled. Other features, including Unified Paging, Cutlass backend, and dynamic loading/unloadingm, are still under development."
+    "The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently radix attention is incompatible with LoRA and must be manually disabled. Other features, including Unified Paging, Cutlass backend, and dynamic loading/unloading, are still under development."
   ]
  }
 ],