This commit is contained in:
Lianmin Zheng
2024-11-02 13:26:32 -07:00
committed by GitHub
parent 5a5f18432f
commit be7986e005
7 changed files with 34 additions and 41 deletions


@@ -127,7 +127,7 @@ You can view the full example [here](https://github.com/sgl-project/sglang/tree/
## Supported Models
**Generative Models**
- Llama / Llama 2 / Llama 3 / Llama 3.1
- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL


@@ -5,7 +5,6 @@
"metadata": {},
"source": [
"# Native APIs\n",
"\n",
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce these following APIs:\n",
"\n",
"- `/generate`\n",
@@ -40,7 +39,6 @@
" terminate_process,\n",
" print_highlight,\n",
")\n",
"import subprocess, json\n",
"\n",
"server_process = execute_shell_command(\n",
"\"\"\"\n",
@@ -56,8 +54,7 @@
"metadata": {},
"source": [
"## Generate\n",
"\n",
"Used to generate completion from the model, similar to the `/v1/completions` API in OpenAI. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
]
},
{
@@ -72,7 +69,7 @@
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.text)"
"print_highlight(response.json())"
]
},
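The request above sends only a prompt. As a sketch of how sampling options can ride along with it, the helper below builds a `/generate` request body with a `sampling_params` field; the exact key names follow the sampling-parameters doc linked above and should be treated as assumptions, not a definitive schema.

```python
def build_generate_payload(prompt: str,
                           temperature: float = 0.0,
                           max_new_tokens: int = 32) -> dict:
    # Assumed request shape for the native /generate endpoint:
    # a "text" prompt plus an optional "sampling_params" dict.
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }

if __name__ == "__main__":
    import requests  # requires a server running on localhost:30000

    payload = build_generate_payload("What is the capital of France?")
    response = requests.post("http://localhost:30000/generate", json=payload)
    print(response.json())
```
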
{
@@ -80,8 +77,7 @@
"metadata": {},
"source": [
"## Get Server Args\n",
"\n",
"Used to get the serving args when the server is launched."
"Get the arguments of a server."
]
},
{
@@ -102,7 +98,7 @@
"source": [
"## Get Model Info\n",
"\n",
"Used to get the model info.\n",
"Get the information of the model.\n",
"\n",
"- `model_path`: The path/name of the model.\n",
"- `is_generation`: Whether the model is used as generation model or embedding model."
@@ -120,7 +116,7 @@
"response_json = response.json()\n",
"print_highlight(response_json)\n",
"assert response_json[\"model_path\"] == \"meta-llama/Llama-3.2-1B-Instruct\"\n",
"assert response_json[\"is_generation\"] == True\n",
"assert response_json[\"is_generation\"] is True\n",
"assert response_json.keys() == {\"model_path\", \"is_generation\"}"
]
},
@@ -128,8 +124,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Health and Health Generate\n",
"\n",
"## Health Check\n",
"- `/health`: Check the health of the server.\n",
"- `/health_generate`: Check the health of the server by generating one token."
]
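A common use of these endpoints is waiting for a freshly launched server to come up. The polling helper below is illustrative and not part of SGLang; it takes any `probe` callable, so the same loop works for `/health` or `/health_generate`.

```python
import time

def wait_until_healthy(probe, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `probe()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

if __name__ == "__main__":
    import requests  # probe /health on a running server

    ok = wait_until_healthy(
        lambda: requests.get("http://localhost:30000/health").status_code == 200
    )
    print("healthy" if ok else "timed out")
```
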
@@ -164,7 +159,7 @@
"source": [
"## Flush Cache\n",
"\n",
"Used to flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
"Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
]
},
{
@@ -259,7 +254,7 @@
"source": [
"## Encode\n",
"\n",
"Used to encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
"Therefore, we launch a new server to server an embedding model.\n"
]
},
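A minimal sketch of calling the encode API, assuming the endpoint path is `/encode` and that the request body carries the text to embed; verify both against the server's documentation before relying on them.

```python
def build_encode_payload(text: str) -> dict:
    # Assumed request body for the native encode API: just the text to embed.
    # Check the exact schema against the server's documentation.
    return {"text": text}

if __name__ == "__main__":
    import requests  # requires a server launched with --is-embedding

    response = requests.post(
        "http://localhost:30000/encode",
        json=build_encode_payload("Once upon a time"),
    )
    print(response.json())
```
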


@@ -24,7 +24,7 @@
"\n",
"```bash\n",
"python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
" --port 30010 --host 0.0.0.0 --is-embedding\n",
" --port 30000 --host 0.0.0.0 --is-embedding\n",
"```\n",
"\n",
"Remember to add `--is-embedding` to the command."
@@ -53,11 +53,11 @@
"embedding_process = execute_shell_command(\n",
" \"\"\"\n",
"python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
" --port 30010 --host 0.0.0.0 --is-embedding\n",
" --port 30000 --host 0.0.0.0 --is-embedding\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30010\")"
"wait_for_server(\"http://localhost:30000\")"
]
},
{
@@ -84,7 +84,7 @@
"\n",
"text = \"Once upon a time\"\n",
"\n",
"curl_text = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
"curl_text = f\"\"\"curl -s http://localhost:30000/v1/embeddings \\\n",
" -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": \"{text}\"}}'\"\"\"\n",
"\n",
"text_embedding = json.loads(subprocess.check_output(curl_text, shell=True))[\"data\"][0][\n",
@@ -112,7 +112,7 @@
"text = \"Once upon a time\"\n",
"\n",
"response = requests.post(\n",
" \"http://localhost:30010/v1/embeddings\",\n",
" \"http://localhost:30000/v1/embeddings\",\n",
" json={\n",
" \"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\",\n",
" \"input\": text\n",
@@ -146,7 +146,7 @@
"source": [
"import openai\n",
"\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30010/v1\", api_key=\"None\")\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
"\n",
"# Text embedding example\n",
"response = client.embeddings.create(\n",
@@ -189,7 +189,7 @@
"tokenizer = AutoTokenizer.from_pretrained(\"Alibaba-NLP/gte-Qwen2-7B-instruct\")\n",
"input_ids = tokenizer.encode(text)\n",
"\n",
"curl_ids = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
"curl_ids = f\"\"\"curl -s http://localhost:30000/v1/embeddings \\\n",
" -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": {json.dumps(input_ids)}}}'\"\"\"\n",
"\n",
"input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))[\"data\"][\n",


@@ -26,7 +26,7 @@
"\n",
"```bash\n",
"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port 30010 --chat-template llama_3_vision\n",
" --port 30000 --chat-template llama_3_vision\n",
"```\n",
"in your terminal and wait for the server to be ready.\n",
"\n",
@@ -50,11 +50,11 @@
"embedding_process = execute_shell_command(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port=30010 --chat-template=llama_3_vision\n",
" --port=30000 --chat-template=llama_3_vision\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30010\")"
"wait_for_server(\"http://localhost:30000\")"
]
},
{
@@ -75,7 +75,7 @@
"import subprocess\n",
"\n",
"curl_command = \"\"\"\n",
"curl -s http://localhost:30010/v1/chat/completions \\\n",
"curl -s http://localhost:30000/v1/chat/completions \\\n",
" -d '{\n",
" \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" \"messages\": [\n",
@@ -118,7 +118,7 @@
"source": [
"import requests\n",
"\n",
"url = \"http://localhost:30010/v1/chat/completions\"\n",
"url = \"http://localhost:30000/v1/chat/completions\"\n",
"\n",
"data = {\n",
" \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
@@ -161,7 +161,7 @@
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
@@ -205,7 +205,7 @@
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",


@@ -25,7 +25,7 @@ If you see `decode out of memory happened` occasionally but not frequently, it i
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism.
### Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.
If you see out of memory (OOM) errors, you can try to tune the following parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
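The three flags above can be combined into one launch command. The sketch below builds such a command; the model name and the default values are illustrative starting points, not recommendations.

```python
import shlex

def tuned_launch_args(model_path: str,
                      chunked_prefill_size: int = 4096,
                      max_running_requests: int = 128,
                      mem_fraction_static: float = 0.8) -> list:
    # Flags taken from the tuning advice above; decrease each value
    # further if OOM persists in the corresponding phase.
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--chunked-prefill-size", str(chunked_prefill_size),
        "--max-running-requests", str(max_running_requests),
        "--mem-fraction-static", str(mem_fraction_static),
    ]

if __name__ == "__main__":
    # Print the command instead of launching, so it can be inspected first.
    print(shlex.join(tuned_launch_args("meta-llama/Llama-3.2-1B-Instruct")))
```
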


@@ -2,12 +2,13 @@
This page lists some common errors and tips for fixing them.
## CUDA out of memory
If you see out of memory (OOM) errors, you can try to tune the following parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
## CUDA error: an illegal memory access was encountered
This error may be due to kernel errors or out-of-memory issues.
- If it is a kernel error, it is not easy to fix.
- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9.
## The server hangs
If the server hangs, try disabling some optimizations when launching the server.
- Add `--disable-cuda-graph`.
- Add `--sampling-backend pytorch`.
- If it is a kernel error, it is not easy to fix. Please file an issue on GitHub.
- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the section above to avoid OOM.


@@ -70,7 +70,7 @@
"\n",
"curl_command = \"\"\"\n",
"curl -s http://localhost:30000/v1/chat/completions \\\n",
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is a LLM?\"}]}'\n",
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]}'\n",
"\"\"\"\n",
"\n",
"response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
@@ -104,8 +104,7 @@
"data = {\n",
" \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" \"messages\": [\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
" {\"role\": \"user\", \"content\": \"What is a LLM?\"}\n",
" {\"role\": \"user\", \"content\": \"What is the capital of France?\"}\n",
" ]\n",
"}\n",
"\n",
@@ -140,7 +139,6 @@
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
@@ -170,7 +168,6 @@
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",