{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenAI APIs - Vision\n",
"\n",
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n",
"This tutorial covers the vision APIs for vision language models.\n",
"\n",
"SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, and QWen-VL2 \n",
"- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) \n",
"- [lmms-lab/llava-onevision-qwen2-72b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat) \n",
"- [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server\n",
"\n",
"This code block is equivalent to executing \n",
"\n",
"```bash\n",
"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port 30000 --chat-template llama_3_vision\n",
"```\n",
"in your terminal and wait for the server to be ready.\n",
"\n",
"Remember to add `--chat-template llama_3_vision` to specify the vision chat template, otherwise the server only supports text.\n",
"We need to specify `--chat-template` for vision language models because the chat template provided in Hugging Face tokenizer only supports text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang.utils import (\n",
" execute_shell_command,\n",
" wait_for_server,\n",
" terminate_process,\n",
" print_highlight,\n",
")\n",
"\n",
"embedding_process = execute_shell_command(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port=30000 --chat-template=llama_3_vision\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30000\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using cURL\n",
"\n",
"Once the server is up, you can send test requests using curl or requests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import subprocess\n",
"\n",
"curl_command = \"\"\"\n",
"curl -s http://localhost:30000/v1/chat/completions \\\n",
" -H \"Content-Type: application/json\" \\\n",
" -d '{\n",
" \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" \"messages\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"What’ s in this image?\"\n",
" },\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
" }\n",
" }\n",
" ]\n",
" }\n",
" ],\n",
" \"max_tokens\": 300\n",
" }'\n",
"\"\"\"\n",
"\n",
"response = subprocess.check_output(curl_command, shell=True).decode()\n",
"print_highlight(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Python Requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = \"http://localhost:30000/v1/chat/completions\"\n",
"\n",
"data = {\n",
" \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" \"messages\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"text\", \"text\": \"What’ s in this image?\"},\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
" },\n",
" },\n",
" ],\n",
" }\n",
" ],\n",
" \"max_tokens\": 300,\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Python Client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"What is in this image?\",\n",
" },\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
" },\n",
" },\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=300,\n",
")\n",
"\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple-Image Inputs\n",
"\n",
"The server also supports multiple images and interleaved text and images if the model supports it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\",\n",
" },\n",
" },\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\",\n",
" },\n",
" },\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"I have two very different images. They are not related at all. \"\n",
" \"Please describe the first image in one sentence, and then describe the second image in another sentence.\",\n",
" },\n",
" ],\n",
" }\n",
" ],\n",
" temperature=0,\n",
")\n",
"\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(embedding_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chat Template\n",
"\n",
"As mentioned before, if you do not specify a vision model's `--chat-template`, the server uses Hugging Face's default template, which only supports text.\n",
"\n",
"We list popular vision models with their chat templates:\n",
"\n",
"- [meta-llama/Llama-3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) uses `llama_3_vision`.\n",
"- [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) uses `qwen2-vl`.\n",
"- [LlaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) uses `chatml-llava`.\n",
"- [LLaVA-NeXT](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff) uses `chatml-llava`.\n",
"- [Llama3-LLaVA-NeXT](https://huggingface.co/lmms-lab/llama3-llava-next-8b) uses `llava_llama_3`.\n",
"- [LLaVA-v1.5 / 1.6](https://huggingface.co/liuhaotian/llava-v1.6-34b) uses `vicuna_v1.1`."
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}