diff --git a/docs/backend/openai_embedding_api.ipynb b/docs/backend/openai_api_embeddings.ipynb
similarity index 93%
rename from docs/backend/openai_embedding_api.ipynb
rename to docs/backend/openai_api_embeddings.ipynb
index 356a57121..41ed3f775 100644
--- a/docs/backend/openai_embedding_api.ipynb
+++ b/docs/backend/openai_api_embeddings.ipynb
@@ -6,10 +6,12 @@
    "source": [
     "# OpenAI APIs - Embedding\n",
     "\n",
-    "SGLang supports embedding models in the same way as completion models. Here are some example models:\n",
+    "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
+    "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n",
     "\n",
-    "- [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct)\n",
-    "- [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)\n"
+    "This tutorial covers the embedding APIs for embedding models, such as \n",
+    "- [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) \n",
+    "- [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) \n"
    ]
   },
   {
@@ -96,7 +98,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Using OpenAI Compatible API w/ Requests"
+    "## Using Python Requests"
    ]
   },
   {
diff --git a/docs/backend/openai_api_vision.ipynb b/docs/backend/openai_api_vision.ipynb
index 4707a9e65..4a903c401 100644
--- a/docs/backend/openai_api_vision.ipynb
+++ b/docs/backend/openai_api_vision.ipynb
@@ -107,7 +107,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Using OpenAI Compatible API w/ Requests"
+    "## Using Python Requests"
    ]
   },
   {
@@ -150,9 +150,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Using OpenAI Python Client\n",
-    "\n",
-    "Also, you can use the OpenAI Python API library to send requests."
+    "## Using OpenAI Python Client"
    ]
   },
   {
diff --git a/docs/index.rst b/docs/index.rst
index d73ce8ac1..55d3e81be 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -25,7 +25,7 @@ The core features include:
 
    backend/openai_api_completions.ipynb
    backend/openai_api_vision.ipynb
-   backend/openai_embedding_api.ipynb
+   backend/openai_api_embeddings.ipynb
    backend/native_api.ipynb
    backend/backend.md
 
diff --git a/docs/references/sampling_params.md b/docs/references/sampling_params.md
index 78d5193c2..21850e935 100644
--- a/docs/references/sampling_params.md
+++ b/docs/references/sampling_params.md
@@ -177,252 +177,3 @@ print(response.json())
 
 The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`. Streaming is supported in a similar manner as [above](#streaming).
 
-## Performance Implications on Penalties
-
-While you can apply penalties by supplying relevant `sampling_params`, this comes with some drawbacks.
-
-These drawbacks will be applied to every single requests in the same batch, as penalizers also applies in batch.
-
-### Latency
-
-While we try to compute penalty algorithms through CUDA, it is still additional computation on top of the basic sampling logic. For detailed overhead, we recommend you to run your own benchmarks, but you can find samples below to get a glimpse.
-
-### Memory
-
-Since we compute penalty algorithms through CUDA, the logic stores relevant parameters on GPU. This is usually in a scale of `vocab_size` multiplied by `running_requests`.
-
-You can run your own benchmark with desired parameters on your own hardware to make sure it's not OOMing before using.
-
-Tuning `--mem-fraction-static` and/or `--max-running-requests` will help.
-
-### Benchmarks
-
-All the benchmarks below were ran on NVIDIA H100 SXM5.
-
-
-#### Baseline
-
-Measured at [dc9d06d886151707f97d0b78095df9de262fd3c9](https://github.com/sgl-project/sglang/commit/dc9d06d886151707f97d0b78095df9de262fd3c9).
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: inf
-Successful requests: 3000
-Benchmark duration (s): 66.11
-Total input tokens: 378633
-Total generated tokens: 775651
-Total generated tokens (retokenized): 775118
-Request throughput (req/s): 45.38
-Input token throughput (tok/s): 5727.04
-Output token throughput (tok/s): 11732.16
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 40881.94
-Median E2E Latency (ms): 43967.10
---------------Time to First Token----------------
-Mean TTFT (ms): 19884.75
-Median TTFT (ms): 14226.56
-P99 TTFT (ms): 47738.97
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 91.96
-Median TPOT (ms): 90.11
-P99 TPOT (ms): 308.54
---------------Inter-token Latency----------------
-Mean ITL (ms): 174.54
-Median ITL (ms): 58.56
-P99 ITL (ms): 440.18
-==================================================
-```
-
-#### All Together
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-  "frequency_penalty": 1.1,
-  "presence_penalty": 1.1,
-  "repetition_penalty": 0.1,
-  "min_new_tokens": 5
-}'
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: inf
-Successful requests: 3000
-Benchmark duration (s): 78.35
-Total input tokens: 378633
-Total generated tokens: 775651
-Total generated tokens (retokenized): 774756
-Request throughput (req/s): 38.29
-Input token throughput (tok/s): 4832.86
-Output token throughput (tok/s): 9900.39
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 49017.68
-Median E2E Latency (ms): 52825.70
---------------Time to First Token----------------
-Mean TTFT (ms): 23892.60
-Median TTFT (ms): 18895.47
-P99 TTFT (ms): 57426.01
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 114.54
-Median TPOT (ms): 107.27
-P99 TPOT (ms): 293.31
---------------Inter-token Latency----------------
-Mean ITL (ms): 205.68
-Median ITL (ms): 73.97
-P99 ITL (ms): 453.86
-==================================================
-```
-
-#### Frequency Penalty
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-  "frequency_penalty": 1.1
-}'
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: inf
-Successful requests: 3000
-Benchmark duration (s): 72.72
-Total input tokens: 378633
-Total generated tokens: 775651
-Total generated tokens (retokenized): 774955
-Request throughput (req/s): 41.26
-Input token throughput (tok/s): 5206.84
-Output token throughput (tok/s): 10666.51
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 45445.56
-Median E2E Latency (ms): 48960.39
---------------Time to First Token----------------
-Mean TTFT (ms): 22363.16
-Median TTFT (ms): 17125.02
-P99 TTFT (ms): 52920.95
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 104.71
-Median TPOT (ms): 98.30
-P99 TPOT (ms): 268.06
---------------Inter-token Latency----------------
-Mean ITL (ms): 191.60
-Median ITL (ms): 67.83
-P99 ITL (ms): 455.46
-==================================================
-```
-
-#### Presence Penalty
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-  "presence_penalty": 1.1
-}'
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: inf
-Successful requests: 3000
-Benchmark duration (s): 72.04
-Total input tokens: 378633
-Total generated tokens: 775651
-Total generated tokens (retokenized): 775210
-Request throughput (req/s): 41.64
-Input token throughput (tok/s): 5255.98
-Output token throughput (tok/s): 10767.18
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 44926.61
-Median E2E Latency (ms): 48302.88
---------------Time to First Token----------------
-Mean TTFT (ms): 22095.39
-Median TTFT (ms): 16740.93
-P99 TTFT (ms): 52554.03
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 103.54
-Median TPOT (ms): 97.37
-P99 TPOT (ms): 271.86
---------------Inter-token Latency----------------
-Mean ITL (ms): 189.86
-Median ITL (ms): 68.45
-P99 ITL (ms): 447.11
-==================================================
-```
-
-#### Repetition Penalty
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-  "repetition_penalty": 0.1
-}'
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: inf
-Successful requests: 3000
-Benchmark duration (s): 74.54
-Total input tokens: 378633
-Total generated tokens: 775651
-Total generated tokens (retokenized): 766008
-Request throughput (req/s): 40.24
-Input token throughput (tok/s): 5079.36
-Output token throughput (tok/s): 10405.35
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 46530.38
-Median E2E Latency (ms): 50302.65
---------------Time to First Token----------------
-Mean TTFT (ms): 22603.47
-Median TTFT (ms): 17167.08
-P99 TTFT (ms): 54497.85
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 117.59
-Median TPOT (ms): 101.79
-P99 TPOT (ms): 320.04
---------------Inter-token Latency----------------
-Mean ITL (ms): 195.26
-Median ITL (ms): 69.51
-P99 ITL (ms): 433.86
-==================================================
-```
-
-#### Min New Tokens
-
-The min new tokens penalizer computes until generation process reaches given `min_new_tokens`.
-
-Dislike other penalizers, setting this to higher value will have more latency implications.
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-  "min_new_tokens": 5
-}'
-
-============ Serving Benchmark Result ============
-Backend: sglang
-Traffic request rate: inf
-Successful requests: 3000
-Benchmark duration (s): 66.94
-Total input tokens: 378633
-Total generated tokens: 775651
-Total generated tokens (retokenized): 775220
-Request throughput (req/s): 44.81
-Input token throughput (tok/s): 5656.13
-Output token throughput (tok/s): 11586.90
-----------------End-to-End Latency----------------
-Mean E2E Latency (ms): 41888.55
-Median E2E Latency (ms): 45354.16
---------------Time to First Token----------------
-Mean TTFT (ms): 20866.91
-Median TTFT (ms): 16219.79
-P99 TTFT (ms): 49263.91
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 97.05
-Median TPOT (ms): 89.76
-P99 TPOT (ms): 233.50
---------------Inter-token Latency----------------
-Mean ITL (ms): 179.17
-Median ITL (ms): 55.08
-P99 ITL (ms): 409.12
-==================================================
-```
-
diff --git a/docs/start/install.md b/docs/start/install.md
index 1ecd572ce..c203cc4d0 100644
--- a/docs/start/install.md
+++ b/docs/start/install.md
@@ -97,5 +97,5 @@ sky status --endpoint 30000 sglang
 ## Common Notes
 
 - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
-- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
+- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
 - The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. This allows you to build SGLang programs locally and execute them by connecting to the remote backend.
diff --git a/docs/start/send_request.ipynb b/docs/start/send_request.ipynb
index 99c22332f..0d4f7474a 100644
--- a/docs/start/send_request.ipynb
+++ b/docs/start/send_request.ipynb
@@ -5,7 +5,6 @@
    "metadata": {},
    "source": [
     "# Quick Start: Sending Requests\n",
-    "\n",
     "This notebook provides a quick-start guide for using SGLang after installation."
    ]
   },
@@ -14,7 +13,6 @@
    "metadata": {},
    "source": [
     "## Launch a server\n",
-    "\n",
     "This code block is equivalent to executing \n",
     "\n",
     "```bash\n",
@@ -83,7 +81,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Using OpenAI Compatible API w/ Requests"
+    "## Using Python Requests"
    ]
   },
   {
@@ -119,9 +117,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Using OpenAI Python Client\n",
-    "\n",
-    "You can also use the OpenAI Python API library to send requests."
+    "## Using OpenAI Python Client"
    ]
   },
   {
@@ -153,6 +149,41 @@ "print_highlight(response)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Streaming"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import openai\n",
+    "\n",
+    "client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
+    "\n",
+    "# Use stream=True for streaming responses\n",
+    "response = client.chat.completions.create(\n",
+    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+    "    messages=[\n",
+    "        {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
+    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
+    "    ],\n",
+    "    temperature=0,\n",
+    "    max_tokens=64,\n",
+    "    stream=True,\n",
+    ")\n",
+    "\n",
+    "# Handle the streaming output\n",
+    "for chunk in response:\n",
+    "    if chunk.choices[0].delta.content:\n",
+    "        print(chunk.choices[0].delta.content, end='', flush=True)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -184,6 +215,46 @@ "print_highlight(response.json())"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Streaming"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests, json\n",
+    "\n",
+    "response = requests.post(\n",
+    "    \"http://localhost:30000/generate\",\n",
+    "    json={\n",
+    "        \"text\": \"The capital of France is\",\n",
+    "        \"sampling_params\": {\n",
+    "            \"temperature\": 0,\n",
+    "            \"max_new_tokens\": 32,\n",
+    "        },\n",
+    "        \"stream\": True,\n",
+    "    },\n",
+    "    stream=True,\n",
+    ")\n",
+    "\n",
+    "prev = 0\n",
+    "for chunk in response.iter_lines(decode_unicode=False):\n",
+    "    chunk = chunk.decode(\"utf-8\")\n",
+    "    if chunk and chunk.startswith(\"data:\"):\n",
+    "        if chunk == \"data: [DONE]\":\n",
+    "            break\n",
+    "        data = json.loads(chunk[5:].strip(\"\\n\"))\n",
+    "        output = data[\"text\"]\n",
+    "        print(output[prev:], end=\"\", flush=True)\n",
+    "        prev = len(output)"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,