Refactor the docs (#9031)
226
docs/basic_usage/deepseek.md
Normal file
@@ -0,0 +1,226 @@
|
||||
# DeepSeek Usage
|
||||
|
||||
SGLang provides many optimizations specifically designed for the DeepSeek models, making it the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended) from Day 0.
|
||||
|
||||
This document outlines current optimizations for DeepSeek.
|
||||
For an overview of the implemented features, see the completed [Roadmap](https://github.com/sgl-project/sglang/issues/2591).
|
||||
|
||||
## Launch DeepSeek V3 with SGLang
|
||||
|
||||
To run DeepSeek V3/R1 models, the requirements are as follows:
|
||||
|
||||
| Weight Type | Configuration |
|------------|-------------------|
| **Full precision FP8**<br>*(recommended)* | 8 x H200 |
| | 8 x MI300X |
| | 2 x 8 x H100/H800/H20 |
| | Xeon 6980P CPU |
| **Full precision BF16** | 2 x 8 x H200 |
| | 2 x 8 x MI300X |
| | 4 x 8 x H100/H800/H20 |
| | 4 x 8 x A100/A800 |
| **Quantized weights (AWQ)** | 8 x H100/H800/H20 |
| | 8 x A100/A800 |
| **Quantized weights (int8)** | 16 x A100/A800 |
| | 32 x L40S |
| | Xeon 6980P CPU |
|
||||
|
||||
<style>
|
||||
.md-typeset__table {
|
||||
width: 100%;
|
||||
}
|
||||
|
||||
.md-typeset__table table {
|
||||
border-collapse: collapse;
|
||||
margin: 1em 0;
|
||||
border: 2px solid var(--md-typeset-table-color);
|
||||
table-layout: fixed;
|
||||
}
|
||||
|
||||
.md-typeset__table th {
|
||||
border: 1px solid var(--md-typeset-table-color);
|
||||
border-bottom: 2px solid var(--md-typeset-table-color);
|
||||
background-color: var(--md-default-bg-color--lighter);
|
||||
padding: 12px;
|
||||
}
|
||||
|
||||
.md-typeset__table td {
|
||||
border: 1px solid var(--md-typeset-table-color);
|
||||
padding: 12px;
|
||||
}
|
||||
|
||||
.md-typeset__table tr:nth-child(2n) {
|
||||
background-color: var(--md-default-bg-color--lightest);
|
||||
}
|
||||
</style>
|
||||
|
||||
Detailed commands for reference:
|
||||
|
||||
- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended)
|
||||
- [8 x MI300X](../platforms/amd_gpu.md#running-deepseek-v3)
|
||||
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
|
||||
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
|
||||
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
|
||||
- [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
|
||||
- [32 x L40S (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
|
||||
- [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1)
|
||||
|
||||
### Download Weights
|
||||
If you encounter errors when starting the server, make sure the weights have finished downloading. We recommend downloading them beforehand, or restarting the server until all weights are downloaded. Refer to the official [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) guide to download the weights.
|
||||
|
||||
### Launch with one node of 8 x H200
|
||||
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch).
|
||||
**Note that DeepSeek V3 is already in FP8**, so it should not be run with quantization arguments such as `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
|
||||
|
||||
### Running examples on Multi-node
|
||||
|
||||
- [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes).
|
||||
|
||||
- [Serving with two H200*8 nodes and docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).
|
||||
|
||||
- [Serving with four A100*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes).
|
||||
|
||||
## Optimizations
|
||||
|
||||
### Multi-head Latent Attention (MLA) Throughput Optimizations
|
||||
|
||||
**Description**: [MLA](https://arxiv.org/pdf/2405.04434) is an innovative attention mechanism introduced by the DeepSeek team, aimed at improving inference efficiency. SGLang has implemented specific optimizations for this, including:
|
||||
|
||||
- **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.
|
||||
|
||||
- **MLA Attention Backends**: SGLang currently supports several optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [CutlassMLA](https://github.com/sgl-project/sglang/pull/5390), **TRTLLM MLA** (optimized for Blackwell architecture), and [Triton](https://github.com/triton-lang/triton) backends. The default, FA3 (FlashAttention3), performs well across a wide range of workloads.
|
||||
|
||||
- **FP8 Quantization**: W8A8 FP8 and KV cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
|
||||
|
||||
- **CUDA Graph & Torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
|
||||
|
||||
- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting the prefix cache into chunks, processing them with multi-head attention, and merging their states. The improvement can be significant when doing chunked prefill on long sequences. Currently, this optimization is only available for the FlashAttention3 backend.
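
To make the Weight Absorption item above concrete, here is a minimal pure-Python sketch of the reordering it relies on. The matrix names (`W_uk`, `latents`, `q`) and the toy sizes are assumptions chosen for illustration; this is not SGLang's kernel code, only a check that both multiplication orders give the same attention scores.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


# Toy sizes (illustrative only): latent dim 2, head dim 3, 4 cached tokens.
W_uk = [[1, 2, 0], [0, 1, 3]]               # key up-projection, (latent_dim x head_dim)
latents = [[1, 0], [2, 1], [0, 3], [1, 1]]  # compressed KV cache, (tokens x latent_dim)
q = [[1, -1, 2]]                            # one query head, (1 x head_dim)

# Naive order: expand every cached latent into a full key, then score.
keys = matmul(latents, W_uk)                # (tokens x head_dim)
naive_scores = matmul(q, transpose(keys))   # (1 x tokens)

# Absorbed order: fold W_uk into the query once, then score against the latents.
q_absorbed = matmul(q, transpose(W_uk))     # (1 x latent_dim)
absorbed_scores = matmul(q_absorbed, transpose(latents))

print(naive_scores, absorbed_scores)
```

In the absorbed order, the per-token work during decoding touches only the small latent vectors, which is the trade-off between computation and memory access that the item describes.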
|
||||
|
||||
Overall, with these optimizations, we have achieved up to **7x** acceleration in output throughput compared to the previous version.
|
||||
|
||||
<p align="center">
|
||||
<img src="https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg" alt="Multi-head Latent Attention for DeepSeek Series Models">
|
||||
</p>
|
||||
|
||||
**Usage**: MLA optimization is enabled by default. For MLA models on Blackwell architecture (e.g., B200), the default backend is FlashInfer. To use the optimized TRTLLM MLA backend for decode operations, explicitly specify `--attention-backend trtllm_mla`. Note that TRTLLM MLA only optimizes decode operations - prefill operations (including multimodal inputs) will fall back to FlashInfer MLA.
|
||||
|
||||
**Reference**: Check [Blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [Slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/lmsys_1st_meetup_deepseek_mla.pdf) for more details.
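
As a companion to the Chunked Prefix Cache item in this section, the sketch below shows one common way to merge per-chunk attention states: each chunk returns its output together with the log-sum-exp of its scores, and the merge reweights the outputs accordingly. It is a pure-Python illustration under that assumption; the function names `attend` and `merge_states` are hypothetical and do not refer to SGLang internals.

```python
import math


def attend(scores, values):
    """Softmax attention for one query over one chunk.

    Returns the chunk output and the log-sum-exp (LSE) of its scores.
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge_states(states):
    """Combine per-chunk (output, lse) pairs into the full-attention output."""
    m = max(lse for _, lse in states)
    weights = [math.exp(lse - m) for _, lse in states]
    total = sum(weights)
    dim = len(states[0][0])
    return [
        sum(w * out[d] for w, (out, _) in zip(weights, states)) / total
        for d in range(dim)
    ]


scores = [0.5, -1.0, 2.0, 0.1, 1.5, -0.3]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3], [4.0, -2.0]]

chunk = 2  # chunk length, chosen arbitrarily for the demo
states = [
    attend(scores[i:i + chunk], values[i:i + chunk])
    for i in range(0, len(scores), chunk)
]
merged = merge_states(states)
full, _ = attend(scores, values)
print(merged, full)
```

The merged result agrees with attention computed over the whole prefix at once, up to floating-point rounding, which is why the prefix can be processed chunk by chunk.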
|
||||
|
||||
### Data Parallelism Attention
|
||||
|
||||
**Description**: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer. If you do not use DP attention, KV cache will be duplicated among all TP ranks.
|
||||
|
||||
<p align="center">
|
||||
<img src="https://lmsys.org/images/blog/sglang_v0_4/dp_attention.svg" alt="Data Parallelism Attention for DeepSeek Series Models">
|
||||
</p>
|
||||
|
||||
With data parallelism attention enabled, we have achieved up to **1.9x** decoding throughput improvement compared to the previous version.
|
||||
|
||||
<p align="center">
|
||||
<img src="https://lmsys.org/images/blog/sglang_v0_4/deepseek_coder_v2.svg" alt="Data Parallelism Attention Performance Comparison">
|
||||
</p>
|
||||
|
||||
**Usage**:
|
||||
- Append `--enable-dp-attention --tp 8 --dp 8` to the server arguments when using 8 H200 GPUs. This optimization improves peak throughput in high batch size scenarios where the server is limited by KV cache capacity. However, it is not recommended for low-latency, small-batch use cases.
|
||||
- DP and TP attention can be flexibly combined. For example, to deploy DeepSeek-V3/R1 on 2 nodes with 8 H100 GPUs each, you can specify `--enable-dp-attention --tp 16 --dp 2`. This configuration runs attention with 2 DP groups, each containing 8 TP GPUs.
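
The relationship between `--tp` and `--dp` in the two examples above can be sketched as a small helper. This is an illustration inferred from the configurations described here, not SGLang's implementation; the function name `attention_layout` is hypothetical, and the rule that `--tp` must be divisible by `--dp` is an assumption.

```python
def attention_layout(tp, dp):
    """Return (number of DP groups, TP GPUs per group) for the attention layers.

    Assumes the total GPU count equals `tp` and that `dp` divides it evenly.
    """
    if tp % dp != 0:
        raise ValueError("tp must be divisible by dp")
    return dp, tp // dp


# 2 nodes x 8 H100: 2 DP groups, each running attention across 8 TP GPUs.
print(attention_layout(16, 2))
# 8 x H200 with --tp 8 --dp 8: every GPU forms its own attention DP group.
print(attention_layout(8, 8))
```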
|
||||
|
||||
**Reference**: Check [Blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models).
|
||||
|
||||
### Multi Node Tensor Parallelism
|
||||
|
||||
**Description**: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory.
|
||||
|
||||
**Usage**: Check [here](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) for usage examples.
|
||||
|
||||
### Block-wise FP8
|
||||
|
||||
**Description**: SGLang implements block-wise FP8 quantization with the following key optimizations:
|
||||
|
||||
- **Activation**: E4M3 format using per-token-per-128-channel sub-vector scales with online casting.
|
||||
|
||||
- **Weight**: Per-128x128-block quantization for better numerical stability.
|
||||
|
||||
- **DeepGEMM**: The [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) kernel library optimized for FP8 matrix multiplications.
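
The two scale granularities listed above can be illustrated with a short pure-Python sketch that only computes the scaling factors (it does not perform the FP8 cast itself). It assumes the E4M3 variant whose largest finite value is 448, and it uses a block size of 2 instead of 128 so that the demo stays small; the function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value, assuming the e4m3fn variant


def weight_block_scales(weight, block=128):
    """One scale per (block x block) tile of a 2-D weight matrix."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row_scales.append(amax / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


def activation_group_scales(activations, group=128):
    """One scale per token and per group of `group` consecutive channels."""
    return [
        [
            max(abs(x) for x in token[c0:c0 + group]) / FP8_E4M3_MAX
            for c0 in range(0, len(token), group)
        ]
        for token in activations
    ]


# Demo with block/group size 2 (the text above uses 128).
weight = [[(r * 6 + c) - 10 for c in range(6)] for r in range(4)]  # 4 x 6
acts = [[1.0, -4.0, 0.5, 2.0], [0.0, 3.0, -8.0, 1.0]]              # 2 tokens x 4 channels

w_scales = weight_block_scales(weight, block=2)     # 2 x 3 grid of scales
a_scales = activation_group_scales(acts, group=2)   # 2 tokens x 2 groups
print(w_scales)
print(a_scales)
```

Dividing each tile or group by its own scale maps its largest magnitude onto the FP8 range, so a single outlier only affects the precision of its own block.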
|
||||
|
||||
**Usage**: The activation and weight optimization above are turned on by default for DeepSeek V3 models. DeepGEMM is enabled by default on NVIDIA Hopper GPUs and disabled by default on other devices. DeepGEMM can also be manually turned off by setting the environment variable `SGL_ENABLE_JIT_DEEPGEMM=0`.
|
||||
|
||||
Before serving the DeepSeek model, precompile the DeepGEMM kernels using:
|
||||
```bash
python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```
|
||||
The precompilation process typically takes around 10 minutes to complete.
|
||||
|
||||
### Multi-token Prediction
|
||||
**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, decoding speed improves by **1.8x** at batch size 1 and by **1.5x** at batch size 32 in an H200 TP8 setting.
|
||||
|
||||
**Usage**:
|
||||
Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
|
||||
```bash
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --trust-remote-code --tp 8
```
|
||||
- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes.
|
||||
- The FlashAttention3, FlashMLA, and Triton backends fully support MTP. For the FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding, the `--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA and TRTLLM MLA backends is still under development.
|
||||
- To enable DeepSeek MTP for large batch sizes (>32), some parameters should be changed (see [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
  - Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP. For larger batch sizes, increase this value beyond the default.
  - Set `--cuda-graph-bs`. It is a list of batch sizes for CUDA graph capture. The default captured batch sizes for speculative decoding are set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can add more batch sizes to it.
|
||||
|
||||
|
||||
### Reasoning Content for DeepSeek R1
|
||||
|
||||
See [Separate Reasoning](https://docs.sglang.ai/backend/separate_reasoning.html).
|
||||
|
||||
|
||||
### Function calling for DeepSeek Models
|
||||
|
||||
Add the arguments `--tool-call-parser deepseekv3` and `--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja` (recommended) to enable this feature. For example (running on one H20 node):
|
||||
|
||||
```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --port 30000 --host 0.0.0.0 --mem-fraction-static 0.9 --tool-call-parser deepseekv3 --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja
```
|
||||
|
||||
Sample Request:
|
||||
|
||||
```bash
curl "http://127.0.0.1:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324", "tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of an city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "Hows the weather like in Qingdao today"}]}'
```
|
||||
|
||||
Expected Response
|
||||
|
||||
```
{"id":"6501ef8e2d874006bf555bc80cddc7c5","object":"chat.completion","created":1745993638,"model":"deepseek-ai/DeepSeek-V3-0324","choices":[{"index":0,"message":{"role":"assistant","content":null,"reasoning_content":null,"tool_calls":[{"id":"0","index":null,"type":"function","function":{"name":"query_weather","arguments":"{\"city\": \"Qingdao\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","matched_stop":null}],"usage":{"prompt_tokens":116,"total_tokens":138,"completion_tokens":22,"prompt_tokens_details":null}}
```
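
On the client side, the `arguments` field of each tool call arrives as a JSON-encoded string and has to be decoded before the tool can be invoked. The sketch below parses a trimmed copy of the expected response and dispatches it to a local function. The `query_weather` implementation and the `TOOLS` registry are hypothetical placeholders, not part of SGLang.

```python
import json

# Trimmed copy of the expected response (only the fields used here).
response_text = r'''{"choices":[{"index":0,"message":{"role":"assistant","content":null,"tool_calls":[{"id":"0","type":"function","function":{"name":"query_weather","arguments":"{\"city\": \"Qingdao\"}"}}]},"finish_reason":"tool_calls"}]}'''


def query_weather(city):
    """Hypothetical local implementation of the tool declared in the request."""
    return f"(placeholder) weather lookup for {city}"


# Hypothetical registry mapping tool names to local callables.
TOOLS = {"query_weather": query_weather}

message = json.loads(response_text)["choices"][0]["message"]
results = []
for call in message.get("tool_calls") or []:
    function = call["function"]
    kwargs = json.loads(function["arguments"])  # decode the JSON-encoded string
    results.append(TOOLS[function["name"]](**kwargs))

print(results)
```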
|
||||
Sample Streaming Request:
|
||||
```bash
curl "http://127.0.0.1:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324","stream":true,"tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of an city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "Hows the weather like in Qingdao today"}]}'
```
|
||||
Expected Streamed Chunks (simplified for clarity):
|
||||
```
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"city"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\":\""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"Q"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ing"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"dao"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\"}"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":null}}], "finish_reason": "tool_calls"}
data: [DONE]
```
|
||||
The client needs to concatenate all `arguments` fragments to reconstruct the complete tool call:
|
||||
```
{"city": "Qingdao"}
```
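
The concatenation step can be sketched in a few lines of Python. The example replays the simplified chunks shown above from a list instead of reading a live HTTP stream, and it assumes a single tool call per response; a client that handles several parallel tool calls would need to group fragments per call. The helper name `collect_arguments` is hypothetical.

```python
import json

# The simplified streamed chunks, replayed as plain strings.
sse_lines = [
    r'data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\""}}]}}]}',
    r'data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"city"}}]}}]}',
    r'data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\":\""}}]}}]}',
    r'data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"Q"}}]}}]}',
    r'data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ing"}}]}}]}',
    r'data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"dao"}}]}}]}',
    r'data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\"}"}}]}}]}',
    r'data: {"choices":[{"delta":{"tool_calls":null}}], "finish_reason": "tool_calls"}',
    'data: [DONE]',
]


def collect_arguments(lines):
    """Join the `arguments` fragments of a single streamed tool call."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and other non-data lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # `tool_calls` is null in the final chunk, hence the `or []`.
            for call in choice.get("delta", {}).get("tool_calls") or []:
                parts.append(call.get("function", {}).get("arguments") or "")
    return "".join(parts)


raw_arguments = collect_arguments(sse_lines)
arguments = json.loads(raw_arguments)
print(arguments)
```

The joined string is only valid JSON once the last fragment has arrived, so it is decoded after the stream ends rather than chunk by chunk.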
|
||||
Important Notes:
|
||||
1. Use a lower `"temperature"` value for better results.
|
||||
2. To receive more consistent tool call results, it is recommended to use `--chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja`. It provides an improved unified prompt.
|
||||
|
||||
|
||||
## FAQ
|
||||
|
||||
**Q: Model loading is taking too long, and I'm encountering an NCCL timeout. What should I do?**
|
||||
|
||||
A: If you're experiencing extended model loading times and an NCCL timeout, you can try increasing the timeout duration. Add the argument `--dist-timeout 3600` when launching your model. This will set the timeout to one hour, which often resolves the issue.
|
||||
3
docs/basic_usage/gpt_oss.md
Normal file
@@ -0,0 +1,3 @@
|
||||
# GPT OSS Usage
|
||||
|
||||
Please refer to [https://github.com/sgl-project/sglang/issues/8833](https://github.com/sgl-project/sglang/issues/8833).
|
||||
61
docs/basic_usage/llama4.md
Normal file
@@ -0,0 +1,61 @@
|
||||
# Llama4 Usage
|
||||
|
||||
[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) is Meta's latest generation of open-source LLMs, with industry-leading performance.
|
||||
|
||||
SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5).
|
||||
|
||||
Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118).
|
||||
|
||||
## Launch Llama 4 with SGLang
|
||||
|
||||
To serve Llama 4 models on 8xH100/H200 GPUs:
|
||||
|
||||
```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --tp 8 --context-length 1000000
```
|
||||
|
||||
### Configuration Tips
|
||||
|
||||
- **OOM Mitigation**: Adjust `--context-length` to avoid a GPU out-of-memory issue. For the Scout model, we recommend setting this value up to 1M on 8\*H100 and up to 2.5M on 8\*H200. For the Maverick model, we don't need to set context length on 8\*H200. When hybrid kv cache is enabled, `--context-length` can be set up to 5M on 8\*H100 and up to 10M on 8\*H200 for the Scout model.
|
||||
|
||||
- **Chat Template**: Add `--chat-template llama-4` for chat completion tasks.
|
||||
- **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities.
|
||||
- **Enable Hybrid-KVCache**: Add `--hybrid-kvcache-ratio` for hybrid KV cache. See [this PR](https://github.com/sgl-project/sglang/pull/6563) for details.
|
||||
|
||||
|
||||
### EAGLE Speculative Decoding
|
||||
**Description**: SGLang has supported Llama 4 Maverick (400B) with [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding).
|
||||
|
||||
**Usage**:
|
||||
Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
|
||||
```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --tp 8 --context-length 1000000
```
|
||||
|
||||
- **Note**: The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* can only recognize conversations in chat mode.
|
||||
|
||||
## Benchmarking Results
|
||||
|
||||
### Accuracy Test with `lm_eval`
|
||||
|
||||
SGLang's accuracy for both Llama 4 Scout and Llama 4 Maverick matches the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).
|
||||
|
||||
Benchmark results on MMLU Pro dataset with 8*H100:
|
||||
| | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct |
|--------------------|--------------------------------|-------------------------------------|
| Official Benchmark | 74.3 | 80.5 |
| SGLang | 75.2 | 80.7 |
|
||||
|
||||
Commands:
|
||||
|
||||
```bash
# Llama-4-Scout-17B-16E-Instruct model
python -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0

# Llama-4-Maverick-17B-128E-Instruct
python -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
```
|
||||
|
||||
Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/5092).
|
||||
495
docs/basic_usage/native_api.ipynb
Normal file
@@ -0,0 +1,495 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# SGLang Native APIs\n",
|
||||
"\n",
|
||||
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:\n",
|
||||
"\n",
|
||||
"- `/generate` (text generation model)\n",
|
||||
"- `/get_model_info`\n",
|
||||
"- `/get_server_info`\n",
|
||||
"- `/health`\n",
|
||||
"- `/health_generate`\n",
|
||||
"- `/flush_cache`\n",
|
||||
"- `/update_weights_from_disk`\n",
|
||||
"- `/encode` (embedding model)\n",
"- `/v1/rerank` (cross encoder rerank model)\n",
"- `/classify` (reward model)\n",
|
||||
"- `/start_expert_distribution_record`\n",
|
||||
"- `/stop_expert_distribution_record`\n",
|
||||
"- `/dump_expert_distribution_record`\n",
|
||||
"- A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py)\n",
|
||||
"\n",
|
||||
"We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Launch A Server"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Generate (text generation model)\n",
|
||||
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](sampling_params.md)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/generate\"\n",
|
||||
"data = {\"text\": \"What is the capital of France?\"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"print_highlight(response.json())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Get Model Info\n",
|
||||
"\n",
|
||||
"Get the information of the model.\n",
|
||||
"\n",
|
||||
"- `model_path`: The path/name of the model.\n",
|
||||
"- `is_generation`: Whether the model is used as a generation model or an embedding model.\n",
|
||||
"- `tokenizer_path`: The path/name of the tokenizer.\n",
|
||||
"- `preferred_sampling_params`: The default sampling params specified via `--preferred-sampling-params`. `None` is returned in this example as we did not explicitly configure it in server args."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"url = f\"http://localhost:{port}/get_model_info\"\n",
|
||||
"\n",
|
||||
"response = requests.get(url)\n",
|
||||
"response_json = response.json()\n",
|
||||
"print_highlight(response_json)\n",
|
||||
"assert response_json[\"model_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
|
||||
"assert response_json[\"is_generation\"] is True\n",
|
||||
"assert response_json[\"tokenizer_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
|
||||
"assert response_json[\"preferred_sampling_params\"] is None\n",
|
||||
"assert response_json.keys() == {\n",
|
||||
" \"model_path\",\n",
|
||||
" \"is_generation\",\n",
|
||||
" \"tokenizer_path\",\n",
|
||||
" \"preferred_sampling_params\",\n",
|
||||
"}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Get Server Info\n",
|
||||
"Gets the server information including CLI arguments, token limits, and memory pool sizes.\n",
|
||||
"- Note: `get_server_info` merges the following deprecated endpoints:\n",
|
||||
" - `get_server_args`\n",
|
||||
" - `get_memory_pool_size` \n",
|
||||
" - `get_max_total_num_tokens`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"url = f\"http://localhost:{port}/get_server_info\"\n",
|
||||
"\n",
|
||||
"response = requests.get(url)\n",
|
||||
"print_highlight(response.text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Health Check\n",
|
||||
"- `/health`: Check the health of the server.\n",
|
||||
"- `/health_generate`: Check the health of the server by generating one token."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"url = f\"http://localhost:{port}/health_generate\"\n",
|
||||
"\n",
|
||||
"response = requests.get(url)\n",
|
||||
"print_highlight(response.text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"url = f\"http://localhost:{port}/health\"\n",
|
||||
"\n",
|
||||
"response = requests.get(url)\n",
|
||||
"print_highlight(response.text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Flush Cache\n",
|
||||
"\n",
|
||||
"Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights_from_disk` API."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"url = f\"http://localhost:{port}/flush_cache\"\n",
|
||||
"\n",
|
||||
"response = requests.post(url)\n",
|
||||
"print_highlight(response.text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Update Weights From Disk\n",
|
||||
"\n",
|
||||
"Update model weights from disk without restarting the server. Only applicable for models with the same architecture and parameter size.\n",
|
||||
"\n",
|
||||
"SGLang supports the `update_weights_from_disk` API for continuous evaluation during training (save a checkpoint to disk, then update the weights from disk).\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# successful update with same architecture and size\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
|
||||
"data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct\"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"print_highlight(response.text)\n",
|
||||
"assert response.json()[\"success\"] is True\n",
|
||||
"assert response.json()[\"message\"] == \"Succeeded to update model weights.\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# failed update with different parameter size or wrong name\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
|
||||
"data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct-wrong\"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"response_json = response.json()\n",
|
||||
"print_highlight(response_json)\n",
|
||||
"assert response_json[\"success\"] is False\n",
|
||||
"assert response_json[\"message\"] == (\n",
|
||||
" \"Failed to get weights iterator: \"\n",
|
||||
" \"qwen/qwen2.5-0.5b-instruct-wrong\"\n",
|
||||
" \" (repository not found).\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Encode (embedding model)\n",
|
||||
"\n",
|
||||
"Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.ipynb) and will raise an error for generation models.\n",
|
||||
"Therefore, we launch a new server to serve an embedding model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"embedding_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
|
||||
" --host 0.0.0.0 --is-embedding\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# successful encode for embedding model\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/encode\"\n",
|
||||
"data = {\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"text\": \"Once upon a time\"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"response_json = response.json()\n",
|
||||
"print_highlight(f\"Text embedding (first 10): {response_json['embedding'][:10]}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(embedding_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## v1/rerank (cross encoder rerank model)\n",
|
||||
"Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross encoder model like [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) with `attention-backend` `triton` and `torch_native`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"reranker_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n",
|
||||
" --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# compute rerank scores for query and documents\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/v1/rerank\"\n",
|
||||
"data = {\n",
|
||||
" \"model\": \"BAAI/bge-reranker-v2-m3\",\n",
|
||||
" \"query\": \"what is panda?\",\n",
|
||||
" \"documents\": [\n",
|
||||
" \"hi\",\n",
|
||||
" \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\",\n",
|
||||
" ],\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"response_json = response.json()\n",
|
||||
"for item in response_json:\n",
|
||||
" print_highlight(f\"Score: {item['score']:.2f} - Document: '{item['document']}'\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(reranker_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Classify (reward model)\n",
|
||||
"\n",
|
||||
"SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Note that SGLang now treats embedding models and reward models as the same type of models.\n",
|
||||
"# This will be updated in the future.\n",
|
||||
"\n",
|
||||
"reward_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from transformers import AutoTokenizer\n",
|
||||
"\n",
|
||||
"PROMPT = (\n",
|
||||
" \"What is the range of the numeric output of a sigmoid node in a neural network?\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"RESPONSE1 = \"The output of a sigmoid node is bounded between -1 and 1.\"\n",
|
||||
"RESPONSE2 = \"The output of a sigmoid node is bounded between 0 and 1.\"\n",
|
||||
"\n",
|
||||
"CONVS = [\n",
|
||||
" [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE1}],\n",
|
||||
" [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE2}],\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"tokenizer = AutoTokenizer.from_pretrained(\"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\")\n",
|
||||
"prompts = tokenizer.apply_chat_template(CONVS, tokenize=False)\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/classify\"\n",
|
||||
"data = {\"model\": \"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\", \"text\": prompts}\n",
|
||||
"\n",
|
||||
"responses = requests.post(url, json=data).json()\n",
|
||||
"for response in responses:\n",
|
||||
" print_highlight(f\"reward: {response['embedding'][0]}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(reward_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Capture expert selection distribution in MoE models\n",
|
||||
"\n",
|
||||
"SGLang Runtime supports recording the number of times an expert is selected in a MoE model run for each expert in the model. This is useful when analyzing the throughput of the model and plan for optimization.\n",
|
||||
"\n",
|
||||
"*Note: We only print out the first 10 lines of the csv below for better readability. Please adjust accordingly if you want to analyze the results more deeply.*"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"expert_record_server_process, port = launch_server_cmd(\n",
|
||||
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"response = requests.post(f\"http://localhost:{port}/start_expert_distribution_record\")\n",
|
||||
"print_highlight(response)\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/generate\"\n",
|
||||
"data = {\"text\": \"What is the capital of France?\"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"print_highlight(response.json())\n",
|
||||
"\n",
|
||||
"response = requests.post(f\"http://localhost:{port}/stop_expert_distribution_record\")\n",
|
||||
"print_highlight(response)\n",
|
||||
"\n",
|
||||
"response = requests.post(f\"http://localhost:{port}/dump_expert_distribution_record\")\n",
|
||||
"print_highlight(response)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(expert_record_server_process)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
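The `/v1/rerank` example in the notebook above prints results in whatever order the server returns them. As a minimal sketch, assuming the response is a list of objects with `score` and `document` fields (as that cell's loop implies), the results could be sorted on the client side. The helper name `rank_documents` and the sample scores are illustrative assumptions, not part of SGLang.

```python
from typing import Any, Dict, List, Optional


def rank_documents(
    items: List[Dict[str, Any]], top_k: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Sort rerank results by descending score (hypothetical helper)."""
    ranked = sorted(items, key=lambda item: item["score"], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Made-up scores standing in for a real /v1/rerank response.
sample_response = [
    {"score": -8.2, "document": "hi"},
    {"score": 5.3, "document": "The giant panda is a bear species endemic to China."},
]

best = rank_documents(sample_response, top_k=1)[0]
print(f"Top document: {best['document']}")
```

With a running reranker server, `response_json` from the notebook cell could be passed in place of `sample_response`.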
|
||||
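The `/classify` example in the notebook above prints one reward per conversation. A minimal sketch of choosing the preferred response, assuming each returned item carries its scalar reward at `item["embedding"][0]` as in that cell. The helper `pick_best` and the sample rewards are hypothetical and only serve to make the snippet runnable without a server.

```python
from typing import Any, Dict, List, Sequence, Tuple


def pick_best(
    candidates: Sequence[str], results: List[Dict[str, Any]]
) -> Tuple[str, float]:
    """Return the candidate with the highest reward (hypothetical helper)."""
    if len(candidates) != len(results):
        raise ValueError("Expected one classify result per candidate")
    rewards = [float(result["embedding"][0]) for result in results]
    best_index = max(range(len(rewards)), key=rewards.__getitem__)
    return candidates[best_index], rewards[best_index]


candidates = [
    "The output of a sigmoid node is bounded between -1 and 1.",
    "The output of a sigmoid node is bounded between 0 and 1.",
]
# Made-up rewards standing in for a real /classify response.
sample_results = [{"embedding": [-24.1]}, {"embedding": [1.2]}]

text, reward = pick_best(candidates, sample_results)
print(f"Preferred response (reward {reward:.2f}): {text}")
```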
235
docs/basic_usage/offline_engine_api.ipynb
Normal file
@@ -0,0 +1,235 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Offline Engine API\n",
|
||||
"\n",
|
||||
"SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:\n",
|
||||
"\n",
|
||||
"- Offline Batch Inference\n",
|
||||
"- Custom Server on Top of the Engine\n",
|
||||
"\n",
|
||||
"This document focuses on the offline batch inference, demonstrating four different inference modes:\n",
|
||||
"\n",
|
||||
"- Non-streaming synchronous generation\n",
|
||||
"- Streaming synchronous generation\n",
|
||||
"- Non-streaming asynchronous generation\n",
|
||||
"- Streaming asynchronous generation\n",
|
||||
"\n",
|
||||
"Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example working in a python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Nest Asyncio\n",
|
||||
"Note that if you want to use **Offline Engine** in ipython or some other nested loop code, you need to add the following code:\n",
|
||||
"```python\n",
|
||||
"import nest_asyncio\n",
|
||||
"\n",
|
||||
"nest_asyncio.apply()\n",
|
||||
"\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Advanced Usage\n",
|
||||
"\n",
|
||||
"The engine supports [vlm inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states). \n",
|
||||
"\n",
|
||||
"Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Offline Batch Inference\n",
|
||||
"\n",
|
||||
"SGLang offline engine supports batch inference with efficient scheduling."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# launch the offline engine\n",
|
||||
"import asyncio\n",
|
||||
"\n",
|
||||
"import sglang as sgl\n",
|
||||
"import sglang.test.doc_patch\n",
|
||||
"from sglang.utils import async_stream_and_merge, stream_and_merge\n",
|
||||
"\n",
|
||||
"llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Non-streaming Synchronous Generation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prompts = [\n",
|
||||
" \"Hello, my name is\",\n",
|
||||
" \"The president of the United States is\",\n",
|
||||
" \"The capital of France is\",\n",
|
||||
" \"The future of AI is\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
|
||||
"\n",
|
||||
"outputs = llm.generate(prompts, sampling_params)\n",
|
||||
"for prompt, output in zip(prompts, outputs):\n",
|
||||
" print(\"===============================\")\n",
|
||||
" print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Streaming Synchronous Generation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prompts = [\n",
|
||||
" \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
|
||||
" \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
|
||||
" \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"sampling_params = {\n",
|
||||
" \"temperature\": 0.2,\n",
|
||||
" \"top_p\": 0.9,\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"print(\"\\n=== Testing synchronous streaming generation with overlap removal ===\\n\")\n",
|
||||
"\n",
|
||||
"for prompt in prompts:\n",
|
||||
" print(f\"Prompt: {prompt}\")\n",
|
||||
" merged_output = stream_and_merge(llm, prompt, sampling_params)\n",
|
||||
" print(\"Generated text:\", merged_output)\n",
|
||||
" print()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Non-streaming Asynchronous Generation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prompts = [\n",
|
||||
" \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
|
||||
" \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
|
||||
" \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
|
||||
"\n",
|
||||
"print(\"\\n=== Testing asynchronous batch generation ===\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"async def main():\n",
|
||||
" outputs = await llm.async_generate(prompts, sampling_params)\n",
|
||||
"\n",
|
||||
" for prompt, output in zip(prompts, outputs):\n",
|
||||
" print(f\"\\nPrompt: {prompt}\")\n",
|
||||
" print(f\"Generated text: {output['text']}\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"asyncio.run(main())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Streaming Asynchronous Generation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prompts = [\n",
|
||||
" \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
|
||||
" \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
|
||||
" \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
|
||||
"\n",
|
||||
"print(\"\\n=== Testing asynchronous streaming generation (no repeats) ===\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"async def main():\n",
|
||||
" for prompt in prompts:\n",
|
||||
" print(f\"\\nPrompt: {prompt}\")\n",
|
||||
" print(\"Generated text: \", end=\"\", flush=True)\n",
|
||||
"\n",
|
||||
" # Replace direct calls to async_generate with our custom overlap-aware version\n",
|
||||
" async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):\n",
|
||||
" print(cleaned_chunk, end=\"\", flush=True)\n",
|
||||
"\n",
|
||||
" print() # New line after each prompt\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"asyncio.run(main())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llm.shutdown()"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
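The streaming cells in the notebook above rely on `stream_and_merge` and `async_stream_and_merge` to print text without repeated fragments. As a rough illustration of what overlap removal between consecutive chunks can look like (a sketch of the general technique, not necessarily how SGLang implements it), the helpers below drop the part of each new chunk that the accumulated text already ends with. The names `strip_overlap` and `merge_chunks` and the sample chunks are chosen for illustration only.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing already ends with."""
    for size in range(min(len(existing), len(new_chunk)), 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while removing overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks whose boundaries overlap.
print(merge_chunks(["Hello wor", "world, this", " this is a test"]))
```

One caveat of this approach is that it cannot tell an overlap apart from text the model genuinely repeated at a chunk boundary, so it suits streams that are known to resend trailing text.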
|
||||
9
docs/basic_usage/openai_api.rst
Normal file
@@ -0,0 +1,9 @@
|
||||
OpenAI-Compatible APIs
|
||||
======================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
openai_api_completions.ipynb
|
||||
openai_api_vision.ipynb
|
||||
openai_api_embeddings.ipynb
|
||||
311
docs/basic_usage/openai_api_completions.ipynb
Normal file
@@ -0,0 +1,311 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# OpenAI APIs - Completions\n",
|
||||
"\n",
|
||||
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
|
||||
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n",
|
||||
"\n",
|
||||
"This tutorial covers the following popular APIs:\n",
|
||||
"\n",
|
||||
"- `chat/completions`\n",
|
||||
"- `completions`\n",
|
||||
"\n",
|
||||
"Check out other tutorials to learn about [vision APIs](openai_api_vision.ipynb) for vision-language models and [embedding APIs](openai_api_embeddings.ipynb) for embedding models."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Launch A Server\n",
|
||||
"\n",
|
||||
"Launch the server in your terminal and wait for it to initialize."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")\n",
|
||||
"print(f\"Server started on http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Chat Completions\n",
|
||||
"\n",
|
||||
"### Usage\n",
|
||||
"\n",
|
||||
"The server fully implements the OpenAI API.\n",
|
||||
"It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n",
|
||||
"You can also specify a custom chat template with `--chat-template` when launching the server."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import openai\n",
|
||||
"\n",
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Parameters\n",
|
||||
"\n",
|
||||
"The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
|
||||
"\n",
|
||||
"SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\n",
|
||||
" \"role\": \"system\",\n",
|
||||
" \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
|
||||
" },\n",
|
||||
" {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
|
||||
" {\n",
|
||||
" \"role\": \"assistant\",\n",
|
||||
" \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
|
||||
" },\n",
|
||||
" {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0.3, # Lower temperature for more focused responses\n",
|
||||
" max_tokens=128, # Reasonable length for a concise response\n",
|
||||
" top_p=0.95, # Slightly higher for better fluency\n",
|
||||
" presence_penalty=0.2, # Mild penalty to avoid repetition\n",
|
||||
" frequency_penalty=0.2, # Mild penalty for more natural language\n",
|
||||
" n=1, # Single response is usually more stable\n",
|
||||
" seed=42, # Keep for reproducibility\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(response.choices[0].message.content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Streaming mode is also supported."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"stream = client.chat.completions.create(\n",
|
||||
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
|
||||
" stream=True,\n",
|
||||
")\n",
|
||||
"for chunk in stream:\n",
|
||||
" if chunk.choices[0].delta.content is not None:\n",
|
||||
" print(chunk.choices[0].delta.content, end=\"\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Enabling Model Thinking/Reasoning\n",
|
||||
"\n",
|
||||
"You can use `chat_template_kwargs` to enable or disable the model's internal thinking or reasoning process output. Set `\"enable_thinking\": True` within `chat_template_kwargs` to include the reasoning steps in the response. This requires launching the server with a compatible reasoning parser.\n",
|
||||
"\n",
|
||||
"**Reasoning Parser Options:**\n",
|
||||
"- `--reasoning-parser deepseek-r1`: For DeepSeek-R1 family models (R1, R1-0528, R1-Distill)\n",
|
||||
"- `--reasoning-parser qwen3`: For both standard Qwen3 models that support `enable_thinking` parameter and Qwen3-Thinking models\n",
|
||||
"- `--reasoning-parser qwen3-thinking`: For Qwen3-Thinking models, force reasoning version of qwen3 parser\n",
|
||||
"- `--reasoning-parser kimi`: For Kimi thinking models\n",
|
||||
"\n",
|
||||
"Here's an example demonstrating how to enable thinking and retrieve the reasoning content separately (using `separate_reasoning: True`):\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"# For Qwen3 models with enable_thinking support:\n",
|
||||
"# python3 -m sglang.launch_server --model-path QwQ/Qwen3-32B-250415 --reasoning-parser qwen3 ...\n",
|
||||
"\n",
|
||||
"from openai import OpenAI\n",
|
||||
"\n",
|
||||
"# Modify OpenAI's API key and API base to use SGLang's API server.\n",
|
||||
"openai_api_key = \"EMPTY\"\n",
|
||||
"openai_api_base = f\"http://127.0.0.1:{port}/v1\" # Use the correct port\n",
|
||||
"\n",
|
||||
"client = OpenAI(\n",
|
||||
" api_key=openai_api_key,\n",
|
||||
" base_url=openai_api_base,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"model = \"QwQ/Qwen3-32B-250415\" # Use the model loaded by the server\n",
|
||||
"messages = [{\"role\": \"user\", \"content\": \"9.11 and 9.8, which is greater?\"}]\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=model,\n",
|
||||
" messages=messages,\n",
|
||||
" extra_body={\n",
|
||||
" \"chat_template_kwargs\": {\"enable_thinking\": True},\n",
|
||||
" \"separate_reasoning\": True\n",
|
||||
" }\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(\"response.choices[0].message.reasoning_content: \\n\", response.choices[0].message.reasoning_content)\n",
|
||||
"print(\"response.choices[0].message.content: \\n\", response.choices[0].message.content)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"**Example Output:**\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"response.choices[0].message.reasoning_content: \n",
|
||||
" Okay, so I need to figure out which number is greater between 9.11 and 9.8. Hmm, let me think. Both numbers start with 9, right? So the whole number part is the same. That means I need to look at the decimal parts to determine which one is bigger.\n",
|
||||
"...\n",
|
||||
"Therefore, after checking multiple methods—aligning decimals, subtracting, converting to fractions, and using a real-world analogy—it's clear that 9.8 is greater than 9.11.\n",
|
||||
"\n",
|
||||
"response.choices[0].message.content: \n",
|
||||
" To determine which number is greater between **9.11** and **9.8**, follow these steps:\n",
|
||||
"...\n",
|
||||
"**Answer**: \n",
|
||||
"9.8 is greater than 9.11.\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Setting `\"enable_thinking\": False` (or omitting it) will result in `reasoning_content` being `None`.\n",
|
||||
"\n",
|
||||
"**Note for Qwen3-Thinking models:** These models always generate thinking content and do not support the `enable_thinking` parameter. Use `--reasoning-parser qwen3-thinking` or `--reasoning-parser qwen3` to parse the thinking content.\n",
|
||||
"\n",
|
||||
"Here is an example of a detailed chat completion request using standard OpenAI parameters:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Completions\n",
|
||||
"\n",
|
||||
"### Usage\n",
|
||||
"Completions API is similar to Chat Completions API, but without the `messages` parameter or chat templates."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"response = client.completions.create(\n",
|
||||
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" prompt=\"List 3 countries and their capitals.\",\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
" n=1,\n",
|
||||
" stop=None,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Parameters\n",
|
||||
"\n",
|
||||
"The completions API accepts OpenAI Completions API's parameters. Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n",
|
||||
"\n",
|
||||
"Here is an example of a detailed completions request:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"response = client.completions.create(\n",
|
||||
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" prompt=\"Write a short story about a space explorer.\",\n",
|
||||
" temperature=0.7, # Moderate temperature for creative writing\n",
|
||||
" max_tokens=150, # Longer response for a story\n",
|
||||
" top_p=0.9, # Balanced diversity in word choice\n",
|
||||
" stop=[\"\\n\\n\", \"THE END\"], # Multiple stop sequences\n",
|
||||
" presence_penalty=0.3, # Encourage novel elements\n",
|
||||
" frequency_penalty=0.3, # Reduce repetitive phrases\n",
|
||||
" n=1, # Generate one completion\n",
|
||||
" seed=123, # For reproducible results\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Structured Outputs (JSON, Regex, EBNF)\n",
|
||||
"\n",
|
||||
"For OpenAI compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
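The streaming chat completion cell in the notebook above prints each delta as it arrives. When the full text is needed afterwards, the deltas can be accumulated. The following is a minimal sketch, assuming chunks expose `choices[0].delta.content` as in that cell. `collect_stream` and `fake_chunk` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins so that the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def collect_stream(stream: Iterable[Any]) -> str:
    """Join the non-empty content deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # Skip chunks that carry no choices.
        content = chunk.choices[0].delta.content
        if content is not None:
            parts.append(content)
    return "".join(parts)


def fake_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object shaped like a streamed chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("This is"), fake_chunk(None), fake_chunk(" a test")]
print(collect_stream(fake_stream))
```

With a running server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` could be passed in place of `fake_stream`.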
|
||||
195
docs/basic_usage/openai_api_embeddings.ipynb
Normal file
@@ -0,0 +1,195 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# OpenAI APIs - Embedding\n",
|
||||
"\n",
|
||||
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
|
||||
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n",
|
||||
"\n",
|
||||
"This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](../supported_models/embedding_models.md)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Launch A Server\n",
|
||||
"\n",
|
||||
"Launch the server in your terminal and wait for it to initialize. Remember to add `--is-embedding` to the command."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"embedding_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
|
||||
" --host 0.0.0.0 --is-embedding\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using cURL"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import subprocess, json\n",
|
||||
"\n",
|
||||
"text = \"Once upon a time\"\n",
|
||||
"\n",
|
||||
"curl_text = f\"\"\"curl -s http://localhost:{port}/v1/embeddings \\\n",
|
||||
" -H \"Content-Type: application/json\" \\\n",
|
||||
" -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": \"{text}\"}}'\"\"\"\n",
|
||||
"\n",
|
||||
"result = subprocess.check_output(curl_text, shell=True)\n",
|
||||
"\n",
|
||||
"print(result)\n",
|
||||
"\n",
|
||||
"text_embedding = json.loads(result)[\"data\"][0][\"embedding\"]\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Text embedding (first 10): {text_embedding[:10]}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Python Requests"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"text = \"Once upon a time\"\n",
|
||||
"\n",
|
||||
"response = requests.post(\n",
|
||||
" f\"http://localhost:{port}/v1/embeddings\",\n",
|
||||
" json={\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": text},\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"text_embedding = response.json()[\"data\"][0][\"embedding\"]\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Text embedding (first 10): {text_embedding[:10]}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using OpenAI Python Client"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import openai\n",
|
||||
"\n",
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"# Text embedding example\n",
|
||||
"response = client.embeddings.create(\n",
|
||||
" model=\"Alibaba-NLP/gte-Qwen2-1.5B-instruct\",\n",
|
||||
" input=text,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"embedding = response.data[0].embedding[:10]\n",
|
||||
"print_highlight(f\"Text embedding (first 10): {embedding}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Input IDs\n",
|
||||
"\n",
|
||||
"SGLang also supports `input_ids` as input to get the embedding."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"import os\n",
|
||||
"from transformers import AutoTokenizer\n",
|
||||
"\n",
|
||||
"os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
|
||||
"\n",
|
||||
"tokenizer = AutoTokenizer.from_pretrained(\"Alibaba-NLP/gte-Qwen2-1.5B-instruct\")\n",
|
||||
"input_ids = tokenizer.encode(text)\n",
|
||||
"\n",
|
||||
"curl_ids = f\"\"\"curl -s http://localhost:{port}/v1/embeddings \\\n",
|
||||
" -H \"Content-Type: application/json\" \\\n",
|
||||
" -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": {json.dumps(input_ids)}}}'\"\"\"\n",
|
||||
"\n",
|
||||
"input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))[\"data\"][\n",
|
||||
" 0\n",
|
||||
"][\"embedding\"]\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Input IDs embedding (first 10): {input_ids_embedding[:10]}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(embedding_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Multi-Modal Embedding Model\n",
|
||||
"Please refer to [Multi-Modal Embedding Model](../supported_models/embedding_models.md)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
254
docs/basic_usage/openai_api_vision.ipynb
Normal file
@@ -0,0 +1,254 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# OpenAI APIs - Vision\n",
|
||||
"\n",
|
||||
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
|
||||
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n",
|
||||
"This tutorial covers the vision APIs for vision language models.\n",
|
||||
"\n",
|
||||
"SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/multimodal_language_models.md).\n",
|
||||
"\n",
|
||||
"As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Launch A Server\n",
|
||||
"\n",
|
||||
"Launch the server in your terminal and wait for it to initialize."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"vision_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using cURL\n",
|
||||
"\n",
|
||||
"Once the server is up, you can send test requests using curl or requests."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import subprocess\n",
|
||||
"\n",
|
||||
"curl_command = f\"\"\"\n",
|
||||
"curl -s http://localhost:{port}/v1/chat/completions \\\\\n",
|
||||
" -H \"Content-Type: application/json\" \\\\\n",
|
||||
" -d '{{\n",
|
||||
" \"model\": \"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
|
||||
" \"messages\": [\n",
|
||||
" {{\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": [\n",
|
||||
" {{\n",
|
||||
" \"type\": \"text\",\n",
|
||||
" \"text\": \"What’s in this image?\"\n",
|
||||
" }},\n",
|
||||
" {{\n",
|
||||
" \"type\": \"image_url\",\n",
|
||||
" \"image_url\": {{\n",
|
||||
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
|
||||
" }}\n",
|
||||
" }}\n",
|
||||
" ]\n",
|
||||
" }}\n",
|
||||
" ],\n",
|
||||
" \"max_tokens\": 300\n",
|
||||
" }}'\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"response = subprocess.check_output(curl_command, shell=True).decode()\n",
|
||||
"print_highlight(response)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Python Requests"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/v1/chat/completions\"\n",
|
||||
"\n",
|
||||
"data = {\n",
|
||||
" \"model\": \"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
|
||||
" \"messages\": [\n",
|
||||
" {\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": [\n",
|
||||
" {\"type\": \"text\", \"text\": \"What’s in this image?\"},\n",
|
||||
" {\n",
|
||||
" \"type\": \"image_url\",\n",
|
||||
" \"image_url\": {\n",
|
||||
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
|
||||
" },\n",
|
||||
" },\n",
|
||||
" ],\n",
|
||||
" }\n",
|
||||
" ],\n",
|
||||
" \"max_tokens\": 300,\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"print_highlight(response.text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using OpenAI Python Client"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from openai import OpenAI\n",
|
||||
"\n",
|
||||
"client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": [\n",
|
||||
" {\n",
|
||||
" \"type\": \"text\",\n",
|
||||
" \"text\": \"What is in this image?\",\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"type\": \"image_url\",\n",
|
||||
" \"image_url\": {\n",
|
||||
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
|
||||
" },\n",
|
||||
" },\n",
|
||||
" ],\n",
|
||||
" }\n",
|
||||
" ],\n",
|
||||
" max_tokens=300,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(response.choices[0].message.content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Multiple-Image Inputs\n",
|
||||
"\n",
|
||||
"The server also supports multiple images and interleaved text and images if the model supports it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from openai import OpenAI\n",
|
||||
"\n",
|
||||
"client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": [\n",
|
||||
" {\n",
|
||||
" \"type\": \"image_url\",\n",
|
||||
" \"image_url\": {\n",
|
||||
" \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\",\n",
|
||||
" },\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"type\": \"image_url\",\n",
|
||||
" \"image_url\": {\n",
|
||||
" \"url\": \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\",\n",
|
||||
" },\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"type\": \"text\",\n",
|
||||
" \"text\": \"I have two very different images. They are not related at all. \"\n",
|
||||
" \"Please describe the first image in one sentence, and then describe the second image in another sentence.\",\n",
|
||||
" },\n",
|
||||
" ],\n",
|
||||
" }\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(response.choices[0].message.content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(vision_process)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
305
docs/basic_usage/sampling_params.md
Normal file
@@ -0,0 +1,305 @@
|
||||
# Sampling Parameters
|
||||
|
||||
This document describes the sampling parameters of the SGLang Runtime, which are passed to its low-level `/generate` endpoint.
|
||||
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](./openai_api_completions.ipynb).
|
||||
|
||||
## `/generate` Endpoint
|
||||
|
||||
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.
|
||||
|
||||
| Argument | Type/Default | Description |
|
||||
|----------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| text | `Optional[Union[List[str], str]] = None` | The input prompt. Can be a single prompt or a batch of prompts. |
|
||||
| input_ids | `Optional[Union[List[List[int]], List[int]]] = None` | The token IDs for text; one can specify either text or input_ids. |
|
||||
| input_embeds | `Optional[Union[List[List[List[float]]], List[List[float]]]] = None` | The embeddings for input_ids; one can specify either text, input_ids, or input_embeds. |
|
||||
| image_data | `Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None` | The image input. Can be an image instance, file name, URL, or base64 encoded string. Can be a single image, list of images, or list of lists of images. |
|
||||
| audio_data | `Optional[Union[List[AudioDataItem], AudioDataItem]] = None` | The audio input. Can be a file name, URL, or base64 encoded string. |
|
||||
| sampling_params | `Optional[Union[List[Dict], Dict]] = None` | The sampling parameters as described in the sections below. |
|
||||
| rid | `Optional[Union[List[str], str]] = None` | The request ID. |
|
||||
| return_logprob | `Optional[Union[List[bool], bool]] = None` | Whether to return log probabilities for tokens. |
|
||||
| logprob_start_len | `Optional[Union[List[int], int]] = None` | If return_logprob, the start location in the prompt for returning logprobs. Default is "-1", which returns logprobs for output tokens only. |
|
||||
| top_logprobs_num | `Optional[Union[List[int], int]] = None` | If return_logprob, the number of top logprobs to return at each position. |
|
||||
| token_ids_logprob | `Optional[Union[List[List[int]], List[int]]] = None` | If return_logprob, the token IDs to return logprob for. |
|
||||
| return_text_in_logprobs    | `bool = False`                                                               | Whether to include the detokenized text of each token in the returned logprobs.                                                                                |
|
||||
| stream | `bool = False` | Whether to stream output. |
|
||||
| lora_path | `Optional[Union[List[Optional[str]], Optional[str]]] = None` | The path to the LoRA. |
|
||||
| custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None` | Custom logit processor for advanced sampling control. Must be a serialized instance of `CustomLogitProcessor` using its `to_str()` method. For usage see below. |
|
||||
| return_hidden_states | `Union[List[bool], bool] = False` | Whether to return hidden states. |
|
||||
|
||||
## Sampling parameters
|
||||
|
||||
The object is defined at `sampling_params.py::SamplingParams`. You can also read the source code to find more arguments and docs.
|
||||
|
||||
### Core parameters
|
||||
|
||||
| Argument | Type/Default | Description |
|
||||
|-----------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| max_new_tokens | `int = 128` | The maximum output length measured in tokens. |
|
||||
| stop | `Optional[Union[str, List[str]]] = None` | One or multiple [stop words](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop). Generation will stop if one of these words is sampled. |
|
||||
| stop_token_ids | `Optional[List[int]] = None` | Provide stop words in the form of token IDs. Generation will stop if one of these token IDs is sampled. |
|
||||
| temperature     | `float = 1.0`                                | [Temperature](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) when sampling the next token. `temperature = 0` corresponds to greedy sampling; a higher temperature leads to more diversity. |
|
||||
| top_p | `float = 1.0` | [Top-p](https://platform.openai.com/docs/api-reference/chat/create#chat-create-top_p) selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens. |
|
||||
| top_k | `int = -1` | [Top-k](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) randomly selects from the `k` highest-probability tokens. |
|
||||
| min_p | `float = 0.0` | [Min-p](https://github.com/huggingface/transformers/issues/27670) samples from tokens with probability larger than `min_p * highest_token_probability`. |
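
To make the interaction of `top_k`, `top_p`, and `min_p` concrete, here is a minimal pure-Python sketch that applies the three filters to a toy distribution in the way the table describes them. The function name `filter_candidates`, the order in which the filters run, and the handling of ties (`>=` comparisons) are assumptions made for illustration; the SGLang runtime has its own sampling implementation and may differ in such details.

```python
def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the token ids that survive top-k, top-p, and min-p filtering.

    `probs` is a list of probabilities indexed by token id (toy input).
    """
    # Sort token ids by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most likely tokens (-1 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # min_p: drop tokens whose probability is below min_p * highest probability.
    threshold = min_p * probs[order[0]]
    return [i for i in kept if probs[i] >= threshold]


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_candidates(probs, top_k=3))    # the three most likely tokens
print(filter_candidates(probs, top_p=0.7))  # smallest set covering 70% of the mass
print(filter_candidates(probs, min_p=0.5))  # tokens with p >= 0.5 * 0.5
```

With the defaults (`top_k=-1`, `top_p=1.0`, `min_p=0.0`) every token is kept, which matches the "unrestricted sampling" behavior described in the table.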
|
||||
|
||||
### Penalizers
|
||||
|
||||
| Argument | Type/Default | Description |
|
||||
|--------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| frequency_penalty  | `float = 0.0`          | Penalizes tokens based on their frequency in the generation so far. Must be between `-2` and `2`; negative values encourage repetition of tokens and positive values encourage sampling of new tokens. The penalty grows linearly with each appearance of a token. |
|
||||
| presence_penalty   | `float = 0.0`          | Penalizes tokens that have appeared in the generation so far. Must be between `-2` and `2`; negative values encourage repetition of tokens and positive values encourage sampling of new tokens. The penalty is constant once a token has occurred. |
|
||||
| min_new_tokens     | `int = 0`              | Forces the model to generate at least `min_new_tokens` tokens before a stop word or EOS token can end generation. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens. |
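
The difference between the two penalties is easiest to see on a toy example. The sketch below uses the additive formulation commonly documented for OpenAI-style APIs (subtract `frequency_penalty * count` plus a flat `presence_penalty` from the logit of every token that has already appeared). The helper name `apply_penalties` is hypothetical, and the assumption that SGLang applies the penalties in exactly this form is not confirmed by this document.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of `logits` (a list indexed by token id)."""
    counts = Counter(generated_ids)
    penalized = list(logits)
    for token_id, count in counts.items():
        # The frequency penalty scales with the number of appearances.
        penalized[token_id] -= frequency_penalty * count
        # The presence penalty is applied once if the token appeared at all.
        penalized[token_id] -= presence_penalty
    return penalized


logits = [2.0, 1.0, 0.5]
history = [0, 0, 1]  # token 0 appeared twice, token 1 once, token 2 never
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.25))
```

Token 0 is penalized more than token 1 by the frequency term because it appeared twice, while the presence term lowers both by the same amount; token 2 is untouched.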
|
||||
|
||||
### Constrained decoding
|
||||
|
||||
Please refer to our dedicated guide on [constrained decoding](../advanced_features/structured_outputs.ipynb) for the following parameters.
|
||||
|
||||
| Argument | Type/Default | Description |
|
||||
|-----------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| json_schema | `Optional[str] = None` | JSON schema for structured outputs. |
|
||||
| regex | `Optional[str] = None` | Regex for structured outputs. |
|
||||
| ebnf | `Optional[str] = None` | EBNF for structured outputs. |
|
||||
| structural_tag  | `Optional[str] = None`          | The structural tag for structured outputs.                                                                                                        |
|
||||
|
||||
### Other options
|
||||
|
||||
| Argument | Type/Default | Description |
|
||||
|-------------------------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| n                             | `int = 1`                       | Specifies the number of output sequences to generate per request. Generating multiple outputs in one request (`n > 1`) is discouraged; repeating the same prompt several times offers better control and efficiency. |
|
||||
| ignore_eos | `bool = False` | Don't stop generation when EOS token is sampled. |
|
||||
| skip_special_tokens | `bool = True` | Remove special tokens during decoding. |
|
||||
| spaces_between_special_tokens | `bool = True` | Whether or not to add spaces between special tokens during detokenization. |
|
||||
| no_stop_trim | `bool = False` | Don't trim stop words or EOS token from the generated text. |
|
||||
| custom_params | `Optional[List[Optional[Dict[str, Any]]]] = None` | Used when employing `CustomLogitProcessor`. For usage, see below. |
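
Since `n > 1` is discouraged, one way to obtain several samples is to repeat the prompt in a batch, relying on the fact that `text` accepts a list of prompts. The following is a small sketch of how such a request body might be built; the helper name `repeat_prompt_request` and its default values are assumptions for illustration only.

```python
def repeat_prompt_request(prompt, num_samples, temperature=0.8, max_new_tokens=32):
    """Build a `/generate` body that requests several samples of one prompt
    by repeating it in a batch instead of setting `n`."""
    return {
        "text": [prompt] * num_samples,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }


body = repeat_prompt_request("The capital of France is", num_samples=3)
print(len(body["text"]))
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, as in the examples further down. A non-zero temperature is used here because greedy decoding would return the same text for every copy of the prompt.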
|
||||
|
||||
## Examples
|
||||
|
||||
### Normal
|
||||
|
||||
Launch a server:
|
||||
|
||||
```bash
|
||||
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
|
||||
```
|
||||
|
||||
Send a request:
|
||||
|
||||
```python
|
||||
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
|
||||
```
|
||||
|
||||
Detailed example in [send request](./send_request.ipynb).
|
||||
|
||||
### Streaming
|
||||
|
||||
Send a request and stream the output:
|
||||
|
||||
```python
|
||||
import json

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
|
||||
```
|
||||
|
||||
Detailed example in [openai compatible api](https://docs.sglang.ai/backend/openai_api_completions.html#id2).
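
The incremental-printing logic of the streaming example can be factored into a small generator, which makes it testable without a running server. This is a sketch under the same assumption the example above makes, namely that every `data:` line carries the full text generated so far; the function name `iter_text_deltas` and the simulated stream are made up for illustration.

```python
import json


def iter_text_deltas(lines):
    """Yield the newly generated text from each `data:` line of a stream.

    `lines` is any iterable of decoded strings, for example the decoded
    output of `response.iter_lines()`.
    """
    prev = 0
    for line in lines:
        if not line or not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        text = json.loads(payload)["text"]
        yield text[prev:]  # emit only the part not seen before
        prev = len(text)


# Simulated stream; real chunks come from the server.
fake_stream = [
    'data: {"text": " Paris"}',
    "",
    'data: {"text": " Paris, a city"}',
    "data: [DONE]",
]
print("".join(iter_text_deltas(fake_stream)))
```

Keeping the parsing separate from the HTTP call also makes it straightforward to reuse the same loop for multimodal or constrained requests.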
|
||||
|
||||
### Multimodal
|
||||
|
||||
Launch a server:
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov
|
||||
```
|
||||
|
||||
Download an image:
|
||||
|
||||
```bash
|
||||
curl -o example_image.png -L "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
|
||||
```
|
||||
|
||||
Send a request:
|
||||
|
||||
```python
|
||||
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
        "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
|
||||
```
|
||||
|
||||
The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
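
When the image is not reachable from the server by file name or URL, it can be sent inline as a base64 string. The sketch below shows one way to prepare such a request body with the standard library. The helper names `image_to_base64` and `build_image_request` are hypothetical, and whether the server expects a bare base64 string or a `data:` URI should be checked against `load_image`.

```python
import base64
from pathlib import Path


def image_to_base64(path):
    """Read an image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def build_image_request(prompt, image_b64, max_new_tokens=32):
    """Assemble a JSON body for `/generate` from the fields documented above."""
    return {
        "text": prompt,
        "image_data": image_b64,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
```

A typical call would be `build_image_request(prompt, image_to_base64("example_image.png"))`, with the result passed as the `json=` argument of `requests.post`.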
|
||||
|
||||
Streaming is supported in a similar manner as [above](#streaming).
|
||||
|
||||
Detailed example in [openai api vision](./openai_api_vision.ipynb).
|
||||
|
||||
### Structured Outputs (JSON, Regex, EBNF)
|
||||
|
||||
You can specify a JSON schema, a regular expression, or an [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) grammar to constrain the model output. The model output is guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.
|
||||
|
||||
SGLang supports two grammar backends:
|
||||
|
||||
- [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints.
|
||||
  - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
|
||||
- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.
|
||||
|
||||
To use the Outlines backend instead of the default, pass the `--grammar-backend outlines` flag:
|
||||
|
||||
```bash
|
||||
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
|
||||
--port 30000 --host 0.0.0.0 --grammar-backend outlines  # omit the flag, or pass xgrammar, to use the default backend
|
||||
```
|
||||
|
||||
```python
|
||||
import json

import requests

json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

# JSON (works with both Outlines and XGrammar)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression (works with both Outlines and XGrammar)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())

# EBNF (XGrammar backend only)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a greeting.",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "ebnf": 'root ::= "Hello" | "Hi" | "Hey"',
        },
    },
)
print(response.json())
|
||||
```
|
||||
|
||||
Detailed example in [structured outputs](../advanced_features/structured_outputs.ipynb).
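
On the client side it can still be useful to parse and sanity-check the constrained output, for instance when `max_new_tokens` cuts the JSON short. The following sketch checks a generated string against the example schema by hand, using only the standard library. The function name `check_capital_record` is hypothetical, and the sample input is a hand-written string with a placeholder population, not real model output.

```python
import json
import re


def check_capital_record(generated_text):
    """Parse generated JSON and verify it against the example schema by hand."""
    record = json.loads(generated_text)
    # Both fields are listed under "required" in the schema.
    for field in ("name", "population"):
        if field not in record:
            raise ValueError(f"missing required field: {field}")
    # "name" must be a string made of word characters only (pattern ^[\w]+$).
    if not isinstance(record["name"], str) or not re.fullmatch(r"\w+", record["name"]):
        raise ValueError("name does not match the schema pattern")
    # "population" must be an integer; bool is excluded explicitly.
    if not isinstance(record["population"], int) or isinstance(record["population"], bool):
        raise ValueError("population is not an integer")
    return record


print(check_capital_record('{"name": "Paris", "population": 1}'))
```

A truncated output makes `json.loads` raise `json.JSONDecodeError`, which is a subclass of `ValueError`, so a single `except ValueError` covers both malformed and schema-violating responses.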
|
||||
|
||||
### Custom logit processor
|
||||
|
||||
Launch a server with the `--enable-custom-logit-processor` flag enabled.
|
||||
|
||||
```bash
|
||||
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-custom-logit-processor
|
||||
```
|
||||
|
||||
Define a custom logit processor that will always sample a specific token id.
|
||||
|
||||
```python
|
||||
from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor


class DeterministicLogitProcessor(CustomLogitProcessor):
    """A dummy logit processor that changes the logits to always
    sample the given token id.
    """

    def __call__(self, logits, custom_param_list):
        # Check that the number of logits matches the number of custom parameters
        assert logits.shape[0] == len(custom_param_list)
        key = "token_id"

        for i, param_dict in enumerate(custom_param_list):
            # Mask all other tokens
            logits[i, :] = -float("inf")
            # Assign highest probability to the specified token
            logits[i, param_dict[key]] = 0.0
        return logits
|
||||
```
|
||||
|
||||
Send a request:
|
||||
|
||||
```python
|
||||
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": 32,
            "custom_params": {"token_id": 5},
        },
    },
)
print(response.json())
|
||||
```
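
To see why masking the logits forces a single token, the toy sketch below reproduces the processor's effect on one row of logits with plain Python lists instead of tensors. The helper names `force_token` and `softmax` and the logit values are made up for illustration and are not part of SGLang.

```python
import math


def force_token(logits_row, token_id):
    """Mask every logit to -inf except `token_id`, which is set to 0.0."""
    masked = [-math.inf] * len(logits_row)
    masked[token_id] = 0.0
    return masked


def softmax(row):
    """Convert logits to probabilities; exp(-inf) evaluates to 0.0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax(force_token([1.2, -0.3, 0.8, 2.5], token_id=2)))
```

All probability mass ends up on the chosen token, so the outcome is the same regardless of temperature or of the other sampling parameters.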
|
||||
253
docs/basic_usage/send_request.ipynb
Normal file
@@ -0,0 +1,253 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Sending Requests\n",
|
||||
"This notebook provides a quick-start guide to use SGLang in chat completions after installation.\n",
|
||||
"\n",
|
||||
"- For Vision Language Models, see [OpenAI APIs - Vision](openai_api_vision.ipynb).\n",
|
||||
"- For Embedding Models, see [OpenAI APIs - Embedding](openai_api_embeddings.ipynb) and [Encode (embedding model)](native_api.html#Encode-(embedding-model)).\n",
|
||||
"- For Reward Models, see [Classify (reward model)](native_api.html#Classify-(reward-model))."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Launch A Server"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"# This is equivalent to running the following command in your terminal\n",
|
||||
"# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n",
|
||||
"\n",
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
|
||||
" --host 0.0.0.0\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using cURL\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import subprocess, json\n",
|
||||
"\n",
|
||||
"curl_command = f\"\"\"\n",
|
||||
"curl -s http://localhost:{port}/v1/chat/completions \\\n",
|
||||
" -H \"Content-Type: application/json\" \\\n",
|
||||
" -d '{{\"model\": \"qwen/qwen2.5-0.5b-instruct\", \"messages\": [{{\"role\": \"user\", \"content\": \"What is the capital of France?\"}}]}}'\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
|
||||
"print_highlight(response)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Python Requests"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/v1/chat/completions\"\n",
|
||||
"\n",
|
||||
"data = {\n",
|
||||
" \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"print_highlight(response.json())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using OpenAI Python Client"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import openai\n",
|
||||
"\n",
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"print_highlight(response)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Streaming"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import openai\n",
|
||||
"\n",
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"# Use stream=True for streaming responses\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"qwen/qwen2.5-0.5b-instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
" stream=True,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Handle the streaming output\n",
|
||||
"for chunk in response:\n",
|
||||
" if chunk.choices[0].delta.content:\n",
|
||||
" print(chunk.choices[0].delta.content, end=\"\", flush=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Native Generation APIs\n",
|
||||
"\n",
|
||||
"You can also use the native `/generate` endpoint with requests, which provides more flexibility. An API reference is available at [Sampling Parameters](sampling_params.md)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"response = requests.post(\n",
|
||||
" f\"http://localhost:{port}/generate\",\n",
|
||||
" json={\n",
|
||||
" \"text\": \"The capital of France is\",\n",
|
||||
" \"sampling_params\": {\n",
|
||||
" \"temperature\": 0,\n",
|
||||
" \"max_new_tokens\": 32,\n",
|
||||
" },\n",
|
||||
" },\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(response.json())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Streaming"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests, json\n",
|
||||
"\n",
|
||||
"response = requests.post(\n",
|
||||
" f\"http://localhost:{port}/generate\",\n",
|
||||
" json={\n",
|
||||
" \"text\": \"The capital of France is\",\n",
|
||||
" \"sampling_params\": {\n",
|
||||
" \"temperature\": 0,\n",
|
||||
" \"max_new_tokens\": 32,\n",
|
||||
" },\n",
|
||||
" \"stream\": True,\n",
|
||||
" },\n",
|
||||
" stream=True,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"prev = 0\n",
|
||||
"for chunk in response.iter_lines(decode_unicode=False):\n",
|
||||
" chunk = chunk.decode(\"utf-8\")\n",
|
||||
" if chunk and chunk.startswith(\"data:\"):\n",
|
||||
" if chunk == \"data: [DONE]\":\n",
|
||||
" break\n",
|
||||
" data = json.loads(chunk[5:].strip(\"\\n\"))\n",
|
||||
" output = data[\"text\"]\n",
|
||||
" print(output[prev:], end=\"\", flush=True)\n",
|
||||
" prev = len(output)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||