# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime.
It is the low-level endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](https://github.com/sgl-project/sglang?tab=readme-ov-file#openai-compatible-api).
The `/generate` endpoint accepts the following arguments in JSON format.
```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can specify either text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The embeddings for input_ids; one can specify either text, input_ids, or input_embeds.
    input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
    # The image input. It can be a file name, a URL, or a base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling parameters. See descriptions below.
    sampling_params: Optional[Union[List[Dict], Dict]] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # If returning logprobs, the start location in the prompt for returning logprobs.
    # By default, this value is -1, which means logprobs are only returned for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # If returning logprobs, the number of top logprobs to return at each position.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens into text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
```
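For example, you can send pre-tokenized input through `input_ids` instead of `text`. A minimal sketch (the token ids below are placeholders; produce real ones with the tokenizer of the model you serve):
```python
import requests

# Placeholder token ids -- obtain real ones from your model's tokenizer,
# e.g., tokenizer.encode("The capital of France is") with Hugging Face.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "input_ids": [791, 6864, 315, 9822, 374],
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```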
The `sampling_params` follows this format:
```python
# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list.
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token ids in this list. Could be useful when mixed with
# `min_new_tokens`.
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token.
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization.
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization.
spaces_between_special_tokens: bool = True,
# Constrains the output to follow a given regular expression.
regex: Optional[str] = None,
# Do parallel sampling and return `n` outputs.
n: int = 1,
# Constrains the output to follow a given JSON schema.
# `regex` and `json_schema` cannot be set at the same time.
json_schema: Optional[str] = None,

## Penalties. See the [Performance Implications on Penalties] section below for more information.
# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated
# text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the
# model to repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing the logits of the
# tokenizer's EOS token and `stop_token_ids` to -inf, until the output length reaches the
# given value.
# Note that any of the `stop` strings can still be generated before reaching `min_new_tokens`,
# as it is difficult to infer the correct token ids from the given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this constraint.
min_new_tokens: int = 0,
```
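As an example of mixing `stop_token_ids` with `min_new_tokens`, the sketch below forces at least 16 output tokens before the stop token can take effect. The id 128009 is an assumption (it corresponds to `<|eot_id|>` in the Meta-Llama-3 tokenizer used in the examples below); substitute the id for your model:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            # Assumed id of <|eot_id|> for Meta-Llama-3; check your tokenizer.
            "stop_token_ids": [128009],
            # EOS and stop_token_ids logits are pushed to -inf until at
            # least this many tokens have been generated.
            "min_new_tokens": 16,
        },
    },
)
print(response.json())
```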
## Examples
### Normal
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```
Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```
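The top-level `return_logprob` fields described above can be combined with any request. A sketch that also asks for the top 5 logprobs at each output position and for detokenized text in the result:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
        "return_logprob": True,
        "top_logprobs_num": 5,            # top logprobs at each position
        "return_text_in_logprobs": True,  # detokenize tokens in the result
    },
)
print(response.json())
```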
### Streaming
Send a request and stream the output
```python
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```
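Since `text` accepts either a single prompt or a batch of prompts, several prompts can also be packed into one request. A minimal sketch (non-streaming; the response is expected to contain one result per prompt):
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": [
            "The capital of France is",
            "The capital of Germany is",
        ],
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
for result in response.json():  # one result per prompt in the batch
    print(result["text"])
```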
### Multi modal
Launch a server
```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
```
Download an image
```
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
```
Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
        "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```
The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
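For example, to send the image as a base64 encoded string instead of a file name, a sketch using only the standard library:
```python
import base64
import requests

with open("example_image.png", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
        "<|im_start|>assistant\n",
        "image_data": encoded_image,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```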
Streaming is supported in a similar manner as [above](#streaming).
### Structured decoding (JSON, Regex)
You can specify a JSON schema or a regular expression to constrain the model output; the output is then guaranteed to follow the given constraint.
```python
import json
import requests

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())
```
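Parallel sampling via the `n` parameter in `sampling_params` follows the same request pattern. A sketch with a non-zero temperature so that the samples can differ:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0.7,
            "max_new_tokens": 32,
            "n": 2,  # return two independent samples
        },
    },
)
print(response.json())
```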