[Feat] Expose logprob options to sgl.gen API (#503)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
胡译文
2024-07-09 15:35:39 +08:00
committed by GitHub
parent d557e9f3b7
commit 02b7258658
7 changed files with 239 additions and 43 deletions

@@ -279,8 +279,8 @@ for out in state.text_iter():
```
### Tips and Implementation Details
-- The `choices` argument in `sgl.gen` is implemented by computing the normalized log probabilities of all choices and selecting the one with the highest probability.
-- The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex.
+- The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
+- The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
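The token-length normalization mentioned in the `choices` tip can be sketched roughly as follows. This is an illustration only, not SGLang's implementation: `select_choice` is a hypothetical helper, and the per-token log probabilities are made-up numbers standing in for values a model would return.

```python
from typing import Dict, List


def select_choice(token_logprobs: Dict[str, List[float]]) -> str:
    """Return the choice with the highest mean per-token log probability.

    `token_logprobs` maps each candidate string to the log probabilities
    of its tokens (hypothetical input format, assumed for this sketch).
    """
    if not token_logprobs:
        raise ValueError("at least one choice is required")
    normalized = {}
    for choice, logprobs in token_logprobs.items():
        if not logprobs:
            raise ValueError(f"choice {choice!r} has no tokens")
        # Divide the summed log probability by the token count so that
        # longer choices are not penalized just for having more tokens.
        normalized[choice] = sum(logprobs) / len(logprobs)
    return max(normalized, key=normalized.get)


# Made-up values: the longer choice has the lower total log probability
# (-0.9 vs. -0.5) but the higher per-token average (-0.3 vs. -0.5).
example = {
    "yes": [-0.5],
    "absolutely not": [-0.4, -0.3, -0.2],
}
best = select_choice(example)
```

With these numbers, comparing raw sums would favor the one-token choice, while the per-token average favors the three-token one, which is the effect the normalization is meant to have.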
## Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
@@ -337,7 +337,6 @@ response = client.chat.completions.create(
print(response)
```
By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
@@ -384,9 +383,8 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
- Llama
- Mistral
- Mixtral
-- Qwen / Qwen 2
-- Gemma
-- Please add a new flag `--attention-reduce-in-fp32` to avoid some precision errors.
+- Qwen / Qwen 2 / Qwen 2 MoE
+- Gemma / Gemma 2
- `python -m sglang.launch_server --model-path google/gemma-7b-it --port 30000 --attention-reduce-in-fp32`
- LLaVA
- `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
@@ -399,6 +397,8 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
- StableLM
- Command-R
- DBRX
+- Grok
+- ChatGLM
- AWQ/GPTQ/Marlin quantization
Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md).