[Feature] Prefill assistant response - add continue_final_message parameter (#4226)

Co-authored-by: Chayenne <zhaochen20@outlook.com>
Adarsh Shirawalmath
2025-04-21 06:07:18 +05:30
committed by GitHub
parent 5156d5a413
commit 8b39274e34
6 changed files with 82 additions and 23 deletions


@@ -51,6 +51,7 @@ Please refer to our dedicated guide on [constrained decoding](./structured_outpu
* `n: int = 1`: Specifies the number of output sequences to generate per request. Generating multiple outputs in one request (`n > 1`) is discouraged; repeating the same prompt several times offers better control and efficiency.
* `spaces_between_special_tokens: bool = True`: Whether or not to add spaces between special tokens during detokenization.
* `no_stop_trim: bool = False`: Don't trim stop words or the EOS token from the generated text.
* `continue_final_message: bool = False`: When enabled, the final assistant message is removed and its content is used as a prefill, so the model continues that message instead of starting a new turn. See [openai_chat_with_response_prefill.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/openai_chat_with_response_prefill.py) for examples.
* `ignore_eos: bool = False`: Don't stop generation when the EOS token is sampled.
* `skip_special_tokens: bool = True`: Remove special tokens during decoding.
* `custom_params: Optional[List[Optional[Dict[str, Any]]]] = None`: Used when employing `CustomLogitProcessor`. For usage, see below.
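The `continue_final_message` behavior described above can be sketched in plain Python. This is a minimal illustration of the semantics (strip a trailing assistant message and treat its content as the generation prefix), not SGLang's actual implementation; the helper name is hypothetical:

```python
# Hypothetical sketch of the continue_final_message semantics:
# if the conversation ends with an assistant turn, that turn is removed
# and its content becomes the prefill the model must continue.
def apply_continue_final_message(messages):
    """Return (remaining_messages, prefill_text)."""
    if messages and messages[-1]["role"] == "assistant":
        *rest, final = messages
        return rest, final["content"]
    # Nothing to continue: behave like a normal new assistant turn.
    return list(messages), ""

messages = [
    {"role": "user", "content": "List three primes."},
    {"role": "assistant", "content": "Sure: 2, 3,"},
]
rest, prefill = apply_continue_final_message(messages)
# rest ends with the user turn; the model's output is appended to prefill.
```

With the OpenAI-compatible client, the flag would typically be passed through a request extension mechanism such as `extra_body={"continue_final_message": True}` on `chat.completions.create`, as shown in the linked example script.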