[Feature] Prefill assistant response - add continue_final_message parameter (#4226)

Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-04-21 06:07:18 +05:30
parent 5156d5a413
commit 8b39274e34
6 changed files with 82 additions and 23 deletions
--- a/examples/runtime/README.md
+++ b/examples/runtime/README.md
@@ -8,7 +8,9 @@ The below examples will mostly need you to start a server in a separate terminal
 * `multimodal_embedding.py`: An example how perform [multi modal embedding](Alibaba-NLP/gme-Qwen2-VL-2B-Instruct).
 * `openai_batch_chat.py`: An example how to process batch requests for chat completions.
 * `openai_batch_complete.py`: An example how to process batch requests for text completions.
-* `openai_chat_with_response_prefill.py`: An example how to [prefill](https://eugeneyan.com/writing/prompting/#prefill-claudes-responses) a response using OpenAI API.
+* **`openai_chat_with_response_prefill.py`**:
+  An example that demonstrates how to [prefill a response](https://eugeneyan.com/writing/prompting/#prefill-claudes-responses) using the OpenAI API by enabling the `continue_final_message` parameter.
+  When enabled, the final (partial) assistant message is removed and its content is used as a prefill so that the model continues that message rather  than starting a new turn. See [Anthropic's prefill example](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response#example-structured-data-extraction-with-prefilling) for more context.
 * `reward_model.py`: An example how to extract scores from a reward model.
 * `vertex_predict.py`: An example how to deploy a model to [Vertex AI](https://cloud.google.com/vertex-ai?hl=en).