{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reasoning Parser\n",
"\n",
"SGLang supports parsing reasoning content out from \"normal\" content for reasoning models such as [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).\n",
"\n",
"## Supported Models & Parsers\n",
"\n",
"| Model | Reasoning tags | Parser | Notes |\n",
"|---------|-----------------------------|------------------|-------|\n",
"| [DeepSeekR1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |\n",
"| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |\n",
"| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |\n",
"| [Kimi models](https://huggingface.co/moonshotai/models) | `◁think▷` … `◁/think▷` | `kimi` | Uses special thinking delimiters |\n",
"\n",
"### Model-Specific Behaviors\n",
"\n",
"**DeepSeek-R1 Family:**\n",
"- DeepSeek-R1: No `<think>` start tag, jumps directly to thinking content\n",
"- DeepSeek-R1-0528: Generates both `<think>` start and `</think>` end tags\n",
"- Both are handled by the same `deepseek-r1` parser\n",
"\n",
"**Qwen3 Family:**\n",
"- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat templates\n",
"- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use `qwen3` or `qwen3-thinking` parser, always thinks\n",
"\n",
"**Kimi:**\n",
"- Kimi: Uses special `◁think▷` and `◁/think▷` tags"
]
},
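{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of the tag-based splitting (the model output below is hypothetical), a DeepSeek-R1-0528 style completion such as\n",
"\n",
"```\n",
"<think>The user asks for 1+3. Adding the two numbers gives 4.</think>The answer is 4.\n",
"```\n",
"\n",
"is separated by the `deepseek-r1` parser into the reasoning part (everything inside the tags) and the normal part (`The answer is 4.`)."
]
},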
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"\n",
"### Launching the Server"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify the `--reasoning-parser` option."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from openai import OpenAI\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `--reasoning-parser` defines the parser used to interpret responses."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### OpenAI Compatible API\n",
"\n",
"Using the OpenAI compatible API, the contract follows the [DeepSeek API design](https://api-docs.deepseek.com/guides/reasoning_model) established with the release of DeepSeek-R1:\n",
"\n",
"- `reasoning_content`: The content of the CoT.\n",
"- `content`: The content of the final answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize OpenAI-like client\n",
"client = OpenAI(api_key=\"None\", base_url=f\"http://0.0.0.0:{port}/v1\")\n",
"model_name = client.models.list().data[0].id\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"What is 1+3?\",\n",
" }\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Non-Streaming Request"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_non_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=False, # Non-streaming\n",
" extra_body={\"separate_reasoning\": True},\n",
")\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(response_non_stream.choices[0].message.reasoning_content)\n",
"\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(response_non_stream.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Streaming Request"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=True, # Non-streaming\n",
" extra_body={\"separate_reasoning\": True},\n",
")\n",
"\n",
"reasoning_content = \"\"\n",
"content = \"\"\n",
"for chunk in response_stream:\n",
" if chunk.choices[0].delta.content:\n",
" content += chunk.choices[0].delta.content\n",
" if chunk.choices[0].delta.reasoning_content:\n",
" reasoning_content += chunk.choices[0].delta.reasoning_content\n",
"\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(reasoning_content)\n",
"\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can buffer the reasoning content to the last reasoning chunk (or the first chunk after the reasoning content)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=True, # Non-streaming\n",
" extra_body={\"separate_reasoning\": True, \"stream_reasoning\": False},\n",
")\n",
"\n",
"reasoning_content = \"\"\n",
"content = \"\"\n",
"for chunk in response_stream:\n",
" if chunk.choices[0].delta.content:\n",
" content += chunk.choices[0].delta.content\n",
" if chunk.choices[0].delta.reasoning_content:\n",
" reasoning_content = chunk.choices[0].delta.reasoning_content\n",
"\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(reasoning_content)\n",
"\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reasoning separation is enable by default when specify . \n",
"**To disable it, set the `separate_reasoning` option to `False` in request.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_non_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=False, # Non-streaming\n",
" extra_body={\"separate_reasoning\": False},\n",
")\n",
"\n",
"print_highlight(\"==== Original Output ====\")\n",
"print_highlight(response_non_stream.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SGLang Native API "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
"input = tokenizer.apply_chat_template(\n",
" messages,\n",
" tokenize=False,\n",
" add_generation_prompt=True,\n",
")\n",
"\n",
"gen_url = f\"http://localhost:{port}/generate\"\n",
"gen_data = {\n",
" \"text\": input,\n",
" \"sampling_params\": {\n",
" \"skip_special_tokens\": False,\n",
" \"max_new_tokens\": 1024,\n",
" \"temperature\": 0.6,\n",
" \"top_p\": 0.95,\n",
" },\n",
"}\n",
"gen_response = requests.post(gen_url, json=gen_data).json()[\"text\"]\n",
"\n",
"print_highlight(\"==== Original Output ====\")\n",
"print_highlight(gen_response)\n",
"\n",
"parse_url = f\"http://localhost:{port}/separate_reasoning\"\n",
"separate_reasoning_data = {\n",
" \"text\": gen_response,\n",
" \"reasoning_parser\": \"deepseek-r1\",\n",
"}\n",
"separate_reasoning_response_json = requests.post(\n",
" parse_url, json=separate_reasoning_data\n",
").json()\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(separate_reasoning_response_json[\"reasoning_text\"])\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(separate_reasoning_response_json[\"text\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Offline Engine API"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sglang as sgl\n",
"from sglang.srt.reasoning_parser import ReasoningParser\n",
"from sglang.utils import print_highlight\n",
"\n",
"llm = sgl.Engine(model_path=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
"tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
"input = tokenizer.apply_chat_template(\n",
" messages,\n",
" tokenize=False,\n",
" add_generation_prompt=True,\n",
")\n",
"sampling_params = {\n",
" \"max_new_tokens\": 1024,\n",
" \"skip_special_tokens\": False,\n",
" \"temperature\": 0.6,\n",
" \"top_p\": 0.95,\n",
"}\n",
"result = llm.generate(prompt=input, sampling_params=sampling_params)\n",
"\n",
"generated_text = result[\"text\"] # Assume there is only one prompt\n",
"\n",
"print_highlight(\"==== Original Output ====\")\n",
"print_highlight(generated_text)\n",
"\n",
"parser = ReasoningParser(\"deepseek-r1\")\n",
"reasoning_text, text = parser.parse_non_stream(generated_text)\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(reasoning_text)\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(text)"
]
},
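{
"cell_type": "markdown",
"metadata": {},
"source": [
"`ReasoningParser` also supports incremental parsing via `parse_stream_chunk`. The sketch below simulates a token stream by feeding the generated text to the parser in fixed-size slices; in a real deployment the chunks would come from the engine's streaming output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use a fresh parser instance, since it keeps internal streaming state.\n",
"stream_parser = ReasoningParser(\"deepseek-r1\")\n",
"\n",
"streamed_reasoning = \"\"\n",
"streamed_text = \"\"\n",
"chunk_size = 16  # arbitrary slice size, only to simulate streaming\n",
"for i in range(0, len(generated_text), chunk_size):\n",
"    reasoning_part, normal_part = stream_parser.parse_stream_chunk(\n",
"        generated_text[i : i + chunk_size]\n",
"    )\n",
"    streamed_reasoning += reasoning_part\n",
"    streamed_text += normal_part\n",
"\n",
"print_highlight(\"==== Reasoning (streamed) ====\")\n",
"print_highlight(streamed_reasoning)\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(streamed_text)"
]
},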
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm.shutdown()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Supporting New Reasoning Model Schemas\n",
"\n",
"For future reasoning models, you can implement the reasoning parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/reasoning_parser.py` and specify the reasoning parser for new reasoning model schemas accordingly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"class DeepSeekR1Detector(BaseReasoningFormatDetector):\n",
" \"\"\"\n",
" Detector for DeepSeek-R1 family models.\n",
" \n",
" Supported models:\n",
" - DeepSeek-R1: Always generates thinking content without <think> start tag\n",
" - DeepSeek-R1-0528: Generates thinking content with <think> start tag\n",
" \n",
" This detector handles both patterns automatically.\n",
" \"\"\"\n",
"\n",
" def __init__(self, stream_reasoning: bool = True):\n",
" super().__init__(\"<think>\", \"</think>\", force_reasoning=True, stream_reasoning=stream_reasoning)\n",
"\n",
"\n",
"class Qwen3Detector(BaseReasoningFormatDetector):\n",
" \"\"\"\n",
" Detector for standard Qwen3 models that support enable_thinking parameter.\n",
" \n",
" These models can switch between thinking and non-thinking modes:\n",
" - enable_thinking=True: Generates <think>...</think> tags\n",
" - enable_thinking=False: No thinking content generated\n",
" \"\"\"\n",
"\n",
" def __init__(self, stream_reasoning: bool = True):\n",
" super().__init__(\"<think>\", \"</think>\", force_reasoning=False, stream_reasoning=stream_reasoning)\n",
"\n",
"\n",
"class Qwen3ThinkingDetector(BaseReasoningFormatDetector):\n",
" \"\"\"\n",
" Detector for Qwen3-Thinking models (e.g., Qwen3-235B-A22B-Thinking-2507).\n",
" \n",
" These models always generate thinking content without <think> start tag.\n",
" They do not support the enable_thinking parameter.\n",
" \"\"\"\n",
"\n",
" def __init__(self, stream_reasoning: bool = True):\n",
" super().__init__(\"<think>\", \"</think>\", force_reasoning=True, stream_reasoning=stream_reasoning)\n",
"\n",
"\n",
"class ReasoningParser:\n",
" \"\"\"\n",
" Parser that handles both streaming and non-streaming scenarios.\n",
" \n",
" Usage:\n",
" # For standard Qwen3 models with enable_thinking support\n",
" parser = ReasoningParser(\"qwen3\")\n",
" \n",
" # For Qwen3-Thinking models that always think\n",
" parser = ReasoningParser(\"qwen3-thinking\")\n",
" \"\"\"\n",
"\n",
" DetectorMap: Dict[str, Type[BaseReasoningFormatDetector]] = {\n",
" \"deepseek-r1\": DeepSeekR1Detector,\n",
" \"qwen3\": Qwen3Detector,\n",
" \"qwen3-thinking\": Qwen3ThinkingDetector,\n",
" \"kimi\": KimiDetector,\n",
" }\n",
"\n",
" def __init__(self, model_type: str = None, stream_reasoning: bool = True):\n",
" if not model_type:\n",
" raise ValueError(\"Model type must be specified\")\n",
"\n",
" detector_class = self.DetectorMap.get(model_type.lower())\n",
" if not detector_class:\n",
" raise ValueError(f\"Unsupported model type: {model_type}\")\n",
"\n",
" self.detector = detector_class(stream_reasoning=stream_reasoning)\n",
"\n",
" def parse_non_stream(self, full_text: str) -> Tuple[str, str]:\n",
" \"\"\"Returns (reasoning_text, normal_text)\"\"\"\n",
" ret = self.detector.detect_and_parse(full_text)\n",
" return ret.reasoning_text, ret.normal_text\n",
"\n",
" def parse_stream_chunk(self, chunk_text: str) -> Tuple[str, str]:\n",
" \"\"\"Returns (reasoning_text, normal_text) for the current chunk\"\"\"\n",
" ret = self.detector.parse_streaming_increment(chunk_text)\n",
" return ret.reasoning_text, ret.normal_text\n",
"```"
]
}
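,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, a detector for a hypothetical model that wraps its chain of thought in `<reasoning>` … `</reasoning>` tags (the tag names and the `my-model` key are illustrative, not an existing model) subclasses `BaseReasoningFormatDetector` and registers itself in `DetectorMap`:\n",
"\n",
"```python\n",
"class MyModelDetector(BaseReasoningFormatDetector):\n",
"    \"\"\"Hypothetical detector for a model emitting <reasoning>...</reasoning> tags.\"\"\"\n",
"\n",
"    def __init__(self, stream_reasoning: bool = True):\n",
"        # force_reasoning=False: text counts as reasoning only after the\n",
"        # start tag has been observed in the output.\n",
"        super().__init__(\n",
"            \"<reasoning>\",\n",
"            \"</reasoning>\",\n",
"            force_reasoning=False,\n",
"            stream_reasoning=stream_reasoning,\n",
"        )\n",
"\n",
"\n",
"# Register the detector so it can be looked up by name.\n",
"ReasoningParser.DetectorMap[\"my-model\"] = MyModelDetector\n",
"```\n",
"\n",
"After registration, `ReasoningParser(\"my-model\")` resolves to the new detector; assuming the server validates `--reasoning-parser` against the same map, the key can also be passed at launch."
]
}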
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}