refactor: centralize CoT parsing in backend for streaming mode (#16394)

* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing the legacy <think> parsing logic (a sketch of the resulting handler follows this list)
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
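
A minimal sketch of what the resulting delta handling amounts to; the reasoning_content and content fields match the server's SSE deltas, while the type names and the applyDelta helper are illustrative:

```ts
// Illustrative types; only the delta field names mirror the server API.
interface StreamDelta {
  content?: string;
  reasoning_content?: string; // already separated out by the backend parser
}

interface AssistantMessage {
  content: string;
  thinking: string;
}

function applyDelta(message: AssistantMessage, delta: StreamDelta): void {
  // No client-side <think> scrubbing: each field is appended verbatim.
  if (delta.reasoning_content) message.thinking += delta.reasoning_content;
  if (delta.content) message.content += delta.content;
}
```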

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
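
As a rough illustration of the chunked strategy (held-back partial tags, multiple reasoning segments, content continuing after </think>), here is a self-contained TypeScript sketch; the real parser is the C++ try_parse_reasoning(), and every name below is invented:

```ts
// Illustrative only: mirrors the incremental strategy, not the C++ code.
interface ParseResult {
  reasoningContent: string; // text inside <think>...</think> segments
  content: string;          // text outside the reasoning tags
  pending: string;          // held-back suffix that may be a partial tag
}

function parseReasoningChunked(chunks: Iterable<string>): ParseResult {
  const START = '<think>';
  const END = '</think>';
  let inThink = false;
  let buf = '';
  const out: ParseResult = { reasoningContent: '', content: '', pending: '' };

  const flushTo = (piece: string) => {
    if (inThink) out.reasoningContent += piece;
    else out.content += piece;
  };

  for (const chunk of chunks) {
    buf += chunk;
    for (;;) {
      const tag = inThink ? END : START;
      const idx = buf.indexOf(tag);
      if (idx >= 0) {
        flushTo(buf.slice(0, idx));
        buf = buf.slice(idx + tag.length);
        // Keep going after </think>: later text is regular content, and a
        // later <think> may open another reasoning segment.
        inThink = !inThink;
        continue;
      }
      // No complete tag in the buffer: hold back the longest suffix that
      // could still be a partial tag split across chunk boundaries.
      let hold = 0;
      for (let k = Math.min(tag.length - 1, buf.length); k > 0; k--) {
        if (buf.endsWith(tag.slice(0, k))) {
          hold = k;
          break;
        }
      }
      flushTo(buf.slice(0, buf.length - hold));
      buf = buf.slice(buf.length - hold);
      break;
    }
  }

  out.pending = buf; // e.g. "<thi" cut off mid-stream
  return out;
}
```

Feeding it ['<thi', 'nk>plan</think> answer'] yields reasoningContent 'plan' and content ' answer', which is the separation the streaming fix is after.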

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection so partial/streaming flows are handled correctly (see the sketch below)
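
Roughly, the bookkeeping looks like this; thinking_forced_open is the real flag name, everything else in the sketch is invented:

```ts
// Sketch of the prefix handling described above, not the C++ implementation.
interface ReasoningState {
  thinkingForcedOpen: boolean; // the template already opened a reasoning block
  pendingPrefix: string;       // exact opening sequence captured from input
  reasoningContent: string;
}

// Called on every new start_think detection, so partial/streaming flows
// re-capture the prefix instead of losing it.
function onStartThink(state: ReasoningState, seenSequence: string): void {
  if (state.thinkingForcedOpen) {
    state.pendingPrefix = seenSequence;
  }
}

function accumulateReasoning(state: ReasoningState, segment: string): void {
  // Inject the captured prefix before the first accumulated segment only,
  // then clear it to avoid duplicating it on later flushes.
  if (state.pendingPrefix) {
    state.reasoningContent += state.pendingPrefix;
    state.pendingPrefix = '';
  }
  state.reasoningContent += segment;
}
```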

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering (illustrated by the sketch below)
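
Conceptually the toggle just short-circuits both formatting layers; a hypothetical sketch (disableReasoningFormat is the real setting key used in the stories below, the render helpers are stand-ins):

```ts
// Hypothetical gating only; renderPlainText/renderMarkdown are stand-ins.
declare function renderPlainText(text: string): string;
declare function renderMarkdown(text: string, opts: { thinking: string }): string;

function renderAssistantMessage(
  msg: { content: string; thinking: string },
  disableReasoningFormat: boolean
): string {
  if (disableReasoningFormat) {
    // Raw LLM output: no backend reasoning split, no Markdown rendering.
    return renderPlainText(msg.content);
  }
  return renderMarkdown(msg.content, { thinking: msg.thinking });
}
```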

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
commit 12bbc3fa50 (parent 9d0882840e)
Author: Pascal
Date:   2025-10-08 22:18:41 +02:00
Committed by: GitHub

14 changed files with 276 additions and 431 deletions

@@ -36,6 +36,31 @@
children: []
};
const assistantWithReasoning: DatabaseMessage = {
id: '3',
convId: 'conv-1',
type: 'message',
timestamp: Date.now() - 1000 * 60 * 2,
role: 'assistant',
content: "Here's the concise answer, now that I've thought it through carefully for you.",
parent: '1',
thinking:
"Let's consider the user's question step by step:\\n\\n1. Identify the core problem\\n2. Evaluate relevant information\\n3. Formulate a clear answer\\n\\nFollowing this process ensures the final response stays focused and accurate.",
children: []
};
const rawOutputMessage: DatabaseMessage = {
id: '6',
convId: 'conv-1',
type: 'message',
timestamp: Date.now() - 1000 * 60,
role: 'assistant',
content:
'<|channel|>analysis<|message|>User greeted me. Initiating overcomplicated analysis: Is this a trap? No, just a normal hello. Respond calmly, act like a helpful assistant, and do not start explaining quantum physics again. Confidence 0.73. Engaging socially acceptable greeting protocol...<|end|>Hello there! How can I help you today?',
parent: '1',
thinking: '',
children: []
};
let processingMessage = $state({
id: '4',
convId: 'conv-1',
@@ -59,60 +84,6 @@
thinking: '',
children: []
});
// Message with <think> format thinking content
const thinkTagMessage: DatabaseMessage = {
id: '6',
convId: 'conv-1',
type: 'message',
timestamp: Date.now() - 1000 * 60 * 2,
role: 'assistant',
content:
"<think>\nLet me analyze this step by step:\n\n1. The user is asking about thinking formats\n2. I need to demonstrate the &lt;think&gt; tag format\n3. This content should be displayed in the thinking section\n4. The main response should be separate\n\nThis is a good example of reasoning content.\n</think>\n\nHere's my response after thinking through the problem. The thinking content above should be displayed separately from this main response content.",
parent: '1',
thinking: '',
children: []
};
// Message with [THINK] format thinking content
const thinkBracketMessage: DatabaseMessage = {
id: '7',
convId: 'conv-1',
type: 'message',
timestamp: Date.now() - 1000 * 60 * 1,
role: 'assistant',
content:
'[THINK]\nThis is the DeepSeek-style thinking format:\n\n- Using square brackets instead of angle brackets\n- Should work identically to the &lt;think&gt; format\n- Content parsing should extract this reasoning\n- Display should be the same as &lt;think&gt; format\n\nBoth formats should be supported seamlessly.\n[/THINK]\n\nThis is the main response content that comes after the [THINK] block. The reasoning above should be parsed and displayed in the thinking section.',
parent: '1',
thinking: '',
children: []
};
// Streaming message for <think> format
let streamingThinkMessage = $state({
id: '8',
convId: 'conv-1',
type: 'message',
timestamp: 0, // No timestamp = streaming
role: 'assistant',
content: '',
parent: '1',
thinking: '',
children: []
});
// Streaming message for [THINK] format
let streamingBracketMessage = $state({
id: '9',
convId: 'conv-1',
type: 'message',
timestamp: 0, // No timestamp = streaming
role: 'assistant',
content: '',
parent: '1',
thinking: '',
children: []
});
</script>
<Story
@@ -120,6 +91,10 @@
args={{
message: userMessage
}}
play={async () => {
const { updateConfig } = await import('$lib/stores/settings.svelte');
updateConfig('disableReasoningFormat', false);
}}
/>
<Story
@@ -128,15 +103,45 @@
class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
message: assistantMessage
}}
play={async () => {
const { updateConfig } = await import('$lib/stores/settings.svelte');
updateConfig('disableReasoningFormat', false);
}}
/>
<Story
name="WithThinkingBlock"
name="AssistantWithReasoning"
args={{
class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
message: assistantWithReasoning
}}
play={async () => {
const { updateConfig } = await import('$lib/stores/settings.svelte');
updateConfig('disableReasoningFormat', false);
}}
/>
<Story
name="RawLlmOutput"
args={{
class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
message: rawOutputMessage
}}
play={async () => {
const { updateConfig } = await import('$lib/stores/settings.svelte');
updateConfig('disableReasoningFormat', true);
}}
/>
<Story
name="WithReasoningContent"
args={{
message: streamingMessage
}}
asChild
play={async () => {
const { updateConfig } = await import('$lib/stores/settings.svelte');
updateConfig('disableReasoningFormat', false);
// Phase 1: Stream reasoning content in chunks
let reasoningText =
'I need to think about this carefully. Let me break down the problem:\n\n1. The user is asking for help with something complex\n2. I should provide a thorough and helpful response\n3. I need to consider multiple approaches\n4. The best solution would be to explain step by step\n\nThis approach will ensure clarity and understanding.';
@@ -187,126 +192,16 @@
message: processingMessage
}}
play={async () => {
const { updateConfig } = await import('$lib/stores/settings.svelte');
updateConfig('disableReasoningFormat', false);
// Import the chat store to simulate loading state
const { chatStore } = await import('$lib/stores/chat.svelte');
// Set loading state to true to trigger the processing UI
chatStore.isLoading = true;
// Simulate the processing state hook behavior
// This will show the "Generating..." text and parameter details
await new Promise(resolve => setTimeout(resolve, 100));
await new Promise((resolve) => setTimeout(resolve, 100));
}}
/>
<Story
name="ThinkTagFormat"
args={{
class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
message: thinkTagMessage
}}
/>
<Story
name="ThinkBracketFormat"
args={{
class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
message: thinkBracketMessage
}}
/>
<Story
name="StreamingThinkTag"
args={{
message: streamingThinkMessage
}}
parameters={{
test: {
timeout: 30000
}
}}
asChild
play={async () => {
// Phase 1: Stream <think> reasoning content
const thinkingContent =
'Let me work through this problem systematically:\n\n1. First, I need to understand what the user is asking\n2. Then I should consider different approaches\n3. I need to evaluate the pros and cons\n4. Finally, I should provide a clear recommendation\n\nThis step-by-step approach will ensure accuracy.';
let currentContent = '<think>\n';
streamingThinkMessage.content = currentContent;
for (let i = 0; i < thinkingContent.length; i++) {
currentContent += thinkingContent[i];
streamingThinkMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 5));
}
// Close the thinking block
currentContent += '\n</think>\n\n';
streamingThinkMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 200));
// Phase 2: Stream main response content
const responseContent =
"Based on my analysis above, here's the solution:\n\n**Key Points:**\n- The approach should be systematic\n- We need to consider all factors\n- Implementation should be step-by-step\n\nThis ensures the best possible outcome.";
for (let i = 0; i < responseContent.length; i++) {
currentContent += responseContent[i];
streamingThinkMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 10));
}
streamingThinkMessage.timestamp = Date.now();
}}
>
<div class="w-[56rem]">
<ChatMessage message={streamingThinkMessage} />
</div>
</Story>
<Story
name="StreamingThinkBracket"
args={{
message: streamingBracketMessage
}}
parameters={{
test: {
timeout: 30000
}
}}
asChild
play={async () => {
// Phase 1: Stream [THINK] reasoning content
const thinkingContent =
'Using the DeepSeek format now:\n\n- This demonstrates the &#91;THINK&#93; bracket format\n- Should parse identically to &lt;think&gt; tags\n- The UI should display this in the thinking section\n- Main content should be separate\n\nBoth formats provide the same functionality.';
let currentContent = '[THINK]\n';
streamingBracketMessage.content = currentContent;
for (let i = 0; i < thinkingContent.length; i++) {
currentContent += thinkingContent[i];
streamingBracketMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 5));
}
// Close the thinking block
currentContent += '\n[/THINK]\n\n';
streamingBracketMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 200));
// Phase 2: Stream main response content
const responseContent =
"Here's my response after using the &#91;THINK&#93; format:\n\n**Observations:**\n- Both &lt;think&gt; and &#91;THINK&#93; formats work seamlessly\n- The parsing logic handles both cases\n- UI display is consistent across formats\n\nThis demonstrates the enhanced thinking content support.";
for (let i = 0; i < responseContent.length; i++) {
currentContent += responseContent[i];
streamingBracketMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 10));
}
streamingBracketMessage.timestamp = Date.now();
}}
>
<div class="w-[56rem]">
<ChatMessage message={streamingBracketMessage} />
</div>
</Story>