ModelHub XC bd7732d650 Initial commit; model provided by the ModelHub XC community
Model: tossem/friendli-broken-model-fix
Source: Original Platform
2026-04-22 10:09:59 +08:00


library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-8B

FriendliAI Take-Home Submission

Summary

This repository fixes a configuration issue that prevented the model from supporting /chat/completions. The root cause was a missing chat_template in tokenizer_config.json, which is required for rendering structured chat messages into the prompt format expected by Qwen3.

The fix adds the upstream Qwen3-8B chat_template and corrects model metadata. No model weights were modified.

Q1) Debugging LLMs

Q1(a) Root cause and fix

I inspected config.json, generation_config.json, tokenizer_config.json, tokenizer.json, and README.md.

What I found:

  • config.json identifies the checkpoint as Qwen3ForCausalLM with model_type: "qwen3".
  • generation_config.json is a normal Qwen3 sampling config and is not the blocker.
  • tokenizer.json contains the expected Qwen special tokens (<|im_start|>, <|im_end|>, <think>, </think>, etc.) and did not require changes.
  • tokenizer_config.json was missing the chat_template field entirely.
  • README.md incorrectly claimed the base model was meta-llama/Meta-Llama-3.1-8B, even though the repository is clearly a Qwen3 checkpoint.

The runtime failure is caused by the missing tokenizer_config.json.chat_template.

Why that breaks /chat/completions:

  • OpenAI-style chat servers first need to render structured messages into the raw prompt text expected by the model.
  • For Hugging Face chat models, that rendering logic comes from tokenizer.chat_template.
  • Qwen3 expects ChatML-style formatting plus Qwen3-specific handling for reasoning/tool use.
  • Without the template, the server has no model-specific way to convert messages=[...] into the correct prompt string, so chat inference fails before decoding starts.
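
To make the rendering step concrete, here is a minimal hand-rolled sketch of ChatML-style rendering, roughly what a chat template produces for Qwen-family models. The function name and message contents are invented; the real chat_template is a Jinja template with additional logic for tools and reasoning:

```python
def render_chatml(messages):
    # Hand-rolled sketch of ChatML-style rendering: each message becomes an
    # <|im_start|>role ... <|im_end|> span, followed by a generation prompt.
    # The real Qwen3 Jinja template also handles tools and thinking blocks.
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # open the assistant turn for decoding
    return "".join(parts)

prompt = render_chatml([{"role": "user", "content": "Hello"}])
```

Without a chat_template, the server has no model-specific equivalent of this function, which is exactly why /chat/completions fails.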

Minimal Fix Applied

Only two changes were required to restore chat functionality:

| File | Field | Old value | New value | Why |
| --- | --- | --- | --- | --- |
| tokenizer_config.json | chat_template | Absent | Exact upstream Qwen/Qwen3-8B template | Required so chat messages can be rendered into the prompt format Qwen3 expects. |
| README.md | base_model | meta-llama/Meta-Llama-3.1-8B | Qwen/Qwen3-8B | Documentation correction so the model card matches the actual checkpoint lineage. |

Runtime-critical change (minimal fix):

tokenizer_config.json
- "chat_template": <missing>
+ "chat_template": "<exact upstream Qwen/Qwen3-8B template>"

Files I intentionally left unchanged:

  • config.json
  • generation_config.json
  • tokenizer.json

Reason: they were not the root cause of the chat failure, and changing them would have gone beyond the minimal fix.

Why the fix was necessary

Qwen3 chat inference is template-driven. The weights were fine; the tokenizer vocabulary was fine; the generation defaults were fine. The only missing runtime contract was the chat template that tells the serving layer how to serialize roles, assistant reasoning, tool calls, and the enable_thinking switch into Qwen3's expected prompt format.

Q1(b) Why reasoning_effort currently has no observable effect

Root Cause Analysis: reasoning_effort

The reasoning_effort parameter flows through a three-layer pipeline. This model breaks at every single layer.

How the pipeline is supposed to work:

  • Client sends reasoning_effort="high"
  • [Layer 1] Inference framework (vLLM/TGI) translates it into enable_thinking=True plus a thinking_budget of N tokens
  • [Layer 2] Chat template (the Jinja template in tokenizer_config.json) reads enable_thinking and controls whether <think></think> is pre-filled empty (skipping reasoning) or left open for the model to fill
  • [Layer 3] Model weights: trained behavior generates coherent <think>…</think> blocks and scales reasoning depth in proportion to the budget


Layer 2 — Chat Template (broken; fixed in Q1(a)) This is the only mechanism by which enable_thinking controls output: when False, an empty <think></think> block is pre-filled into the prompt, signaling the model to skip reasoning and answer directly. When True (or unset), the block is left open for the model to generate into. broken-model's original tokenizer_config.json had no chat_template field at all, so this layer was broken; the template added in Q1(a) restores it.
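
The pre-fill behavior can be sketched as follows. This is an illustrative stand-in for the template's enable_thinking branch, not the real Jinja code, and the exact whitespace inside the empty block is an assumption:

```python
def render_assistant_turn(enable_thinking=True):
    # Illustrative stand-in for the chat template's enable_thinking branch.
    prompt = "<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-fill an empty reasoning block so the model skips straight
        # to the final answer instead of generating reasoning content.
        prompt += "<think>\n\n</think>\n\n"
    return prompt
```

With enable_thinking=True the block is absent, leaving the model free to open its own `<think>` span.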

Layer 3 — Model Weights (broken, and the deepest root cause)

Even if Layers 1 and 2 were fixed perfectly, the weights cannot respond. Thinking in Qwen3 is an emergent trained capability, not a prompt trick:

  • Genuine Qwen3 weights were trained on large-scale chain-of-thought data with <think>…</think> structure, so the model learned to generate reasoning content inside those tags and to self-regulate its depth.
  • reasoning_effort maps to a thinking_budget (token count). Enforcement requires the serving framework to count tokens inside the thinking span and force-insert </think> when the budget is exhausted; the model must then have been trained to pivot to a final answer at that point.
  • broken-model's weights come from Llama-3.1-8B, which has zero training signal for any of this. The <think> token (ID 151667) is in the vocabulary (from the Qwen3 tokenizer) but is completely meaningless to the Llama weight matrices. The model has no learned association between that token and "begin internal reasoning", no learned behavior to produce useful thoughts, and no learned response to a budget cutoff.


In short: the template fix from Q1(a) addressed only a surface symptom. The model is fundamentally incapable of supporting reasoning_effort until requirement 5 (weight replacement with a genuine Qwen3 checkpoint) is met, after which requirements 2–4 must also be verified.

Q2) Python Asyncio

Q2(a) Two pros and two cons of asyncio's await-yielding design vs Trio

Pro 1: cheap fast paths when the awaited work is already available

In asyncio, await does not guarantee a context switch. If the awaited coroutine returns without hitting its own suspension point, the caller continues immediately. That is useful for cache-heavy paths.

import asyncio

cache = {"answer": 42}

async def get_value(key: str) -> int:
    if key in cache:
        return cache[key]  # no checkpoint here
    await asyncio.sleep(0.050)
    return -1

Compared to Trio, this is a nice performance property, but it comes with a fairness tradeoff: the caller cannot assume that await get_value(...) actually yielded to the scheduler.
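
This no-yield behavior is directly observable. In the sketch below (all names invented), a task that only awaits an already-ready coroutine runs to completion before a concurrently created task gets its first turn:

```python
import asyncio

order = []

async def ready():
    return 1  # returns without suspending: no trip through the event loop

async def caller():
    order.append("before")
    await ready()           # does NOT yield; the coroutine finishes inline
    order.append("after")   # runs immediately, before `other` is scheduled

async def other():
    order.append("other")

async def main():
    t1 = asyncio.create_task(caller())
    t2 = asyncio.create_task(other())
    await asyncio.gather(t1, t2)

asyncio.run(main())
# order is ["before", "after", "other"]: caller never yielded at its await
```

Under Trio's stricter checkpoint conventions, the interleaving guarantees are easier to reason about, at the cost of this fast path.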

Pro 2: very easy interop with existing Future/callback-style asyncio code

Asyncio's await model fits naturally with the ecosystem built around Future, callbacks, and event-loop primitives.

import asyncio

async def from_legacy_future() -> str:
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    loop.call_soon(fut.set_result, "ready")
    return await fut

That incremental migration story is smoother than Trio's because asyncio was designed around those native loop objects from the start.

Con 1: fairness is fragile because await may not actually yield

This is the most important downside. A coroutine can look cooperative while still monopolizing the loop if its awaited callees do not suspend.

import asyncio

async def cached_op() -> int:
    return 1  # returns immediately, so this is not a checkpoint

async def hog() -> None:
    for i in range(1_000_000):
        await cached_op()
        if i % 1000 == 0:
            await asyncio.sleep(0)  # manual fairness checkpoint

async def heartbeat() -> None:
    while True:
        print("tick")
        await asyncio.sleep(0.1)

Trio is better here conceptually because it leans much harder into explicit checkpoints and structured concurrency, so fairness and cancellation behavior are easier to reason about.

Con 2: cancellation responsiveness depends on where real checkpoints happen

In asyncio, cancellation is delivered at suspension points. If your code accidentally avoids checkpoints, cancellation latency becomes unpredictable.

import asyncio

async def maybe_sync(flag: bool) -> None:
    if flag:
        return
    await asyncio.sleep(1)

async def worker() -> None:
    while True:
        await maybe_sync(True)  # this path never checkpoints

Trio's model makes this easier to reason about because cancellation and yielding are designed around structured scopes and well-defined checkpoints instead of incidental behavior inside callees.
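
Conversely, cancellation does land promptly when the hot loop contains a real checkpoint. A small self-contained demonstration (names invented); without the await asyncio.sleep(0) line, the spinning task would never yield and the cancellation could never be delivered:

```python
import asyncio

async def spin():
    while True:
        await asyncio.sleep(0)  # checkpoint: cancellation is delivered here

async def main():
    task = asyncio.create_task(spin())
    await asyncio.sleep(0.01)   # let spin() run for a bit
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "not cancelled"

result = asyncio.run(main())
```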

Q2(b) Why aiohttp often outperforms httpx for high-concurrency LLM traffic

The fundamental architectural difference is that aiohttp is an asyncio-native client/server stack optimized around one event-loop model, while httpx is a higher-level, more general client built on top of httpcore, h11, and async-backend abstraction that supports both asyncio and Trio.

The most impactful performance factor is the hot-path overhead of pure-Python protocol handling plus abstraction layers in httpx:

  • httpx uses httpcore underneath.
  • httpcore supports both sync and async interfaces and both asyncio and Trio backends.
  • httpx's HTTP/1.1 path depends on h11, which is a pure-Python HTTP protocol implementation.
  • That generality is elegant, but under many concurrent small or streaming responses it means more Python-level work per socket event and per chunk.

By contrast, aiohttp is tightly coupled to asyncio and can use C speedups:

  • it ships an optional aiohttp[speedups] extra with C accelerations
  • it can use the higher-performance llhttp C parser for HTTP parsing
  • it can use aiodns for faster asynchronous DNS resolution

That makes a measurable difference for LLM workloads, where clients often keep many concurrent streaming responses open and process a large number of small chunks.

How each library mitigates the problem:

  • httpx:

    • reuse a single long-lived AsyncClient so connection pooling actually works
    • optionally enable HTTP/2 (http2=True) to reduce connection overhead under high concurrency
    • use explicit transport reuse instead of creating clients in hot loops
    • these reduce avoidable connection/setup cost, but they do not remove the core pure-Python protocol/abstraction overhead
  • aiohttp:

    • reuse one long-lived ClientSession
    • install aiohttp[speedups]
    • build/use the higher-performance llhttp C parser
    • optionally install aiodns
    • these improvements cut work in the hot path itself, which is why the gains are often larger

Bottom line:

For ergonomic application code, httpx is excellent. For extremely high-concurrency, streaming-heavy LLM traffic, aiohttp often wins because it is narrower, lower-level, and more aggressively optimized around asyncio's hottest execution paths.

Q3) Cost Investigation for Claude Code

Q3(a) Why many requests can still bill far below the naive token estimate

At the systems level, this is prompt-prefix caching.

The server computes the transformer state for a reusable prefix once, stores the resulting prefix/KV state, and reuses it when later requests begin with the same token prefix. Billing then distinguishes:

  • normal uncached input tokens
  • cache writes
  • cache reads / cached-input tokens
  • output tokens

Anthropic's prompt caching documentation makes this explicit: cache hits reuse the full prompt prefix up to the cache breakpoint, and subsequent requests bill cache reads instead of full input reprocessing.

Why this matters in agentic workflows:

  • coding agents repeatedly resend the same tool schemas, system instructions, repo context, and conversation history
  • only the final user/tool delta changes from turn to turn
  • if the prefix is cached, both cost and time-to-first-token drop sharply
  • without caching, a supposedly cheaper model can still be expensive because the agent keeps paying for the same giant prompt prefix over and over
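
The serving-side mechanic can be sketched as exact token-prefix matching against stored KV states. This is a toy stand-in (invented function, arbitrary token IDs), not any real server's implementation:

```python
def reusable_prefix_len(request_tokens, cached_prefixes):
    # Toy stand-in for server-side prefix-cache lookup: a stored prefix's KV
    # state is reusable only if the request starts with that exact prefix.
    best = 0
    for prefix in cached_prefixes:
        if len(prefix) > best and request_tokens[:len(prefix)] == prefix:
            best = len(prefix)
    return best

# A repeated agent turn shares the long system/tool prefix, so most of its
# input tokens bill as cache reads instead of fresh computation.
hit = reusable_prefix_len([1, 2, 3, 4, 9], cached_prefixes=[[1, 2, 3, 4]])
```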

Q3(b) What model-design choices make that possible

The key enabler is the autoregressive causal-transformer design itself.

Why:

  • with causal masking, a token's hidden state depends only on earlier tokens
  • that means the computed prefix state is reusable for any later request that starts with the same token sequence
  • the server can safely reuse cached KV states because the suffix has not changed the prefix computation
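
A toy numeric illustration of that suffix independence, with a running integer standing in for per-position KV state:

```python
def causal_states(tokens):
    # Each position's "state" depends only on tokens at or before it,
    # mimicking causal attention; the arithmetic itself is arbitrary.
    states, acc = [], 0
    for t in tokens:
        acc = acc * 31 + t  # stand-in for attending over the prefix
        states.append(acc)
    return states

a = causal_states([1, 2, 3, 9])  # two requests sharing the prefix [1, 2, 3]
b = causal_states([1, 2, 3, 7])
# a[:3] == b[:3]: identical prefix states, so cached KV entries are reusable
```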

In practice, a few model-design choices make this especially useful:

  • decoder-only causal attention
  • deterministic tokenization and prompt formatting
  • stable positional encoding behavior for long prefixes
  • long-context training, which makes very large reusable prefixes valuable in real workloads

This is mostly an architectural consequence of autoregressive transformers, not a special "prompt caching loss." The training choice that matters most is using a causal next-token objective where prefix computation is suffix-independent.

Q3(c) Why the Minimax and Kimi experiments imply different family behavior

Under the simplified pricing assumption of $1 / 1M tokens, the observed savings map directly to effective cached-input reuse:

  • $0.32 less means only about 320,000 input tokens were avoided/discounted across ~10,000 requests, or roughly 32 tokens per request on average
  • $170 less means about 170,000,000 input tokens were avoided/discounted, or roughly 17,000 tokens per request on average
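
The arithmetic behind those figures, under the stated simplified pricing:

```python
TOKENS_PER_DOLLAR = 1_000_000  # simplified assumption: $1 per 1M tokens
REQUESTS = 10_000              # size of the replayed request corpus

def tokens_avoided(dollars_saved):
    # Dollars saved map back to avoided/discounted input tokens.
    return round(dollars_saved * TOKENS_PER_DOLLAR)

minimax_tokens = tokens_avoided(0.32)  # 320,000 total, ~32 per request
kimi_tokens = tokens_avoided(170)      # 170,000,000 total, ~17,000 per request
```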

That difference is enormous. It means:

  • the Kimi path achieved meaningful large-prefix cache hits on repeated Claude Code prompts
  • the Minimax path effectively did not

Because the replayed prompt corpus was held constant, the variable is not the user's behavior. The variable is the family-specific serving path: cache support, achievable cache hit rates, request canonicalization, or token-prefix stability for that model family.

So the measured cost gap is not mainly about base per-token price. It is about whether the serving stack actually converts repeated agent prefixes into cached-input billing.

Q3(d) What should be fixed

The fix should happen in the Claude-compatible serving/integration layer, not in the user prompts.

Specifically:

  1. Fix the Anthropic Messages compatibility path so cache semantics are real, not decorative. Friendli's Messages API currently documents cache_control as "accepted for request portability" but "parsed and not used for generation." If Claude Code is sending Anthropic-style prompt-caching signals, that compatibility layer is not honoring them.

  2. Ensure the GLM-5 serving path actually exposes stable prefix caching and cached-input billing for repeated agent turns. The direct request replay shows the core workload is cache-friendly. The missing piece is consistent cache utilization in the real Claude Code integration path.

  3. Preserve prefix stability in the adapter. Do not inject per-request timestamps, IDs, reordered tool schemas, or other changing metadata into the reusable prefix. Even small mutations can destroy exact-prefix cache hits.

  4. Add replay-based integration tests. Take a real Claude Code trace, replay it through the production adapter, and assert:

    • substantial cache_read_input_tokens after the first request
    • lower billed input than the naive total
    • lower TTFT on repeated-turn requests
  5. Add observability. Surface cache hit rate, cache_read_input_tokens, rendered-prompt hash stability, and billed-input deltas in request logs/metrics so regressions are obvious.
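
Point 3 above is easy to violate by accident. A minimal sketch (field names invented) of how a single per-request field in the rendered prefix defeats exact-prefix matching, while a canonicalized prefix stays byte-identical:

```python
import json

def render_prefix(tool_schemas, request_id=None):
    # Illustrative adapter prefix; field names are invented.
    header = {"tools": sorted(tool_schemas)}  # canonical ordering: stable bytes
    if request_id is not None:
        header["request_id"] = request_id     # per-request mutation in the prefix
    return json.dumps(header, sort_keys=True)

stable_a = render_prefix(["read_file", "write_file"])
stable_b = render_prefix(["write_file", "read_file"])
# identical bytes despite reordered input -> exact-prefix cache hit
mutated = render_prefix(["read_file", "write_file"], request_id="req-123")
# different bytes in the prefix -> exact-prefix cache miss on every request
```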

Q3(e) Design principles for a new cost-efficient coding agent

  1. Keep the static prefix stable. Tool schemas, persistent system instructions, and durable repo context should stay byte-for-byte identical whenever possible.

  2. Send deltas, not full replays. Append the newest observation/tool result instead of reserializing the entire world every turn.

  3. Separate long-lived context from volatile context. Put rarely changing instructions and tool definitions at the front; put ephemeral scratch state at the end.

  4. Treat caching as a first-class product metric. Track cache hit rate, cached-input tokens, TTFT, and billed-input tokens per turn.

  5. Reuse sessions before branching massively. For workflows that fan out, seed the cache first, then launch parallel follow-up requests after the reusable prefix exists.

  6. Use the smallest model/context that can do the job. Planner, executor, summarizer, and reviewer roles should not all inherit the same huge prompt.

  7. Summarize aggressively. Old transcript segments should be compressed into state summaries once their exact wording stops mattering.

  8. Make reasoning optional and budgeted. Only enable expensive reasoning modes for tasks that justify the extra compute and latency.
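
Principles 1 through 3 combine into a simple prompt-assembly shape (all strings illustrative): keep the stable prefix byte-identical, place the summary next, and append only the newest delta:

```python
STABLE_PREFIX = (                      # principle 1: byte-for-byte identical
    "SYSTEM: you are a coding agent\n"
    "TOOLS: read_file, write_file\n"
)

def build_prompt(summary, recent_turns, new_delta):
    # Principle 3: rarely changing content first, ephemeral state last.
    return STABLE_PREFIX + f"SUMMARY: {summary}\n" + "".join(recent_turns) + new_delta

turn1 = build_prompt("repo uses pytest", [], "USER: run tests\n")
turn2 = build_prompt("repo uses pytest",
                     ["USER: run tests\nTOOL: 3 passed\n"],  # principle 2: append the delta
                     "USER: commit\n")
# turn2 shares turn1's prefix through the summary line, so the cached
# KV state for that region is reusable on the second turn.
```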

Assumptions and scope notes

  • I worked only inside this local repository.
  • I did not modify any model weights.
  • The local environment did not have transformers installed, so I validated the fix by repository inspection plus upstream Qwen3 metadata comparison rather than by launching a live server in this workspace.

Reference addendum

Official references I used while writing this submission: