Model: tossem/friendli-broken-model-fix
Source: Original Platform

| library_name | pipeline_tag | base_model |
|---|---|---|
| transformers | text-generation | Qwen/Qwen3-8B |
FriendliAI Take-Home Submission
Summary
This repository fixes a configuration issue that prevented the model from supporting /chat/completions. The root cause was a missing chat_template in tokenizer_config.json, which is required for rendering structured chat messages into the prompt format expected by Qwen3.
The fix adds the upstream Qwen3-8B chat_template and corrects model metadata. No model weights were modified.
Q1) Debugging LLMs
Q1(a) Root cause and fix
I inspected config.json, generation_config.json, tokenizer_config.json, tokenizer.json, and README.md.
What I found:
- `config.json` identifies the checkpoint as `Qwen3ForCausalLM` with `model_type: "qwen3"`.
- `generation_config.json` is a normal Qwen3 sampling config and is not the blocker.
- `tokenizer.json` contains the expected Qwen special tokens (`<|im_start|>`, `<|im_end|>`, `<think>`, `</think>`, etc.) and did not require changes.
- `tokenizer_config.json` was missing the `chat_template` field entirely.
- `README.md` incorrectly claimed the base model was `meta-llama/Meta-Llama-3.1-8B`, even though the repository is clearly a Qwen3 checkpoint.
The runtime failure is caused by the missing `chat_template` in `tokenizer_config.json`.
Why that breaks /chat/completions:
- OpenAI-style chat servers first need to render structured messages into the raw prompt text expected by the model.
- For Hugging Face chat models, that rendering logic comes from `tokenizer.chat_template`.
- Qwen3 expects ChatML-style formatting plus Qwen3-specific handling for reasoning/tool use.
- Without the template, the server has no model-specific way to convert `messages=[...]` into the correct prompt string, so chat inference fails before decoding starts.
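To make the failure concrete, here is a minimal sketch of the rendering step a chat server performs (assuming `transformers` is installed; exact error behavior on a missing template varies across library versions, so this is illustrative):

```python
from transformers import AutoTokenizer

# Load the repaired tokenizer config for this repository.
tokenizer = AutoTokenizer.from_pretrained("tossem/friendli-broken-model-fix")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Before the fix, tokenizer.chat_template is None: depending on the
# transformers version this call raises a ValueError or falls back to a
# generic template that does not match Qwen3's expected format.
# After the fix, it returns a ChatML-style <|im_start|>...<|im_end|> prompt.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```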
Minimal Fix Applied
Only two changes were required to restore chat functionality:
| File | Field | Old value | New value | Why |
|---|---|---|---|---|
| `tokenizer_config.json` | `chat_template` | Absent | Exact upstream `Qwen/Qwen3-8B` template | Required so chat messages can be rendered into the prompt format Qwen3 expects. |
| `README.md` | `base_model` | `meta-llama/Meta-Llama-3.1-8B` | `Qwen/Qwen3-8B` | Documentation correction so the model card matches the actual checkpoint lineage. |
Runtime-critical change (minimal fix):
In `tokenizer_config.json`:

```diff
- "chat_template": <missing>
+ "chat_template": "<exact upstream Qwen/Qwen3-8B template>"
```
Files I intentionally left unchanged:
- `config.json`
- `generation_config.json`
- `tokenizer.json`
Reason: they were not the root cause of the chat failure, and changing them would have gone beyond the minimal fix.
Why the fix was necessary
Qwen3 chat inference is template-driven. The weights were fine; the tokenizer vocabulary was fine; the generation defaults were fine. The only missing runtime contract was the chat template that tells the serving layer how to serialize roles, assistant reasoning, tool calls, and the enable_thinking switch into Qwen3's expected prompt format.
Q1(b) Why reasoning_effort currently has no observable effect
Root Cause Analysis: reasoning_effort
The `reasoning_effort` parameter flows through a three-layer pipeline, and this model breaks at every layer.
How the pipeline is supposed to work:

```
Client sends reasoning_effort="high"
        │
        ▼
[Layer 1] Inference framework (vLLM/TGI) translates
          → enable_thinking=True + thinking_budget=N tokens
        │
        ▼
[Layer 2] Chat template (Jinja in tokenizer_config.json) reads enable_thinking
          → controls whether <think></think> is pre-filled empty (skip reasoning)
            or left open for the model to fill
        │
        ▼
[Layer 3] Model weights' trained behavior generates coherent <think>…</think>
          blocks and scales reasoning depth proportionally to the budget
```
Layer 2: Chat template (originally broken, fixed in Q1(a)). This is the only mechanism by which `enable_thinking` controls output: when `False`, an empty `<think></think>` block is pre-filled, signaling the model to skip reasoning and answer directly. When `True` (or unset), the block is left open for the model to generate. broken-model's original `tokenizer_config.json` had no `chat_template` field at all; the template added in Q1(a) resolves this layer.
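A short sketch of that switch, assuming the upstream Qwen/Qwen3-8B template is in place (`enable_thinking` is the Qwen3 template's documented kwarg):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "What is 17 * 23?"}]

# enable_thinking=False: the template pre-fills an empty <think></think>
# block after the assistant header, so the model answers directly.
no_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# enable_thinking=True (the default): the prompt ends after the assistant
# header, leaving the model free to open its own <think> block.
think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

assert "<think>" in no_think and "</think>" in no_think
assert "<think>" not in think  # nothing pre-filled; the model decides
```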
Layer 3: Model weights (broken, and the deepest root cause). Even if Layers 1 and 2 were fixed perfectly, the weights cannot respond. Thinking in Qwen3 is an emergent trained capability, not a prompt trick:

- Genuine Qwen3 weights were trained on large-scale chain-of-thought data with `<think>…</think>` structure, so the model learned to generate reasoning content inside those tags and to self-regulate its depth.
- `reasoning_effort` maps to a `thinking_budget` (token count). Enforcement requires the serving framework to count tokens inside the thinking span and force-insert `</think>` when the budget is exhausted; the model must then have been trained to pivot to a final answer at that point.
- broken-model's weights come from Llama-3.1-8B, which has zero training signal for any of this. The `<think>` token (ID 151667) is in the vocabulary (inherited from the Qwen3 tokenizer) but is completely meaningless to the Llama weight matrices. The model has no learned association between that token and "begin internal reasoning", no learned behavior to produce useful thoughts, and no learned response to a budget cutoff.
In short: the template fix from Q1(a) addressed only a surface symptom. The model is fundamentally incapable of supporting `reasoning_effort` until requirement 5 (weight replacement with a genuine Qwen3 checkpoint) is met, after which requirements 2–4 must also be verified.
Q2) Python Asyncio
Q2(a) Two pros and two cons of asyncio's await-yielding design vs Trio
Pro 1: cheap fast paths when the awaited work is already available
In asyncio, await does not guarantee a context switch. If the awaited coroutine returns without hitting its own suspension point, the caller continues immediately. That is useful for cache-heavy paths.
```python
import asyncio

cache = {"answer": 42}

async def get_value(key: str) -> int:
    if key in cache:
        return cache[key]  # fast path: no checkpoint, the caller is never suspended
    await asyncio.sleep(0.050)  # stand-in for real I/O
    return -1
```
Compared to Trio, this is a nice performance property, but it comes with a fairness tradeoff: the caller cannot assume that await get_value(...) actually yielded to the scheduler.
Pro 2: very easy interop with existing Future/callback-style asyncio code
Asyncio's await model fits naturally with the ecosystem built around Future, callbacks, and event-loop primitives.
```python
import asyncio

async def from_legacy_future() -> str:
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    # Callback-style code resolves the Future; await consumes it directly.
    loop.call_soon(fut.set_result, "ready")
    return await fut
```
That incremental migration story is smoother than Trio's because asyncio was designed around those native loop objects from the start.
Con 1: fairness is fragile because await may not actually yield
This is the most important downside. A coroutine can look cooperative while still monopolizing the loop if its awaited callees do not suspend.
```python
import asyncio

async def cached_op() -> int:
    return 1  # returns immediately, so this is not a checkpoint

async def hog() -> None:
    for i in range(1_000_000):
        await cached_op()  # looks cooperative, but never yields
        if i % 1000 == 0:
            await asyncio.sleep(0)  # manual fairness checkpoint

async def heartbeat() -> None:
    while True:
        print("tick")
        await asyncio.sleep(0.1)
```
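A minimal driver makes the tradeoff observable (a sketch using the definitions above; delete the `asyncio.sleep(0)` checkpoint in `hog` and `heartbeat` stops ticking until the loop finishes):

```python
async def main() -> None:
    ticker = asyncio.create_task(heartbeat())
    await hog()  # heartbeat only runs when hog reaches a real checkpoint
    ticker.cancel()

asyncio.run(main())
```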
Trio is better here conceptually because it leans much harder into explicit checkpoints and structured concurrency, so fairness and cancellation behavior are easier to reason about.
Con 2: cancellation responsiveness depends on where real checkpoints happen
In asyncio, cancellation is delivered at suspension points. If your code accidentally avoids checkpoints, cancellation latency becomes unpredictable.
```python
import asyncio

async def maybe_sync(flag: bool) -> None:
    if flag:
        return  # synchronous fast path: no suspension point
    await asyncio.sleep(1)

async def worker() -> None:
    while True:
        await maybe_sync(True)  # this path never checkpoints
```
Trio's model makes this easier to reason about because cancellation and yielding are designed around structured scopes and well-defined checkpoints instead of incidental behavior inside callees.
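A common asyncio-side mitigation is to guarantee one real checkpoint per iteration so cancellation has somewhere to land (a sketch reusing `maybe_sync` from above):

```python
async def worker_cancellable() -> None:
    while True:
        await maybe_sync(True)
        await asyncio.sleep(0)  # explicit checkpoint: cancellation lands here

async def main() -> None:
    task = asyncio.create_task(worker_cancellable())
    await asyncio.sleep(0.1)
    task.cancel()  # delivered promptly at the sleep(0) checkpoint
    try:
        await task
    except asyncio.CancelledError:
        print("cancelled cleanly")

asyncio.run(main())
```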
Q2(b) Why aiohttp often outperforms httpx for high-concurrency LLM traffic
The fundamental architectural difference is that aiohttp is an asyncio-native client/server stack optimized around one event-loop model, while httpx is a higher-level, more general client built on top of httpcore, h11, and async-backend abstraction that supports both asyncio and Trio.
The most impactful performance factor is the hot-path overhead of pure-Python protocol handling plus abstraction layers in httpx:
- `httpx` uses `httpcore` underneath.
- `httpcore` supports both sync and async interfaces and both asyncio and Trio backends.
- `httpx`'s HTTP/1.1 path depends on `h11`, which is a pure-Python HTTP protocol implementation.
- That generality is elegant, but under many concurrent small or streaming responses it means more Python-level work per socket event and per chunk.
By contrast, aiohttp is tightly coupled to asyncio and can use C speedups:
- its docs expose `aiohttp[speedups]`
- its build/docs mention a higher-performance C `llhttp` parser
- it can also use `aiodns` to reduce DNS overhead
That makes a measurable difference for LLM workloads, where clients often keep many concurrent streaming responses open and process a large number of small chunks.
How each library mitigates the problem (a session-reuse sketch follows the list):

- `httpx`:
  - reuse a single long-lived `AsyncClient` so connection pooling actually works
  - optionally enable HTTP/2 (`http2=True`) to reduce connection overhead under high concurrency
  - use explicit transport reuse instead of creating clients in hot loops
  - these reduce avoidable connection/setup cost, but they do not remove the core pure-Python protocol/abstraction overhead
- `aiohttp`:
  - reuse one long-lived `ClientSession`
  - install `aiohttp[speedups]`
  - build/use the higher-performance `llhttp` C parser
  - optionally install `aiodns`
  - these improvements cut work in the hot path itself, which is why the gains are often larger
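A minimal sketch of the shared pattern on the aiohttp side (the endpoint URL and payload shape are placeholders; the point is one long-lived session shared across many concurrent streams):

```python
import asyncio
import aiohttp

URL = "https://example.com/v1/completions"  # hypothetical endpoint

async def stream_one(session: aiohttp.ClientSession, prompt: str) -> int:
    n = 0
    # Streaming read: every chunk costs protocol work in the hot path.
    async with session.post(URL, json={"prompt": prompt, "stream": True}) as resp:
        async for chunk in resp.content.iter_chunked(1024):
            n += len(chunk)
    return n

async def main() -> None:
    # One long-lived session: connections and parser state are reused.
    async with aiohttp.ClientSession() as session:
        totals = await asyncio.gather(
            *(stream_one(session, f"prompt-{i}") for i in range(100))
        )
    print(sum(totals))

asyncio.run(main())
```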
Bottom line:
For ergonomic application code, httpx is excellent. For extremely high-concurrency, streaming-heavy LLM traffic, aiohttp often wins because it is narrower, lower-level, and more aggressively optimized around asyncio's hottest execution paths.
Q3) Cost Investigation for Claude Code
Q3(a) Why many requests can still bill far below the naive token estimate
At the systems level, this is prompt-prefix caching.
The server computes the transformer state for a reusable prefix once, stores the resulting prefix/KV state, and reuses it when later requests begin with the same token prefix. Billing then distinguishes:
- normal uncached input tokens
- cache writes
- cache reads / cached-input tokens
- output tokens
Anthropic's prompt caching documentation makes this explicit: cache hits reuse the full prompt prefix up to the cache breakpoint, and subsequent requests bill cache reads instead of full input reprocessing.
Why this matters in agentic workflows:
- coding agents repeatedly resend the same tool schemas, system instructions, repo context, and conversation history
- only the final user/tool delta changes from turn to turn
- if the prefix is cached, both cost and time-to-first-token drop sharply
- without caching, a supposedly cheaper model can still be expensive because the agent keeps paying for the same giant prompt prefix over and over
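A hedged sketch of the Anthropic-style cache breakpoint (request shape per the prompt-caching docs linked below; the model ID, system text, and prompt content are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<large, stable system prompt, tool schemas, repo context>",
            # Cache breakpoint: the prefix up to here is written to the cache
            # once, then billed as cache reads on identical later requests.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Fix the failing test."}],
)

# Usage distinguishes cache writes, cache reads, and normal input tokens.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens,
      response.usage.input_tokens)
```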
Q3(b) What model-design choices make that possible
The key enabler is the autoregressive causal-transformer design itself.
Why:
- with causal masking, a token's hidden state depends only on earlier tokens
- that means the computed prefix state is reusable for any later request that starts with the same token sequence
- the server can safely reuse cached KV states because the suffix has not changed the prefix computation
In practice, a few model-design choices make this especially useful:
- decoder-only causal attention
- deterministic tokenization and prompt formatting
- stable positional encoding behavior for long prefixes
- long-context training, which makes very large reusable prefixes valuable in real workloads
This is mostly an architectural consequence of autoregressive transformers, not a special "prompt caching loss." The training choice that matters most is using a causal next-token objective where prefix computation is suffix-independent.
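A toy illustration of that prefix-invariance property (a pure-Python stand-in for causal state computation, not a real transformer):

```python
def causal_states(tokens: list[int]) -> list[int]:
    # Toy "layer": each position's state depends only on tokens[0..i],
    # mirroring causal masking in a decoder-only transformer.
    states, acc = [], 0
    for t in tokens:
        acc = (acc * 31 + t) % 10**9
        states.append(acc)
    return states

prefix = [5, 2, 9]
a = causal_states(prefix + [1, 1])
b = causal_states(prefix + [7, 4, 4])

# The prefix states are identical regardless of the suffix, so a server
# can cache them (the KV-cache analogue) and reuse them across requests.
assert a[:3] == b[:3]
```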
Q3(c) Why the Minimax and Kimi experiments imply different family behavior
Under the simplified pricing assumption of $1 / 1M tokens, the observed savings map directly to effective cached-input reuse:
- `$0.32` less means only about `320,000` input tokens were avoided/discounted across `~10,000` requests, or roughly `32` tokens per request on average
- `$170` less means about `170,000,000` input tokens were avoided/discounted, or roughly `17,000` tokens per request on average
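Spelled out under the stated $1 / 1M-token simplification:

```python
PRICE_PER_TOKEN = 1 / 1_000_000  # simplified: $1 per million tokens
REQUESTS = 10_000

for savings_usd in (0.32, 170.0):
    tokens = savings_usd / PRICE_PER_TOKEN
    print(f"${savings_usd}: {tokens:,.0f} tokens avoided "
          f"(~{tokens / REQUESTS:,.0f} per request)")
```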
That difference is enormous. It means:
- the Kimi path achieved meaningful large-prefix cache hits on repeated Claude Code prompts
- the Minimax path effectively did not
Because the replayed prompt corpus was held constant, the variable is not the user's behavior. The variable is the family-specific serving path: cache support, cache hitability, request canonicalization, or token-prefix stability for that model family.
So the measured cost gap is not mainly about base per-token price. It is about whether the serving stack actually converts repeated agent prefixes into cached-input billing.
Q3(d) What should be fixed
The fix should happen in the Claude-compatible serving/integration layer, not in the user prompts.
Specifically:
- Fix the Anthropic Messages compatibility path so cache semantics are real, not decorative. Friendli's Messages API currently documents `cache_control` as "accepted for request portability" but "parsed and not used for generation." If Claude Code is sending Anthropic-style prompt-caching signals, that compatibility layer is not honoring them.
- Ensure the GLM-5 serving path actually exposes stable prefix caching and cached-input billing for repeated agent turns. The direct request replay shows the core workload is cache-friendly. The missing piece is consistent cache utilization in the real Claude Code integration path.
- Preserve prefix stability in the adapter. Do not inject per-request timestamps, IDs, reordered tool schemas, or other changing metadata into the reusable prefix. Even small mutations can destroy exact-prefix cache hits.
- Add replay-based integration tests (a sketch follows this list). Take a real Claude Code trace, replay it through the production adapter, and assert:
  - substantial `cache_read_input_tokens` after the first request
  - lower billed input than the naive total
  - lower TTFT on repeated-turn requests
- Add observability. Surface cache hit rate, `cache_read_input_tokens`, rendered-prompt hash stability, and billed-input deltas in request logs/metrics so regressions are obvious.
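A hedged sketch of such a replay test (the `replay_trace` helper, its trace path, and the per-turn fields are hypothetical; pytest-asyncio is assumed):

```python
import pytest

from adapter_test_utils import replay_trace  # hypothetical test helper

@pytest.mark.asyncio
async def test_repeated_turns_hit_prefix_cache():
    # Replay a recorded Claude Code trace through the production adapter.
    turns = await replay_trace("traces/claude_code_session.jsonl")
    first, rest = turns[0], turns[1:]

    for turn in rest:
        # After the first request, repeated prefixes should bill as cache reads.
        assert turn.usage.cache_read_input_tokens > 0
        # Billed input should undercut the naive full-prompt token count.
        assert turn.billed_input_tokens < turn.naive_input_tokens
        # Cache hits should also show up as lower time-to-first-token.
        assert turn.ttft < first.ttft
```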
Q3(e) Design principles for a new cost-efficient coding agent
- Keep the static prefix stable. Tool schemas, persistent system instructions, and durable repo context should stay byte-for-byte identical whenever possible (a prompt-builder sketch follows this list).
- Send deltas, not full replays. Append the newest observation/tool result instead of reserializing the entire world every turn.
- Separate long-lived context from volatile context. Put rarely changing instructions and tool definitions at the front; put ephemeral scratch state at the end.
- Treat caching as a first-class product metric. Track cache hit rate, cached-input tokens, TTFT, and billed-input tokens per turn.
- Reuse sessions before branching massively. For workflows that fan out, seed the cache first, then launch parallel follow-up requests after the reusable prefix exists.
- Use the smallest model/context that can do the job. Planner, executor, summarizer, and reviewer roles should not all inherit the same huge prompt.
- Summarize aggressively. Old transcript segments should be compressed into state summaries once their exact wording stops mattering.
- Make reasoning optional and budgeted. Only enable expensive reasoning modes for tasks that justify the extra compute and latency.
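A minimal sketch of the stable-prefix discipline from the first two principles (all names are illustrative; the invariant is that the static part stays byte-identical across turns):

```python
import hashlib

# Never mutated per request: tool schemas serialized in a fixed, sorted
# order, persistent system instructions, durable repo context.
STATIC_PREFIX = (
    "<system instructions>\n"
    "<tool schemas, fixed order>\n"
    "<durable repo context>\n"
)

def build_prompt(turn_deltas: list[str]) -> str:
    # Volatile content is only ever appended after the static prefix, so
    # exact-prefix caches keep hitting on every turn.
    return STATIC_PREFIX + "\n".join(turn_deltas)

def prefix_fingerprint() -> str:
    # Log this per request: if the hash ever changes, cache hits were lost.
    return hashlib.sha256(STATIC_PREFIX.encode()).hexdigest()
```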
Assumptions and scope notes
- I worked only inside this local repository.
- I did not modify any model weights.
- The local environment did not have `transformers` installed, so I validated the fix by repository inspection plus upstream Qwen3 metadata comparison rather than by launching a live server in this workspace.
Reference addendum
Official references I used while writing this submission:
- Qwen3 model card: https://huggingface.co/Qwen/Qwen3-8B
- Qwen3 upstream tokenizer config: https://huggingface.co/Qwen/Qwen3-8B/blob/main/tokenizer_config.json
- Anthropic prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Friendli Messages API: https://friendli.ai/docs/openapi/serverless/messages
- Friendli reasoning guide: https://friendli.ai/docs/guides/reasoning
- Friendli serverless pricing: https://friendli.ai/docs/guides/serverless_endpoints/pricing
- HTTPX async support: https://www.python-httpx.org/async/
- HTTPX dependencies: https://www.python-httpx.org/
- HTTPCore README: https://github.com/encode/httpcore
- aiohttp docs: https://docs.aiohttp.org/en/stable/
- aiohttp LLHTTP notes: https://docs.aiohttp.org/en/v3.12.9/contributing.html
- h11 documentation: https://h11.readthedocs.io/


