---
library_name: transformers
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B
---
# broken-model (fixed)
HuggingFace Repo: https://huggingface.co/suyashdb/broken-model-fixed/tree/main
## Changes Made
### 1. `README.md` — `base_model` corrected
- **Before:** `meta-llama/Meta-Llama-3.1-8B`
- **After:** `Qwen/Qwen3-8B`
- **Why:** The model architecture (`Qwen3ForCausalLM`), tokenizer class (`Qwen2Tokenizer`), vocabulary size (151936), and all config values exactly match Qwen3-8B, not Llama-3.1-8B. The wrong base_model declaration was misleading but not the functional blocker.
### 2. `tokenizer_config.json` — `chat_template` added
- **Before:** The `chat_template` field was entirely absent from `tokenizer_config.json`.
- **After:** Added the full Jinja2 chat template from the canonical `Qwen/Qwen3-8B` model.
- **Why this broke inference:** OpenAI-compatible inference servers (vLLM, TGI, the FriendliAI engine) apply the tokenizer's chat template (in `transformers`, via `tokenizer.apply_chat_template()`) to convert the `messages` array of a `/chat/completions` request into a single prompt string. Without a `chat_template`, that step fails with an error along the lines of `"No chat template is set for this tokenizer"` (the exact wording depends on the library version), so the server cannot process any chat request. The model weights themselves are intact; only the tokenizer configuration was missing this critical field.
The added template handles:
- System / user / assistant message formatting using `<|im_start|>` / `<|im_end|>` tokens
- Tool call formatting (`<tool_call>` / `<tool_response>`)
- Thinking mode: when `enable_thinking=False` is passed, the template injects `<think>\n\n</think>` to suppress chain-of-thought output
- Multi-turn reasoning content (`reasoning_content` field on assistant messages)
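To make the formatting concrete, here is a minimal pure-Python sketch of what the template produces for plain text messages. It is a simplified stand-in, not the actual Jinja2 template: `render_chatml` is a hypothetical helper, it ignores tool calls and `reasoning_content`, and the exact whitespace after the empty think block is an assumption.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=True):
    """Render plain-text chat messages in ChatML style (simplified sketch)."""
    parts = []
    for msg in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
        if not enable_thinking:
            # An empty think block signals the model to skip chain-of-thought.
            # (Trailing whitespace here is an assumption.)
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is 2+2?"}]
print(render_chatml(messages))
print(render_chatml(messages, enable_thinking=False))
```

The first call should reproduce the prompt shown in the Verification section; the second shows where the empty think block lands when thinking is disabled.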
### 3. Vocab/tokenizer files added
- `vocab.json`, `tokenizer.json`, and `special_tokens_map.json` were uploaded from the canonical `Qwen/Qwen3-8B` model.
- The original broken repo was missing these, making it impossible to load the tokenizer standalone.
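A quick way to spot both problems (missing files, missing `chat_template`) in a local copy of a repo is a small check like the one below. This is a sketch using only the standard library; `find_tokenizer_problems` is a hypothetical helper, and it only checks the layout described in this README. Newer `transformers` releases can also store the template in a separate file, which this sketch does not look for.

```python
import json
from pathlib import Path

# Files this README describes as required for loading the tokenizer.
REQUIRED_FILES = [
    "tokenizer_config.json",
    "vocab.json",
    "tokenizer.json",
    "special_tokens_map.json",
]


def find_tokenizer_problems(repo_dir):
    """Return a list of human-readable problems found in a local model directory."""
    repo = Path(repo_dir)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (repo / name).is_file()
    ]
    config_path = repo / "tokenizer_config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        # An absent or empty template is what broke chat requests.
        if not config.get("chat_template"):
            problems.append("tokenizer_config.json has no chat_template")
    return problems


# Check the current directory; an empty list means nothing was flagged.
print(find_tokenizer_problems("."))
```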
## Verification
You can verify the fix without model weights — just the tokenizer:
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("suyashdb/broken-model-fixed")
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected output:
# <|im_start|>user
# What is 2+2?<|im_end|>
# <|im_start|>assistant
```
---
## Part B — Why `reasoning_effort` Does Nothing
If you've tried passing `reasoning_effort: "low"` or `reasoning_effort: "high"` in your requests and noticed zero difference in the output — you're not imagining it. Here's why.
### The short answer
This model has no idea what `reasoning_effort` means. It was never trained to respond to it.
### The longer answer
`reasoning_effort` is a parameter from OpenAI's API for its o-series reasoning models (o1, o3, o4-mini). The idea is that you can tell the model how hard to think: `"low"` means give me a quick answer, `"high"` means really work through it. OpenAI has not published exactly how those models were trained to honor it, but a plausible recipe, referred to here as budget-forcing, is to give the model a token budget during training and reward it for reaching the right answer within that budget. A model trained that way learns to actually compress or expand its reasoning based on the hint.
Qwen3-8B was not trained that way. It has two modes: thinking (where it produces a `<think>...</think>` block before answering) and non-thinking (where it skips that entirely). That's a binary on/off switch, not a dial. When you send `reasoning_effort: "medium"`, nothing in the Qwen3 chat template reads it, so it never reaches the prompt and the model has nothing to react to. Apart from ordinary sampling randomness, the output is the same regardless of what value you pass.
### What would need to change to make it work
1. The model needs to be retrained with budget-forcing. During fine-tuning, you'd prepend a budget token to each prompt (something like `<budget>512</budget>`) and train the model to produce correct answers within that many tokens. This teaches it to actually reason more efficiently when the budget is tight, rather than just cutting off mid-thought.
2. The inference server needs to translate `reasoning_effort` into a concrete token limit and either inject it into the prompt in a format the model understands, or hard-stop the `<think>` block after N tokens by force-injecting `</think>`. The second approach is blunt — it truncates reasoning but doesn't make the model reason smarter.
3. The API layer (whatever sits between the client and the model) needs to map `"low" / "medium" / "high"` to actual numbers and pass them through correctly. A serving stack without such a mapping typically drops or ignores the parameter, so the request succeeds but the setting has no effect.
4. Realistically, the easiest path is to use a model that already supports this natively — like a Qwen3 variant served through FriendliAI's serverless API which exposes `max_thinking_tokens`, or OpenAI's o-series which was purpose-built for `reasoning_effort`. Retrofitting budget-forcing onto an existing model requires retraining, not just a config change.
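
Steps 2 and 3 can be sketched in a few lines. The code below is illustrative only: the budget numbers, `effort_to_budget`, and `cap_thinking` are hypothetical, and it operates on a list of already-decoded token strings rather than hooking into a real decoding loop. In an actual server, the cap would be enforced during generation, and the model would keep generating its answer after the injected `</think>`.

```python
# Hypothetical mapping; the numbers are illustrative, not tuned values.
EFFORT_TO_THINK_BUDGET = {"low": 256, "medium": 1024, "high": 4096}
THINK_CLOSE = "</think>"


def effort_to_budget(reasoning_effort):
    """Translate an API-level effort label into a thinking-token budget."""
    try:
        return EFFORT_TO_THINK_BUDGET[reasoning_effort]
    except KeyError:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}") from None


def cap_thinking(thinking_tokens, budget):
    """Keep at most `budget` thinking tokens, then force-close the think block.

    `thinking_tokens` is an iterable of token strings generated after `<think>`.
    Returns (tokens_to_keep, forced), where `forced` is True if the close tag
    was injected rather than produced by the model.
    """
    kept = []
    for tok in thinking_tokens:
        if tok == THINK_CLOSE:
            return kept + [THINK_CLOSE], False  # model closed on its own
        if len(kept) == budget:
            return kept + [THINK_CLOSE], True   # budget spent: inject the close tag
        kept.append(tok)
    return kept, False  # stream ended before the block was closed


budget = effort_to_budget("low")
simulated = ["step"] * 300 + [THINK_CLOSE]  # stand-in for a long reasoning trace
kept, forced = cap_thinking(simulated, budget)
print(len(kept), forced)  # 257 True
```

As noted in step 2, this kind of hard stop only truncates the reasoning. It does nothing to make the model reason more efficiently within the budget; that part needs the retraining described in step 1.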