---
language:
- en
license: apache-2.0
base_model: unsloth/Qwen3-8B
tags:
- reasoning
- qwen3
- lora
- unsloth
- distillation
- claude-opus
- chain-of-thought
datasets:
- Crownelius/Opus-4.6-Reasoning-3300x
- Jackrong/Qwen3.5-reasoning-700x
model-index:
- name: Qwen3-8B-OpusReasoning
  results: []
pipeline_tag: text-generation
---

# Qwen3-8B-OpusReasoning

## Model Overview

A reasoning-enhanced version of [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), fine-tuned via supervised knowledge distillation from **Claude Opus 4.6** reasoning traces. The goal is not token-level imitation of Opus output, but transfer of its **reasoning structure and problem-solving style** into a compact 8B model that can run locally.

The model outputs structured chain-of-thought inside `<think>...</think>` tags before generating the final answer, following the Qwen3 thinking-mode convention.

- **Base model:** [unsloth/Qwen3-8B](https://huggingface.co/unsloth/Qwen3-8B)
- **Teacher model:** Claude Opus 4.6 (reasoning traces, distilled)
- **Training type:** Supervised Fine-Tuning (SFT) + LoRA → merged bf16
- **Framework:** [Unsloth](https://github.com/unslothai/unsloth) 2026.4.5 + TRL SFTTrainer
- **Precision:** bfloat16
- **Hardware:** 1x NVIDIA A100-SXM4-80GB

## Training Data

| Dataset | Samples | Role |
|---|---|---|
| [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | 3,300 | Main distillation: Claude Opus 4.6 reasoning traces |
| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | 700 | Auxiliary: supporting reasoning diversity |
| **Total** | **4,000** | |

### Data Characteristics

- Long-form chain-of-thought supervision (`<think>...</think>`)
- Diverse reasoning domains: math, logic, code, analytical QA
- High-quality Opus 4.6 teacher traces: carefully curated, no noisy labels
- Conversation format compatible with the Qwen3 chat template

## Training Pipeline

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |

### Hyperparameters

| Parameter | Value |
|---|---|
| Effective batch size | 1 × 16 = 16 |
| Learning rate | 5e-5 |
| LR scheduler | Cosine |
| Epochs | 3 |
| Max sequence length | 16,384 |
| Optimizer | AdamW 8-bit |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth |

### Training Results

- **Final train loss:** 0.6953
- **Runtime:** ~153 minutes on A100-80GB

## Distillation Philosophy

We distill **reasoning structure**, not surface tokens. Specifically, the model is encouraged to acquire:

- **Explicit problem decomposition:** break complex questions into sub-goals
- **Assumption checking:** state what's given, what's unknown, and verify constraints
- **Step-by-step derivation:** one logical step per line, no skipped algebra
- **Reflection & backtracking:** recognize dead-ends and revise rather than plow forward
- **Clean answer construction:** separate `<think>` scratch work from the final user-facing answer

This follows the "Claude Opus style" of reasoning: deliberative, self-critical, and structurally transparent.

## Reasoning Scaffold (Learned Pattern)

After fine-tuning, the model tends to produce reasoning traces with this shape:

1. **Restate and parse the task:** identify exactly what is being asked
2. **Plan:** list the approach or sub-problems
3. **Work through each step:** show algebra, logic, or code reasoning explicitly
4. **Verify:** sanity-check the intermediate results before committing
5. **Construct the final answer:** separate, clean, user-facing summary

## Expected Improvements

In practice, the gain is not a dramatic capability jump over the base Qwen3-8B, but rather:

- **Improved stability** in multi-step reasoning
- **Structured, readable traces** instead of rambling CoT
- **Better instruction adherence** when a problem has constraints
- **Fewer hallucinated intermediate steps** thanks to Opus-style self-verification

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NhatCuong22/Qwen3-8B-OpusReasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "If a train travels 120km in 2 hours, stops for 30 minutes, then travels 90km in 1.5 hours, what is the average speed for the entire journey?"}
]

# Render the chat template as text, then open the reasoning block by hand
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
text += "<think>\n"  # Activate thinking mode

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens; special tokens are kept visible
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)
```

### Unsloth (faster inference)

```python
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "NhatCuong22/Qwen3-8B-OpusReasoning",
    max_seq_length=16384,
    dtype=torch.bfloat16,
    load_in_4bit=False,  # keep the merged bf16 weights; Unsloth loads in 4-bit by default
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode
```

### Recommended Generation Parameters

| Parameter | Reasoning tasks | Creative tasks |
|---|---|---|
| temperature | 0.6 | 0.8 |
| top_p | 0.95 | 0.95 |
| max_new_tokens | 2048-4096 | 1024-2048 |
| repetition_penalty | 1.0 | 1.05 |

## Model Architecture

| Parameter | Value |
|---|---|
| Parameters | ~8B |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| Intermediate size | 12,288 |
| Max position embeddings | 40,960 |
| Vocabulary size | 151,936 |
| Precision | bfloat16 |

## Evaluation

Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (5-shot, bf16, A100-80GB).

### Results vs base Qwen3-8B

| Benchmark | Metric | Qwen3-8B-OpusReasoning | Base Qwen3-8B | Δ (percentage points) |
|---|---|---|---|---|
| **MMLU** | accuracy | **75.40%** | 72.93% | **+2.47** ✅ |
| **ARC-Challenge** | acc_norm | **65.87%** | 56.74% | **+9.13** ✅✅ |
| **ARC-Challenge** | accuracy | 64.42% | n/a | n/a |
| **HellaSwag** | acc_norm | **76.98%** | 74.91% | **+2.07** ✅ |
| **HellaSwag** | accuracy | 58.32% | n/a | n/a |
| **GSM8K** | exact_match (strict) | 86.20% | 88.60% | -2.40 |
| **GSM8K** | exact_match (flexible) | 86.66% | n/a | n/a |

### Analysis

- **MMLU +2.47, ARC-Challenge +9.13, HellaSwag +2.07:** the model retains and slightly improves general knowledge and commonsense reasoning after reasoning-focused fine-tuning, without catastrophic forgetting.
- **ARC-Challenge +9.13** is a strong signal that Opus-style structured reasoning transfers well to scientific reasoning tasks.
- **GSM8K -2.40** is a minor regression, likely due to longer `<think>` traces being occasionally truncated by the default `max_gen_toks`. The model still scores 86.20% (strict) to 86.66% (flexible) on grade-school math.

More rigorous reasoning benchmarks (MMLU-Pro, MATH-Hard, AIME, IFEval, MuSR) are being evaluated and will be added here.
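
The truncation hypothesis for GSM8K can be checked on individual generations: a trace that ran out of tokens never reaches its closing `</think>` tag. The sketch below is a minimal illustration under that assumption. `split_reasoning` is a hypothetical helper, not part of this model's tooling or of lm-evaluation-harness, and it assumes every response was generated in thinking mode.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(response: str):
    """Split a decoded response into (trace, answer, truncated).

    Hypothetical helper: assumes the response was generated in thinking
    mode, so a missing closing tag means the trace was cut off.
    """
    # The opening tag may be absent when it was supplied as part of the prompt.
    body = response.split(THINK_OPEN, 1)[-1]
    if THINK_CLOSE not in body:
        # No closing tag: generation stopped before the final answer.
        return body.strip(), "", True
    trace, answer = body.split(THINK_CLOSE, 1)
    return trace.strip(), answer.strip(), False


complete = "<think>\n120 + 90 = 210 km over 4 h\n</think>\n\n52.5 km/h"
cut_off = "120 + 90 = 210 km over"

print(split_reasoning(complete))  # ('120 + 90 = 210 km over 4 h', '52.5 km/h', False)
print(split_reasoning(cut_off))   # ('120 + 90 = 210 km over', '', True)
```

Counting how often `truncated` is `True` across logged GSM8K samples would show whether a larger generation budget is worth testing. Special tokens left in the decoded text are passed through unchanged.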
## Best Suited For

- Mathematical problem solving (arithmetic, algebra, word problems)
- Logical reasoning and deduction
- Code generation with explanation
- Multi-step analytical question answering
- Instruction-following tasks with constraints
- Offline / on-prem reasoning assistants (bf16 weights alone occupy roughly 16 GB, about 8B parameters × 2 bytes, so a 16 GB GPU leaves almost no headroom for the KV cache)

## Limitations & Intended Use

- **Scale of supervision:** fine-tuned on only ~4K samples, so gains are stylistic and structural, not broad knowledge expansion
- **Hallucination risk:** reasoning traces may confidently cite non-existent facts; verify external claims
- **Opus-style bias:** inherits tendencies of the teacher (e.g., verbosity, occasional over-hedging)
- **Language:** primarily English training data
- **Not verified for safety-critical use:** research and learning only
- **Base model license constraints:** follow the Qwen3 upstream license in commercial settings

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen): Qwen3-8B base model
- [Unsloth](https://github.com/unslothai/unsloth): efficient fine-tuning kernels
- [Anthropic](https://www.anthropic.com/): Claude Opus reasoning (teacher)
- Dataset authors: [Crownelius](https://huggingface.co/Crownelius), [Jackrong](https://huggingface.co/Jackrong)

## Citation

```bibtex
@misc{qwen3-8b-opusreasoning,
  title = {Qwen3-8B-OpusReasoning},
  author = {Vo Nhat Cuong},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NhatCuong22/Qwen3-8B-OpusReasoning}}
}
```

## License

Apache 2.0 (inherits from Qwen3-8B upstream).