---
language:
- en
license: apache-2.0
base_model: unsloth/Qwen3-8B
tags:
- reasoning
- qwen3
- lora
- unsloth
- distillation
- claude-opus
- chain-of-thought
datasets:
- Crownelius/Opus-4.6-Reasoning-3300x
- Jackrong/Qwen3.5-reasoning-700x
model-index:
- name: Qwen3-8B-OpusReasoning
  results: []
pipeline_tag: text-generation
---

# Qwen3-8B-OpusReasoning

## Model Overview

A reasoning-enhanced version of [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), fine-tuned via supervised knowledge distillation from **Claude Opus 4.6** reasoning traces.

The goal is not token-level imitation of Opus output, but transfer of its **reasoning structure and problem-solving style** into a compact 8B model that can run locally. The model outputs structured chain-of-thought inside `<think>...</think>` tags before generating the final answer, following the Qwen3 thinking-mode convention.

- **Base model:** [unsloth/Qwen3-8B](https://huggingface.co/unsloth/Qwen3-8B)
- **Teacher model:** Claude Opus 4.6 (reasoning traces, distilled)
- **Training type:** Supervised Fine-Tuning (SFT) + LoRA → merged bf16
- **Framework:** [Unsloth](https://github.com/unslothai/unsloth) 2026.4.5 + TRL SFTTrainer
- **Precision:** bfloat16
- **Hardware:** 1x NVIDIA A100-SXM4-80GB

## Training Data

| Dataset | Samples | Role |
|---|---|---|
| [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | 3,300 | Main distillation: Claude Opus 4.6 reasoning traces |
| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | 700 | Auxiliary: supporting reasoning diversity |
| **Total** | **4,000** | |

### Data Characteristics

- Long-form chain-of-thought supervision (`<think>...</think>`)
- Diverse reasoning domains: math, logic, code, analytical QA
- High-quality Opus 4.6 teacher traces: carefully curated, no noisy labels
- Conversation format compatible with the Qwen3 chat template

## Training Pipeline

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
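
To give a sense of how large this adapter is, the sketch below estimates the number of trainable LoRA parameters from the rank and target modules above, using the dimensions listed in the Model Architecture table. The per-head dimension (`HEAD_DIM = 128`) is an assumption, since this card does not state it, and the helper name `lora_params_per_layer` is purely illustrative. Each adapted linear layer of shape `in × out` adds `r × (in + out)` parameters.

```python
# Dimensions from the Model Architecture table of this card.
HIDDEN = 4096
INTERMEDIATE = 12288
LAYERS = 36
Q_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128  # ASSUMPTION: not stated in this card

RANK = 64
ALPHA = 128

# (in_features, out_features) of each targeted projection in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(rank: int) -> int:
    """Each adapter adds matrix A (rank x in) and matrix B (out x rank)."""
    return sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())


trainable = LAYERS * lora_params_per_layer(RANK)
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter holds roughly 175M trainable parameters, about 2% of the ~8B base weights, and the update is scaled by `alpha / r = 2`.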

### Hyperparameters

| Parameter | Value |
|---|---|
| Effective batch size | 1 (per-device batch) × 16 (gradient accumulation) = 16 |
| Learning rate | 5e-5 |
| LR scheduler | Cosine |
| Epochs | 3 |
| Max sequence length | 16,384 |
| Optimizer | AdamW 8-bit |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth |
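
The learning-rate settings above (peak 5e-5, cosine scheduler, warmup ratio 0.03) can be illustrated with a small standalone sketch. It assumes linear warmup followed by cosine decay to zero, which is the usual shape of a cosine schedule but is not confirmed by this card. The step count is a hypothetical upper bound derived from 4,000 samples, batch size 16, and 3 epochs; because packing is enabled, the real run likely used fewer steps.

```python
import math

PEAK_LR = 5e-5
WARMUP_RATIO = 0.03
SAMPLES = 4000        # total from the Training Data table
EFFECTIVE_BATCH = 16
EPOCHS = 3

# HYPOTHETICAL upper bound: packing merges samples, so the actual number
# of optimizer steps was probably lower.
TOTAL_STEPS = math.ceil(SAMPLES / EFFECTIVE_BATCH) * EPOCHS
WARMUP_STEPS = math.ceil(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

With these numbers the warmup covers only the first 23 of 750 steps, so almost the entire run is spent on the cosine decay.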

### Training Results

- **Final train loss:** 0.6953
- **Runtime:** ~153 minutes on A100-80GB

## Distillation Philosophy

We distill **reasoning structure**, not surface tokens. Specifically, the model is encouraged to acquire:

- **Explicit problem decomposition:** break complex questions into sub-goals
- **Assumption checking:** state what's given, what's unknown, and verify constraints
- **Step-by-step derivation:** one logical step per line, no skipped algebra
- **Reflection & backtracking:** recognize dead-ends and revise rather than plow forward
- **Clean answer construction:** separate `<think>` scratch work from the final user-facing answer

This follows the "Claude Opus style" of reasoning: deliberative, self-critical, and structurally transparent.

## Reasoning Scaffold (Learned Pattern)

After fine-tuning, the model tends to produce reasoning traces with this shape:

1. **Restate and parse the task:** identify exactly what is being asked
2. **Plan:** list the approach or sub-problems
3. **Work through each step:** show algebra, logic, or code reasoning explicitly
4. **Verify:** sanity-check the intermediate results before committing
5. **Construct the final answer:** separate, clean, user-facing summary

## Expected Improvements

In practice, the gain is not a dramatic capability jump over the base Qwen3-8B, but rather:

- **Improved stability** in multi-step reasoning
- **Structured, readable traces** instead of rambling CoT
- **Better instruction adherence** when a problem has constraints
- **Fewer hallucinated intermediate steps** thanks to Opus-style self-verification

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NhatCuong22/Qwen3-8B-OpusReasoning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "If a train travels 120km in 2 hours, stops for 30 minutes, then travels 90km in 1.5 hours, what is the average speed for the entire journey?"}
]

# Render the chat template as plain text so the thinking tag can be appended.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
text += "<think>\n"  # Activate thinking mode

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens (everything after the prompt).
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)
```
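
Since the card separates `<think>` scratch work from the final answer, a small post-processing helper can be handy. The sketch below is not part of the model's tooling; `split_reasoning` is a hypothetical helper name. It assumes the decoded response contains a closing `</think>` tag (the opening tag is usually absent because the prompt above already ends with `<think>`) and that end-of-turn markers look like the Qwen-style `<|im_end|>` and `<|endoftext|>` tokens.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Split a decoded response into (reasoning, answer)."""
    end_tag = "</think>"
    reasoning, sep, answer = response.partition(end_tag)
    if not sep:
        # No closing tag: generation was probably cut off mid-reasoning.
        return response.strip(), ""
    # Remove a leading <think> tag if the model emitted one itself.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    # Drop end-of-turn markers (ASSUMED Qwen-style special tokens).
    for token in ("<|im_end|>", "<|endoftext|>"):
        answer = answer.replace(token, "")
    return reasoning, answer.strip()


# Hand-written demo string, not real model output.
demo = "Total distance is 210 km.\n</think>\n\nAbout 52.5 km/h.<|im_end|>"
reasoning, answer = split_reasoning(demo)
print(answer)
```

An empty answer from this helper is a useful signal that `max_new_tokens` was too small for the reasoning trace to finish.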

### Unsloth (faster inference)

```python
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "NhatCuong22/Qwen3-8B-OpusReasoning",
    max_seq_length=16384,
    dtype=torch.bfloat16,
    load_in_4bit=False,  # keep bf16 weights; Unsloth loads in 4-bit when this is True
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode
```

### Recommended Generation Parameters

| Parameter | Reasoning tasks | Creative tasks |
|---|---|---|
| temperature | 0.6 | 0.8 |
| top_p | 0.95 | 0.95 |
| max_new_tokens | 2048-4096 | 1024-2048 |
| repetition_penalty | 1.0 | 1.05 |
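
One way to keep these settings in a single place is a small preset table, sketched below. The names `GENERATION_PRESETS` and `generation_kwargs` are illustrative only, and the sketch picks the upper end of each `max_new_tokens` range as an assumption; adjust to your latency budget.

```python
# Presets copied from the table above; max_new_tokens uses the upper end
# of each recommended range (an ASSUMPTION, not a requirement).
GENERATION_PRESETS = {
    "reasoning": {
        "temperature": 0.6,
        "top_p": 0.95,
        "max_new_tokens": 4096,
        "repetition_penalty": 1.0,
    },
    "creative": {
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": 2048,
        "repetition_penalty": 1.05,
    },
}


def generation_kwargs(task: str, **overrides) -> dict:
    """Return sampling kwargs for a task type, with optional overrides."""
    kwargs = {"do_sample": True, **GENERATION_PRESETS[task]}
    kwargs.update(overrides)
    return kwargs


print(generation_kwargs("reasoning", max_new_tokens=2048))
```

The resulting dictionary can be unpacked into a generate call, for example `model.generate(**inputs, **generation_kwargs("reasoning"))`.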

## Model Architecture

| Parameter | Value |
|---|---|
| Parameters | ~8B |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| Intermediate size | 12,288 |
| Max position embeddings | 40,960 |
| Vocabulary size | 151,936 |
| Precision | bfloat16 |

## Evaluation

Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (5-shot, bf16, A100-80GB).

### Results vs base Qwen3-8B

| Benchmark | Metric | Qwen3-8B-OpusReasoning | Base Qwen3-8B | Δ (percentage points) |
|---|---|---|---|---|
| **MMLU** | accuracy | **75.40%** | 72.93% | **+2.47** ✅ |
| **ARC-Challenge** | acc_norm | **65.87%** | 56.74% | **+9.13** ✅✅ |
| **ARC-Challenge** | accuracy | 64.42% | n/a | n/a |
| **HellaSwag** | acc_norm | **76.98%** | 74.91% | **+2.07** ✅ |
| **HellaSwag** | accuracy | 58.32% | n/a | n/a |
| **GSM8K** | exact_match (strict) | 86.20% | 88.60% | -2.40 |
| **GSM8K** | exact_match (flexible) | 86.66% | n/a | n/a |
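
The Δ column is a plain difference in percentage points, not a relative change. As a quick check, the sketch below recomputes it from the paired scores in the table above; the dictionary name `RESULTS` is illustrative.

```python
# (fine-tuned %, base %) pairs copied from the results table above.
RESULTS = {
    "MMLU (accuracy)": (75.40, 72.93),
    "ARC-Challenge (acc_norm)": (65.87, 56.74),
    "HellaSwag (acc_norm)": (76.98, 74.91),
    "GSM8K (exact_match, strict)": (86.20, 88.60),
}

# Difference in percentage points, rounded to two decimals.
deltas = {name: round(tuned - base, 2) for name, (tuned, base) in RESULTS.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} pp")
```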

### Analysis

- **MMLU +2.47, ARC-Challenge +9.13, HellaSwag +2.07** (percentage points): general-knowledge and commonsense scores hold or improve after reasoning-focused fine-tuning, with no sign of catastrophic forgetting on these benchmarks.
- **ARC-Challenge +9.13** is the largest gain and a strong signal that Opus-style structured reasoning transfers well to scientific reasoning tasks.
- **GSM8K -2.40** is a minor regression, likely because longer `<think>` traces are occasionally truncated by the default `max_gen_toks`; the model still scores about 86% on grade-school math.

More rigorous reasoning benchmarks (MMLU-Pro, MATH-Hard, AIME, IFEval, MuSR) are being evaluated and will be added here.

## Best Suited For

- Mathematical problem solving (arithmetic, algebra, word problems)
- Logical reasoning and deduction
- Code generation with explanation
- Multi-step analytical question answering
- Instruction-following tasks with constraints
- Offline / on-prem reasoning assistants (the bf16 weights alone occupy roughly 16 GB, so allow extra VRAM for the KV cache)
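
For local deployment planning, the following back-of-the-envelope sketch estimates bf16 memory from the figures in the Model Architecture table. It treats the parameter count as exactly 8e9 and assumes a per-head dimension of 128, neither of which is stated precisely in this card, and it ignores activations and framework overhead.

```python
GIB = 1024 ** 3

PARAMS = 8e9          # "~8B" from the architecture table (approximate)
BYTES_PER_VALUE = 2   # bfloat16
LAYERS = 36
KV_HEADS = 8
HEAD_DIM = 128        # ASSUMPTION: not stated in this card

weight_bytes = PARAMS * BYTES_PER_VALUE

# One key and one value vector per KV head, per layer, per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(context_tokens: int) -> float:
    """KV-cache size in GiB for a single sequence of the given length."""
    return context_tokens * kv_bytes_per_token / GIB


print(f"weights: {weight_bytes / GIB:.1f} GiB")
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache at 16,384 tokens: {kv_cache_gib(16384):.2f} GiB")
```

Under these assumptions the weights take about 14.9 GiB and a full 16,384-token context adds about 2.25 GiB of KV cache, so a card with exactly 16 GB is tight for long reasoning traces.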

## Limitations & Intended Use

- **Scale of supervision:** fine-tuned on only ~4K samples, so gains are stylistic and structural, not broad knowledge expansion
- **Hallucination risk:** reasoning traces may confidently cite non-existent facts; verify external claims
- **Opus-style bias:** inherits tendencies of the teacher (e.g., verbosity, occasional over-hedging)
- **Language:** primarily English training data
- **Not verified for safety-critical use:** research and learning only
- **Base model license constraints:** follow the Qwen3 upstream license in commercial settings

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen): Qwen3-8B base model
- [Unsloth](https://github.com/unslothai/unsloth): efficient fine-tuning kernels
- [Anthropic](https://www.anthropic.com/): Claude Opus reasoning (teacher)
- Dataset authors: [Crownelius](https://huggingface.co/Crownelius), [Jackrong](https://huggingface.co/Jackrong)

## Citation

```bibtex
@misc{qwen3-8b-opusreasoning,
  title = {Qwen3-8B-OpusReasoning},
  author = {Vo Nhat Cuong},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NhatCuong22/Qwen3-8B-OpusReasoning}}
}
```

## License

Apache 2.0 (inherits from Qwen3-8B upstream).