Model: NhatCuong22/Qwen3-8B-OpusReasoning Source: Original Platform
| license | base_model | pipeline_tag |
|---|---|---|
| apache-2.0 | unsloth/Qwen3-8B | text-generation |
Qwen3-8B-OpusReasoning
Model Overview
A reasoning-enhanced version of Qwen3-8B, fine-tuned via supervised knowledge distillation from Claude Opus 4.6 reasoning traces.
The goal is not token-level imitation of Opus output, but transfer of its reasoning structure and problem-solving style into a compact 8B model that can run locally. The model outputs structured chain-of-thought inside <think>...</think> tags before generating the final answer, following the Qwen3 thinking-mode convention.
- Base model: unsloth/Qwen3-8B
- Teacher model: Claude Opus 4.6 (reasoning traces, distilled)
- Training type: Supervised Fine-Tuning (SFT) + LoRA → merged bf16
- Framework: Unsloth 2026.4.5 + TRL SFTTrainer
- Precision: bfloat16
- Hardware: 1x NVIDIA A100-SXM4-80GB
Training Data
| Dataset | Samples | Role |
|---|---|---|
| Crownelius/Opus-4.6-Reasoning-3300x | 3,300 | Main distillation — Claude Opus 4.6 reasoning traces |
| Jackrong/Qwen3.5-reasoning-700x | 700 | Auxiliary — supporting reasoning diversity |
| Total | 4,000 | |
Data Characteristics
- Long-form chain-of-thought supervision (<think>...</think>)
- Diverse reasoning domains: math, logic, code, analytical QA
- High-quality Opus 4.6 teacher traces — carefully curated, no noisy labels
- Conversation format compatible with Qwen3 chat template
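A single training record in this style might look like the sketch below. The field names used by the two datasets are not documented here, so the `messages` layout and the `build_sample` helper are assumptions for illustration. Only the `<think>...</think>` wrapping comes from the description above.

```python
def build_sample(question: str, reasoning: str, answer: str) -> dict:
    """Wrap teacher reasoning in <think> tags ahead of the final answer.

    Hypothetical helper: the field names are illustrative, not the datasets' schema.
    """
    assistant = f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"
    return {
        "messages": [
            {"role": "user", "content": question.strip()},
            {"role": "assistant", "content": assistant},
        ]
    }


# Hand-written toy example, not taken from either dataset.
sample = build_sample(
    question="What is 17 * 6?",
    reasoning="17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102.",
    answer="17 * 6 = 102",
)
print(sample["messages"][1]["content"])
```

A list of such `messages` can then be rendered with the tokenizer's chat template before tokenization.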
Training Pipeline
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
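To get a feel for how large this adapter is, the sketch below counts the trainable LoRA parameters implied by the table. It uses the layer shapes from the Model Architecture table and assumes a head dimension of 4096 / 32 = 128 and standard LoRA scaling (alpha / r). Neither is stated explicitly in this card.

```python
# Linear layer shapes (in_features, out_features) for one decoder layer.
# HEAD_DIM = HIDDEN // HEADS is an assumption; the card does not list it.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 12288, 36
HEADS, KV_HEADS = 32, 8
HEAD_DIM = HIDDEN // HEADS

TARGET_SHAPES = {
    "q_proj": (HIDDEN, HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * LAYERS


RANK, ALPHA = 64, 128
trainable = lora_param_count(RANK)
scaling = ALPHA / RANK  # factor applied to the low-rank update B @ A

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter holds about 175M trainable parameters, roughly 2% of the base model, with the low-rank update scaled by 2.0.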
Hyperparameters
| Parameter | Value |
|---|---|
| Effective batch size | 1 × 16 = 16 |
| Learning rate | 5e-5 |
| LR scheduler | Cosine |
| Epochs | 3 |
| Max sequence length | 16,384 |
| Optimizer | AdamW 8-bit |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth |
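The learning-rate settings above (5e-5 peak, 3% warmup, cosine decay) can be sketched as a plain function. This is an illustration of the schedule shape, not the trainer's own code. The total step count is hypothetical because packing makes the real number of steps depend on sequence lengths, and rounding the warmup length up is an assumption. The "1 × 16" batch size is read here as a per-device batch of 1 with 16 gradient-accumulation steps, which is also an assumption.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 5e-5,
          warmup_ratio: float = 0.03) -> float:
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 300  # hypothetical; the real step count depends on packing
for step in (0, 9, 150, 300):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

The rate climbs linearly to the peak during the first 3% of steps, then follows half a cosine wave down to zero at the final step.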
Training Results
- Final train loss: 0.6953
- Runtime: ~153 minutes on A100-80GB
Distillation Philosophy
We distill reasoning structure, not surface tokens. Specifically, the model is encouraged to acquire:
- Explicit problem decomposition — break complex questions into sub-goals
- Assumption checking — state what's given, what's unknown, and verify constraints
- Step-by-step derivation — one logical step per line, no skipped algebra
- Reflection & backtracking — recognize dead-ends and revise rather than plow forward
- Clean answer construction: separate <think> scratch work from the final user-facing answer
This follows the "Claude Opus style" of reasoning — deliberative, self-critical, and structurally transparent.
Reasoning Scaffold (Learned Pattern)
After fine-tuning, the model tends to produce reasoning traces with this shape:
- Restate and parse the task — identify exactly what is being asked
- Plan — list the approach or sub-problems
- Work through each step — show algebra, logic, or code reasoning explicitly
- Verify — sanity-check the intermediate results before committing
- Construct the final answer — separate, clean, user-facing summary
Expected Improvements
In practice, the gain is not a dramatic capability jump over the base Qwen3-8B, but rather:
- Improved stability in multi-step reasoning
- Structured, readable traces instead of rambling CoT
- Better instruction adherence when a problem has constraints
- Fewer hallucinated intermediate steps thanks to Opus-style self-verification
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NhatCuong22/Qwen3-8B-OpusReasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "If a train travels 120km in 2 hours, stops for 30 minutes, then travels 90km in 1.5 hours, what is the average speed for the entire journey?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
text += "<think>\n"  # Activate thinking mode

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)
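Because the prompt above already ends with `<think>`, the decoded response starts inside the reasoning block and contains only the closing tag. A small helper along these lines could separate the scratch work from the final answer. The function name `split_reasoning` is hypothetical, and the demo string is a hand-written stand-in, not real model output.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded generation.

    Handles both a full <think>...</think> block and the case where the
    opening tag was part of the prompt, so only </think> appears.
    """
    # Drop Qwen end-of-turn markers left in by skip_special_tokens=False.
    for tag in ("<|im_end|>", "<|endoftext|>"):
        response = response.replace(tag, "")
    head, sep, tail = response.partition("</think>")
    reasoning = head.replace("<think>", "").strip()
    if not sep:
        # No closing tag: generation was probably cut off mid-reasoning.
        return reasoning, ""
    return reasoning, tail.strip()


# Hand-written stand-in for a decoded response to the train question.
demo = (
    "Total distance is 120 + 90 = 210 km; total time is 2 + 0.5 + 1.5 = 4 h.\n"
    "</think>\n\nAverage speed: 52.5 km/h<|im_end|>"
)
reasoning, answer = split_reasoning(demo)
print(answer)
```

An empty answer is a useful signal that `max_new_tokens` was too small for the reasoning trace to finish.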
Unsloth (faster inference)
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    "NhatCuong22/Qwen3-8B-OpusReasoning",
    max_seq_length=16384,
    dtype=torch.bfloat16,  # pass a torch dtype object rather than a string
)
FastLanguageModel.for_inference(model)  # switch the model to inference mode
Recommended Generation Parameters
| Parameter | Reasoning tasks | Creative tasks |
|---|---|---|
| temperature | 0.6 | 0.8 |
| top_p | 0.95 | 0.95 |
| max_new_tokens | 2048-4096 | 1024-2048 |
| repetition_penalty | 1.0 | 1.05 |
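One way to keep these settings in one place is a small preset table that expands into keyword arguments for `model.generate`. This is a convenience sketch: the `GENERATION_PRESETS` name and `generation_kwargs` helper are hypothetical, and the upper end of each `max_new_tokens` range is chosen arbitrarily.

```python
GENERATION_PRESETS = {
    "reasoning": {
        "temperature": 0.6,
        "top_p": 0.95,
        "max_new_tokens": 4096,  # upper end of the 2048-4096 range
        "repetition_penalty": 1.0,
    },
    "creative": {
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": 2048,  # upper end of the 1024-2048 range
        "repetition_penalty": 1.05,
    },
}


def generation_kwargs(task: str, **overrides) -> dict:
    """Return sampling kwargs for model.generate(); overrides win over presets."""
    if task not in GENERATION_PRESETS:
        raise ValueError(f"unknown task {task!r}, expected one of {sorted(GENERATION_PRESETS)}")
    return {"do_sample": True, **GENERATION_PRESETS[task], **overrides}


print(generation_kwargs("reasoning", max_new_tokens=2048))
```

The result can be passed as `model.generate(**inputs, **generation_kwargs("reasoning"))`.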
Model Architecture
| Parameter | Value |
|---|---|
| Parameters | ~8B |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| Intermediate size | 12,288 |
| Max position embeddings | 40,960 |
| Vocabulary size | 151,936 |
| Precision | bfloat16 |
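The "~8B" figure can be roughly reproduced from the shapes in the table. The estimate below ignores the small normalization weights and makes two assumptions not stated in this card: a head dimension of 4096 / 32 = 128, and separate (untied) input and output embedding matrices.

```python
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 151_936, 4_096, 12_288, 36
HEADS, KV_HEADS = 32, 8
HEAD_DIM = HIDDEN // HEADS  # assumption: 128, not listed in the table

# q_proj and o_proj use all heads; k_proj and v_proj use only the KV heads (GQA).
attention = 2 * HIDDEN * HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM
# gate_proj, up_proj and down_proj all have HIDDEN x INTERMEDIATE weights.
mlp = 3 * HIDDEN * INTERMEDIATE
# Assumption: token embeddings and the output head are separate matrices.
embeddings = 2 * VOCAB * HIDDEN

total = LAYERS * (attention + mlp) + embeddings
print(f"{total / 1e9:.2f}B parameters (normalization weights ignored)")
```

Under these assumptions the count comes to about 8.19B, of which the two embedding matrices contribute roughly 1.24B.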
Evaluation
Evaluated with lm-evaluation-harness (5-shot, bf16, A100-80GB).
Results vs base Qwen3-8B
| Benchmark | Metric | Qwen3-8B-OpusReasoning | Base Qwen3-8B | Δ |
|---|---|---|---|---|
| MMLU | accuracy | 75.40% | 72.93% | +2.47 ✅ |
| ARC-Challenge | acc_norm | 65.87% | 56.74% | +9.13 ✅✅ |
| ARC-Challenge | accuracy | 64.42% | — | — |
| HellaSwag | acc_norm | 76.98% | 74.91% | +2.07 ✅ |
| HellaSwag | accuracy | 58.32% | — | — |
| GSM8K | exact_match (strict) | 86.20% | 88.60% | -2.40 |
| GSM8K | exact_match (flexible) | 86.66% | — | — |
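The Δ column is the fine-tuned score minus the base score, in percentage points. The short check below recomputes it from the rows where both scores are reported. The numbers are copied from the table and nothing is re-evaluated.

```python
# (fine-tuned %, base %) copied from the table; rows without a base score are omitted.
RESULTS = {
    ("MMLU", "accuracy"): (75.40, 72.93),
    ("ARC-Challenge", "acc_norm"): (65.87, 56.74),
    ("HellaSwag", "acc_norm"): (76.98, 74.91),
    ("GSM8K", "exact_match (strict)"): (86.20, 88.60),
}


def delta_pp(fine_tuned: float, base: float) -> float:
    """Difference in percentage points, rounded like the table."""
    return round(fine_tuned - base, 2)


deltas = {key: delta_pp(*scores) for key, scores in RESULTS.items()}
for (bench, metric), delta in deltas.items():
    print(f"{bench:14s} {metric:22s} {delta:+.2f} pp")
```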
Analysis
- MMLU +2.47, ARC-Challenge +9.13, HellaSwag +2.07 — the model retains and slightly improves general knowledge and commonsense reasoning after reasoning-focused fine-tuning, without catastrophic forgetting.
- ARC-Challenge +9.13 is a strong signal that Opus-style structured reasoning transfers well to scientific reasoning tasks.
- GSM8K -2.40 is a minor regression, likely because longer <think> traces are occasionally truncated by the default max_gen_toks; the model still scores above 86% on grade-school math.
More rigorous reasoning benchmarks (MMLU-Pro, MATH-Hard, AIME, IFEval, MuSR) are being evaluated and will be added here.
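To judge how much weight the GSM8K gap deserves, a rough binomial standard error helps. The sketch below assumes the evaluation used the standard GSM8K test split of 1,319 problems and treats the two runs as independent samples. That treatment is conservative, since both models answer the same questions. It is a back-of-the-envelope check, not part of the reported evaluation.

```python
import math


def binomial_se(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent items."""
    return math.sqrt(acc * (1.0 - acc) / n)


N_GSM8K = 1319  # assumption: the standard GSM8K test split
ft, base = 0.8620, 0.8860  # strict exact_match scores from the results table

# Standard error of the difference between two independent accuracies.
se_diff = math.sqrt(binomial_se(ft, N_GSM8K) ** 2 + binomial_se(base, N_GSM8K) ** 2)
z = (ft - base) / se_diff

print(f"difference: {100 * (ft - base):+.2f} pp")
print(f"standard error of the difference: {100 * se_diff:.2f} pp")
print(f"z-score: {z:.2f}")
```

A gap of under two standard errors is consistent with the "minor regression" reading, though a paired comparison on per-item results would be more informative.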
Best Suited For
- Mathematical problem solving (arithmetic, algebra, word problems)
- Logical reasoning and deduction
- Code generation with explanation
- Multi-step analytical question answering
- Instruction-following tasks with constraints
- Offline / on-prem reasoning assistants (the bf16 weights alone take roughly 16 GB, so plan for a 24 GB GPU or a quantized build)
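The memory requirement in the last bullet can be sanity-checked with a back-of-the-envelope estimate. The sketch assumes about 8.19B parameters (an estimate from the architecture table with untied embeddings), a head dimension of 128, and a bfloat16 KV cache. Activations and framework overhead are ignored, so real usage will be higher.

```python
GIB = 1024 ** 3

N_PARAMS = 8.19e9    # assumption: estimated from the architecture table
BYTES_PER_VALUE = 2  # bfloat16
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128  # HEAD_DIM = 4096 / 32 is an assumption

weights_gib = N_PARAMS * BYTES_PER_VALUE / GIB


def kv_cache_gib(tokens: int) -> float:
    """Keys and values for every layer and KV head, stored in bfloat16."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens / GIB


for ctx in (2_048, 16_384):
    total = weights_gib + kv_cache_gib(ctx)
    print(f"context {ctx:>6}: weights {weights_gib:.1f} GiB "
          f"+ KV cache {kv_cache_gib(ctx):.2f} GiB = {total:.1f} GiB")
```

The weights alone come to a little over 15 GiB, which leaves almost no headroom on a 16 GB card once the KV cache and activations are added.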
Limitations & Intended Use
- Scale of supervision: fine-tuned on only ~4K samples — gains are stylistic and structural, not broad knowledge expansion
- Hallucination risk: reasoning traces may confidently cite non-existent facts; verify external claims
- Opus-style bias: inherits tendencies of the teacher (e.g., verbosity, occasional over-hedging)
- Language: primarily English training data
- Not verified for safety-critical use — research and learning only
- Base model license constraints: follow Qwen3 upstream license in commercial settings
Acknowledgments
- Qwen Team — Qwen3-8B base model
- Unsloth — efficient fine-tuning kernels
- Anthropic — Claude Opus reasoning (teacher)
- Dataset authors: Crownelius, Jackrong
Citation
@misc{qwen3-8b-opusreasoning,
title = {Qwen3-8B-OpusReasoning},
author = {Vo Nhat Cuong},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/NhatCuong22/Qwen3-8B-OpusReasoning}}
}
License
Apache 2.0 (inherits from Qwen3-8B upstream).