---
language:
- en
license: apache-2.0
base_model: unsloth/Qwen3-8B
tags:
- reasoning
- qwen3
- lora
- unsloth
- distillation
- claude-opus
- chain-of-thought
datasets:
- Crownelius/Opus-4.6-Reasoning-3300x
- Jackrong/Qwen3.5-reasoning-700x
model-index:
- name: Qwen3-8B-OpusReasoning
  results: []
pipeline_tag: text-generation
---

# Qwen3-8B-OpusReasoning

## Model Overview

A reasoning-enhanced version of [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), fine-tuned via supervised knowledge distillation from **Claude Opus 4.6** reasoning traces.

The goal is not token-level imitation of Opus output, but transfer of its **reasoning structure and problem-solving style** into a compact 8B model that can run locally. The model outputs structured chain-of-thought inside `<think>...</think>` tags before generating the final answer, following the Qwen3 thinking-mode convention.

- **Base model:** [unsloth/Qwen3-8B](https://huggingface.co/unsloth/Qwen3-8B)
- **Teacher model:** Claude Opus 4.6 (reasoning traces, distilled)
- **Training type:** Supervised Fine-Tuning (SFT) + LoRA → merged bf16
- **Framework:** [Unsloth](https://github.com/unslothai/unsloth) 2026.4.5 + TRL SFTTrainer
- **Precision:** bfloat16
- **Hardware:** 1x NVIDIA A100-SXM4-80GB

## Training Data

| Dataset | Samples | Role |
|---|---|---|
| [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | 3,300 | Main distillation — Claude Opus 4.6 reasoning traces |
| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | 700 | Auxiliary — supporting reasoning diversity |
| **Total** | **4,000** | |

### Data Characteristics

- Long-form chain-of-thought supervision (`<think>...</think>`)
- Diverse reasoning domains: math, logic, code, analytical QA
- Curated Opus 4.6 teacher traces, filtered to reduce noisy labels
- Conversation format compatible with Qwen3 chat template
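
As an illustration of the last point, the sketch below turns one distillation sample into Qwen3-style chat messages, with the teacher trace wrapped in `<think>` tags ahead of the answer. The field names `question`, `reasoning`, and `answer` are hypothetical; the actual dataset columns may differ.

```python
def to_qwen3_messages(sample: dict) -> list[dict]:
    """Convert one sample into chat messages with an explicit <think> block.

    The keys "question", "reasoning" and "answer" are assumed for
    illustration and are not taken from the dataset cards.
    """
    assistant = (
        "<think>\n" + sample["reasoning"].strip() + "\n</think>\n\n"
        + sample["answer"].strip()
    )
    return [
        {"role": "user", "content": sample["question"].strip()},
        {"role": "assistant", "content": assistant},
    ]


example = {
    "question": "What is 17 * 6?",
    "reasoning": "17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102.",
    "answer": "102",
}
messages = to_qwen3_messages(example)
print(messages[1]["content"])
```

A list shaped like `messages` can then be rendered into training text with `tokenizer.apply_chat_template`.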

## Training Pipeline

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
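
For a sense of scale, the sketch below estimates how many trainable parameters this LoRA setup adds, using the dimensions from the Model Architecture table. The head dimension of 128 is an assumption, since this card does not list it; each adapted linear layer is assumed to add the usual pair of low-rank matrices.

```python
# Rough count of trainable LoRA parameters for the configuration above.
# Each adapted linear layer (d_in -> d_out) adds two matrices:
# A with shape (r, d_in) and B with shape (d_out, r).

RANK = 64
ALPHA = 128
HIDDEN = 4096          # from the Model Architecture table
INTERMEDIATE = 12288   # from the Model Architecture table
LAYERS = 36
Q_HEADS, KV_HEADS = 32, 8
HEAD_DIM = 128         # assumption: not listed in this card

# (d_in, d_out) for every target module in one transformer layer
shapes = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update B @ A

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters come to roughly 175M parameters, about 2% of the ~8B base weights.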

### Hyperparameters

| Parameter | Value |
|---|---|
| Effective batch size | 1 (per-device batch) × 16 (gradient accumulation) = 16 |
| Learning rate | 5e-5 |
| LR scheduler | Cosine |
| Epochs | 3 |
| Max sequence length | 16,384 |
| Optimizer | AdamW 8-bit |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth |

### Training Results

- **Final train loss:** 0.6953
- **Runtime:** ~153 minutes on A100-80GB

## Distillation Philosophy

We distill **reasoning structure**, not surface tokens. Specifically, the model is encouraged to acquire:

- **Explicit problem decomposition** — break complex questions into sub-goals
- **Assumption checking** — state what's given, what's unknown, and verify constraints
- **Step-by-step derivation** — one logical step per line, no skipped algebra
- **Reflection & backtracking** — recognize dead-ends and revise rather than plow forward
- **Clean answer construction** — separate `<think>` scratch work from the final user-facing answer

This follows the "Claude Opus style" of reasoning — deliberative, self-critical, and structurally transparent.

## Reasoning Scaffold (Learned Pattern)

After fine-tuning, the model tends to produce reasoning traces with this shape:

1. **Restate and parse the task** — identify exactly what is being asked
2. **Plan** — list the approach or sub-problems
3. **Work through each step** — show algebra, logic, or code reasoning explicitly
4. **Verify** — sanity-check the intermediate results before committing
5. **Construct the final answer** — separate, clean, user-facing summary

## Expected Improvements

In practice, the gain is not a dramatic capability jump over the base Qwen3-8B, but rather:

- **Improved stability** in multi-step reasoning
- **Structured, readable traces** instead of rambling CoT
- **Better instruction adherence** when a problem has constraints
- **Fewer hallucinated intermediate steps** thanks to Opus-style self-verification

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NhatCuong22/Qwen3-8B-OpusReasoning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "If a train travels 120km in 2 hours, stops for 30 minutes, then travels 90km in 1.5 hours, what is the average speed for the entire journey?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
text += "<think>\n"  # Activate thinking mode

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)
```
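
The decoded `response` above contains the reasoning trace followed by the final answer. A minimal sketch for separating the two is shown below; `split_reasoning` is a hypothetical helper, not part of any library. It tolerates a missing opening `<think>` tag, because the example appends that tag to the prompt rather than generating it, and it treats a missing `</think>` as a trace that was cut off before the answer.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded generation."""
    head, closing_tag, tail = response.partition("</think>")
    reasoning = head.replace("<think>", "", 1).strip()
    if not closing_tag:
        # No closing tag: the trace was probably cut off by max_new_tokens.
        return reasoning, ""
    # Drop Qwen's end-of-turn marker, kept because skip_special_tokens=False.
    answer = tail.replace("<|im_end|>", "").strip()
    return reasoning, answer


demo = "Total distance is 210 km over 4 hours.\n</think>\n\n52.5 km/h<|im_end|>"
reasoning, answer = split_reasoning(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```

An empty answer is a useful signal to retry with a larger `max_new_tokens`.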

### Unsloth (faster inference)

```python
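import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "NhatCuong22/Qwen3-8B-OpusReasoning",
    max_seq_length=16384,
    dtype=torch.bfloat16,  # pass a torch dtype rather than a string
    load_in_4bit=False,    # keep the merged bf16 weights; Unsloth loads in 4-bit by default
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode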
```

### Recommended Generation Parameters

| Parameter | Reasoning tasks | Creative tasks |
|---|---|---|
| temperature | 0.6 | 0.8 |
| top_p | 0.95 | 0.95 |
| max_new_tokens | 2048-4096 | 1024-2048 |
| repetition_penalty | 1.0 | 1.05 |
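
One way to keep these settings in code is a small preset table, sketched below. The names `GENERATION_PRESETS` and `generation_kwargs` are illustrative, and where the table gives a range for `max_new_tokens` the upper end is used as an arbitrary choice.

```python
# Presets mirroring the table above. Where the table gives a range for
# max_new_tokens, the upper end is used here as an arbitrary choice.
GENERATION_PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95,
                  "max_new_tokens": 4096, "repetition_penalty": 1.0},
    "creative": {"temperature": 0.8, "top_p": 0.95,
                 "max_new_tokens": 2048, "repetition_penalty": 1.05},
}


def generation_kwargs(task: str = "reasoning") -> dict:
    """Return sampling keyword arguments for the given task type."""
    return {"do_sample": True, **GENERATION_PRESETS[task]}


print(generation_kwargs("creative"))
```

The returned dictionary can be unpacked into a call such as `model.generate(**inputs, **generation_kwargs("reasoning"))`.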

## Model Architecture

| Parameter | Value |
|---|---|
| Parameters | ~8B |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| Intermediate size | 12,288 |
| Max position embeddings | 40,960 |
| Vocabulary size | 151,936 |
| Precision | bfloat16 |

## Evaluation

Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (5-shot, bf16, A100-80GB).

### Results vs base Qwen3-8B

| Benchmark | Metric | Qwen3-8B-OpusReasoning | Base Qwen3-8B | Δ |
|---|---|---|---|---|
| **MMLU** | accuracy | **75.40%** | 72.93% | **+2.47** ✅ |
| **ARC-Challenge** | acc_norm | **65.87%** | 56.74% | **+9.13** ✅✅ |
| **ARC-Challenge** | accuracy | 64.42% | — | — |
| **HellaSwag** | acc_norm | **76.98%** | 74.91% | **+2.07** ✅ |
| **HellaSwag** | accuracy | 58.32% | — | — |
| **GSM8K** | exact_match (strict) | 86.20% | 88.60% | -2.40 |
| **GSM8K** | exact_match (flexible) | 86.66% | — | — |
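
The Δ column is the fine-tuned score minus the base score, in percentage points. The short check below recomputes it from the numbers in the table; it only covers rows where a base score is reported.

```python
# (fine-tuned score, base score) in percent, copied from the table above
scores = {
    "MMLU (accuracy)": (75.40, 72.93),
    "ARC-Challenge (acc_norm)": (65.87, 56.74),
    "HellaSwag (acc_norm)": (76.98, 74.91),
    "GSM8K (exact_match, strict)": (86.20, 88.60),
}

# Difference in percentage points, rounded to match the table
deltas = {name: round(tuned - base, 2) for name, (tuned, base) in scores.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```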

### Analysis

- **MMLU +2.47, ARC-Challenge +9.13, HellaSwag +2.07:** general-knowledge and commonsense scores are at or above the base model after reasoning-focused fine-tuning, with no sign of catastrophic forgetting on these benchmarks.
- **ARC-Challenge +9.13** is the largest gain and suggests that Opus-style structured reasoning transfers to science questions.
- **GSM8K -2.40** is a small regression, likely because longer `<think>` traces are occasionally truncated by the default `max_gen_toks`; the model still scores 86.20% (strict) and 86.66% (flexible) on grade-school math.

More rigorous reasoning benchmarks (MMLU-Pro, MATH-Hard, AIME, IFEval, MuSR) are being evaluated and will be added here.

## Best Suited For

- Mathematical problem solving (arithmetic, algebra, word problems)
- Logical reasoning and deduction
- Code generation with explanation
- Multi-step analytical question answering
- Instruction-following tasks with constraints
- Offline / on-prem reasoning assistants (the bf16 weights alone take about 16 GB, so plan for a 24 GB GPU, or quantize for smaller cards)
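
For deployment planning, a back-of-the-envelope memory estimate for bf16 inference can be sketched as follows. The parameter count of 8.2B and the head dimension of 128 are assumptions (the card only states "~8B" and does not list a head dimension), and activation and framework overhead are ignored.

```python
# Back-of-the-envelope memory estimate for bf16 inference.
BYTES_PER_VALUE = 2          # bfloat16
PARAMS = 8.2e9               # assumption: the card only says "~8B"
LAYERS, KV_HEADS = 36, 8     # from the Model Architecture table
HEAD_DIM = 128               # assumption: not listed in this card

weights_gb = PARAMS * BYTES_PER_VALUE / 1e9


def kv_cache_gb(tokens: int) -> float:
    """Size of the key and value caches for one sequence of `tokens`."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
    return tokens * per_token / 1e9


print(f"weights: {weights_gb:.1f} GB")
print(f"KV cache at 4,096 tokens: {kv_cache_gb(4096):.2f} GB")
print(f"KV cache at 16,384 tokens: {kv_cache_gb(16384):.2f} GB")
```

Under these assumptions the weights come to about 16.4 GB before any cache is allocated, and the cache grows linearly with context length.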

## Limitations & Intended Use

- **Scale of supervision:** fine-tuned on only ~4K samples — gains are stylistic and structural, not broad knowledge expansion
- **Hallucination risk:** reasoning traces may confidently cite non-existent facts; verify external claims
- **Opus-style bias:** inherits tendencies of the teacher (e.g., verbosity, occasional over-hedging)
- **Language:** primarily English training data
- **Not verified for safety-critical use** — research and learning only
- **Base model license constraints:** follow Qwen3 upstream license in commercial settings

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) — Qwen3-8B base model
- [Unsloth](https://github.com/unslothai/unsloth) — efficient fine-tuning kernels
- [Anthropic](https://www.anthropic.com/) — Claude Opus reasoning (teacher)
- Dataset authors: [Crownelius](https://huggingface.co/Crownelius), [Jackrong](https://huggingface.co/Jackrong)

## Citation

```bibtex
@misc{qwen3-8b-opusreasoning,
  title        = {Qwen3-8B-OpusReasoning},
  author       = {Vo Nhat Cuong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NhatCuong22/Qwen3-8B-OpusReasoning}}
}
```

## License

Apache 2.0 (inherits from Qwen3-8B upstream).