ModelHub XC 119198d6e6 Initialize project; model provided by the ModelHub XC community
Model: NhatCuong22/Qwen3-8B-OpusReasoning
Source: Original Platform
2026-05-12 10:45:21 +08:00

---
language:
- en
license: apache-2.0
base_model: unsloth/Qwen3-8B
tags:
- reasoning
- qwen3
- lora
- unsloth
- distillation
- claude-opus
- chain-of-thought
datasets:
- Crownelius/Opus-4.6-Reasoning-3300x
- Jackrong/Qwen3.5-reasoning-700x
model-index:
- name: Qwen3-8B-OpusReasoning
  results: []
pipeline_tag: text-generation
---
# Qwen3-8B-OpusReasoning
## Model Overview
A reasoning-enhanced version of [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), fine-tuned via supervised knowledge distillation from **Claude Opus 4.6** reasoning traces.
The goal is not token-level imitation of Opus output, but transfer of its **reasoning structure and problem-solving style** into a compact 8B model that can run locally. The model outputs structured chain-of-thought inside `<think>...</think>` tags before generating the final answer, following the Qwen3 thinking-mode convention.
- **Base model:** [unsloth/Qwen3-8B](https://huggingface.co/unsloth/Qwen3-8B)
- **Teacher model:** Claude Opus 4.6 (reasoning traces, distilled)
- **Training type:** Supervised Fine-Tuning (SFT) + LoRA → merged bf16
- **Framework:** [Unsloth](https://github.com/unslothai/unsloth) 2026.4.5 + TRL SFTTrainer
- **Precision:** bfloat16
- **Hardware:** 1x NVIDIA A100-SXM4-80GB
## Training Data
| Dataset | Samples | Role |
|---|---|---|
| [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | 3,300 | Main distillation — Claude Opus 4.6 reasoning traces |
| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | 700 | Auxiliary — supporting reasoning diversity |
| **Total** | **4,000** | |
### Data Characteristics
- Long-form chain-of-thought supervision (`<think>...</think>`)
- Diverse reasoning domains: math, logic, code, analytical QA
- High-quality Opus 4.6 teacher traces — carefully curated, no noisy labels
- Conversation format compatible with Qwen3 chat template
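
To make the conversation format above concrete, here is a minimal sketch of how one sample could be assembled, with the reasoning wrapped in `<think>...</think>` ahead of the final answer. The helper name `to_chat_sample` and the field layout are assumptions for illustration; the actual dataset schemas may differ.

```python
def to_chat_sample(question: str, reasoning: str, answer: str) -> list[dict]:
    """Build a two-turn chat sample whose assistant turn carries a <think> block.

    Hypothetical helper: the real datasets may store these fields differently.
    """
    assistant = f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"
    return [
        {"role": "user", "content": question.strip()},
        {"role": "assistant", "content": assistant},
    ]


sample = to_chat_sample(
    question="What is 17 * 6?",
    reasoning="17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102.",
    answer="102",
)
print(sample[1]["content"])
```

A list of role/content dictionaries like this can be passed to `tokenizer.apply_chat_template` to produce the training text.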
## Training Pipeline
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
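
As a rough sanity check on adapter size, the sketch below counts the LoRA parameters this configuration implies, using the dimensions listed under Model Architecture. It assumes a head dimension of 128 (hidden size divided by attention heads), the standard `alpha / r` scaling, and no bias terms; none of these is stated in this card.

```python
# Dimensions from the Model Architecture table; HEAD_DIM is an assumption
# (hidden size / attention heads), not a value stated in this card.
HIDDEN = 4096
INTERMEDIATE = 12288
LAYERS = 36
HEADS, KV_HEADS = 32, 8
HEAD_DIM = HIDDEN // HEADS  # 128, assumed

RANK, ALPHA = 64, 128

# (in_features, out_features) of each targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Size of the A (rank x in) and B (out x rank) adapter matrices combined."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update (standard LoRA, assumed)

print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters total:     {total:,}")
print(f"LoRA scaling (alpha / r):     {scaling}")
```

Under these assumptions the adapters add roughly 175M trainable parameters, about 2% of the ~8B base.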
### Hyperparameters
| Parameter | Value |
|---|---|
| Effective batch size | 1 per device × 16 gradient-accumulation steps = 16 |
| Learning rate | 5e-5 |
| LR scheduler | Cosine |
| Epochs | 3 |
| Max sequence length | 16,384 |
| Optimizer | AdamW 8-bit |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth |
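
The learning-rate settings above can be illustrated with a small sketch of linear warmup followed by cosine decay. This mirrors the common warmup-plus-cosine shape and is not taken from the training script; `TOTAL_STEPS` is a hypothetical value, since with packing enabled the real step count depends on the total token count.

```python
import math

PEAK_LR = 5e-5
WARMUP_RATIO = 0.03


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 300  # hypothetical; the card does not report the step count
for step in (0, 5, 150, 300):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```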
### Training Results
- **Final train loss:** 0.6953
- **Runtime:** ~153 minutes on A100-80GB
## Distillation Philosophy
We distill **reasoning structure**, not surface tokens. Specifically, the model is encouraged to acquire:
- **Explicit problem decomposition** — break complex questions into sub-goals
- **Assumption checking** — state what's given, what's unknown, and verify constraints
- **Step-by-step derivation** — one logical step per line, no skipped algebra
- **Reflection & backtracking** — recognize dead-ends and revise rather than plow forward
- **Clean answer construction** — separate `<think>` scratch work from the final user-facing answer

This follows the "Claude Opus style" of reasoning: deliberative, self-critical, and structurally transparent.
## Reasoning Scaffold (Learned Pattern)
After fine-tuning, the model tends to produce reasoning traces with this shape:
1. **Restate and parse the task** — identify exactly what is being asked
2. **Plan** — list the approach or sub-problems
3. **Work through each step** — show algebra, logic, or code reasoning explicitly
4. **Verify** — sanity-check the intermediate results before committing
5. **Construct the final answer** — separate, clean, user-facing summary
## Expected Improvements
In practice, the gain is not a dramatic capability jump over the base Qwen3-8B, but rather:
- **Improved stability** in multi-step reasoning
- **Structured, readable traces** instead of rambling CoT
- **Better instruction adherence** when a problem has constraints
- **Fewer hallucinated intermediate steps** thanks to Opus-style self-verification
## Usage
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NhatCuong22/Qwen3-8B-OpusReasoning"

# Load the merged bf16 checkpoint; device_map="auto" spreads it across available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "If a train travels 120km in 2 hours, stops for 30 minutes, then travels 90km in 1.5 hours, what is the average speed for the entire journey?"}
]

# Render the chat template as text, then prefill the opening tag so that
# generation starts inside the reasoning block.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
text += "<think>\n"  # Activate thinking mode

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens; special tokens such as <|im_end|> are kept.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)
```
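
Because the snippet above prefills `<think>` in the prompt, the decoded `response` starts inside the reasoning block and may end with the `<|im_end|>` token. The following is a minimal sketch of separating the scratch work from the final answer; `split_reasoning` is a hypothetical helper, not part of any library.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Split a decoded response into (reasoning, final_answer).

    Handles responses with or without the opening <think> tag. If the closing
    tag is missing (for example, generation hit the token limit), everything
    is treated as reasoning and the answer is empty.
    """
    text = response.replace("<|im_end|>", "").strip()
    if text.startswith("<think>"):
        text = text[len("<think>"):]
    reasoning, closing_tag, answer = text.partition("</think>")
    if not closing_tag:
        return text.strip(), ""
    return reasoning.strip(), answer.strip()


demo = "Total distance is 210 km over 4 hours.\n</think>\n\n52.5 km/h<|im_end|>"
reasoning, answer = split_reasoning(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```

An empty answer is a useful signal that `max_new_tokens` was too small for the trace.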
### Unsloth (faster inference)
```python
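import torch
from unsloth import FastLanguageModel

# Load the model and tokenizer with the same context length used in training.
model, tokenizer = FastLanguageModel.from_pretrained(
    "NhatCuong22/Qwen3-8B-OpusReasoning",
    max_seq_length=16384,
    dtype=torch.bfloat16,
)

# Switch Unsloth to its inference-optimized mode before calling generate().
FastLanguageModel.for_inference(model)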
```
### Recommended Generation Parameters
| Parameter | Reasoning tasks | Creative tasks |
|---|---|---|
| temperature | 0.6 | 0.8 |
| top_p | 0.95 | 0.95 |
| max_new_tokens | 2048-4096 | 1024-2048 |
| repetition_penalty | 1.0 | 1.05 |
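
One way to keep these settings in one place is a small preset table. The sketch below copies the values from the table above; the names `GENERATION_PRESETS` and `generation_kwargs` are hypothetical, and the choice between the low and high end of each `max_new_tokens` range is left to the caller.

```python
# Values copied from the table above; token budgets are (low, high) ranges.
GENERATION_PRESETS = {
    "reasoning": {
        "temperature": 0.6,
        "top_p": 0.95,
        "repetition_penalty": 1.0,
        "max_new_tokens": (2048, 4096),
    },
    "creative": {
        "temperature": 0.8,
        "top_p": 0.95,
        "repetition_penalty": 1.05,
        "max_new_tokens": (1024, 2048),
    },
}


def generation_kwargs(task: str, long_output: bool = False) -> dict:
    """Return sampling kwargs for a task, picking one end of the token range."""
    preset = dict(GENERATION_PRESETS[task])  # copy so the table stays intact
    low, high = preset.pop("max_new_tokens")
    preset["max_new_tokens"] = high if long_output else low
    preset["do_sample"] = True  # temperature and top_p only apply when sampling
    return preset


print(generation_kwargs("reasoning"))
print(generation_kwargs("creative", long_output=True))
```

The returned dictionary can be unpacked into a call such as `model.generate(**inputs, **generation_kwargs("reasoning"))`.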
## Model Architecture
| Parameter | Value |
|---|---|
| Parameters | ~8B |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| Intermediate size | 12,288 |
| Max position embeddings | 40,960 |
| Vocabulary size | 151,936 |
| Precision | bfloat16 |
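
The figures in this table also determine how much memory the KV cache needs at inference time. The sketch below estimates it for a single sequence, assuming a head dimension of 128 (hidden size divided by attention heads, which the table does not state) and a bf16 cache.

```python
LAYERS = 36
HIDDEN = 4096
HEADS, KV_HEADS = 32, 8
HEAD_DIM = HIDDEN // HEADS  # 128, assumed rather than stated in the table
BYTES_PER_VALUE = 2         # bfloat16

GIB = 1024 ** 3


def kv_cache_bytes(tokens: int) -> int:
    """Keys plus values across every layer and KV head, for one sequence."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens


for tokens in (2048, 16384, 40960):
    print(f"{tokens:>6} tokens: {kv_cache_bytes(tokens) / GIB:.2f} GiB")
```

With 8 KV heads instead of 32, grouped-query attention makes this cache four times smaller than it would be with one KV head per attention head.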
## Evaluation
Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (5-shot, bf16, A100-80GB).
### Results vs base Qwen3-8B
| Benchmark | Metric | Qwen3-8B-OpusReasoning | Base Qwen3-8B | Δ (pp) |
|---|---|---|---|---|
| **MMLU** | accuracy | **75.40%** | 72.93% | **+2.47** ✅ |
| **ARC-Challenge** | acc_norm | **65.87%** | 56.74% | **+9.13** ✅✅ |
| **ARC-Challenge** | accuracy | 64.42% | — | — |
| **HellaSwag** | acc_norm | **76.98%** | 74.91% | **+2.07** ✅ |
| **HellaSwag** | accuracy | 58.32% | — | — |
| **GSM8K** | exact_match (strict) | 86.20% | 88.60% | -2.40 |
| **GSM8K** | exact_match (flexible) | 86.66% | — | — |
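
The Δ column is the fine-tuned score minus the base score, in percentage points. The short sketch below recomputes it from the rows that report both scores; the dictionary layout is only for illustration.

```python
# benchmark (metric): (fine-tuned %, base %), copied from the table above
RESULTS = {
    "MMLU (accuracy)": (75.40, 72.93),
    "ARC-Challenge (acc_norm)": (65.87, 56.74),
    "HellaSwag (acc_norm)": (76.98, 74.91),
    "GSM8K (exact_match, strict)": (86.20, 88.60),
}


def delta_pp(tuned: float, base: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(tuned - base, 2)


deltas = {name: delta_pp(tuned, base) for name, (tuned, base) in RESULTS.items()}
for name, delta in deltas.items():
    print(f"{name:<30} {delta:+.2f} pp")
```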
### Analysis
- **MMLU +2.47 pp, ARC-Challenge +9.13 pp, HellaSwag +2.07 pp:** general-knowledge and commonsense scores hold up and improve after reasoning-focused fine-tuning, with no sign of catastrophic forgetting on these benchmarks.
- **ARC-Challenge +9.13 pp** is the largest gain and suggests that Opus-style structured reasoning transfers to science-question tasks.
- **GSM8K -2.40 pp** is a minor regression. A plausible but unverified cause is that longer `<think>` traces are occasionally truncated by the default `max_gen_toks`; the model still scores between 86% and 87% on grade-school math.

More rigorous reasoning benchmarks (MMLU-Pro, MATH-Hard, AIME, IFEval, MuSR) are being evaluated and will be added here.
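
The truncation hypothesis for GSM8K can be checked on saved generations by counting responses whose reasoning block was opened but never closed. This is a minimal sketch that assumes the raw responses include the opening `<think>` tag; `is_truncated` and `truncation_rate` are hypothetical helpers.

```python
def is_truncated(response: str) -> bool:
    """True when a <think> block was opened but never closed."""
    return response.count("<think>") > response.count("</think>")


def truncation_rate(responses: list[str]) -> float:
    """Fraction of responses that end inside an unfinished reasoning block."""
    if not responses:
        return 0.0
    return sum(is_truncated(r) for r in responses) / len(responses)


examples = [
    "<think>\n3 + 4 = 7\n</think>\n\n7",
    "<think>\nLet me set up the equation, then",  # cut off mid-trace
]
print(truncation_rate(examples))
```

A high rate would point toward raising the generation budget before concluding that math ability regressed.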
## Best Suited For
- Mathematical problem solving (arithmetic, algebra, word problems)
- Logical reasoning and deduction
- Code generation with explanation
- Multi-step analytical question answering
- Instruction-following tasks with constraints
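- Offline / on-prem reasoning assistants (bf16 weights alone take roughly 16 GB, so a 16 GB GPU leaves little headroom for the KV cache; a 24 GB GPU or quantization is more comfortable)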
## Limitations & Intended Use
- **Scale of supervision:** fine-tuned on only ~4K samples — gains are stylistic and structural, not broad knowledge expansion
- **Hallucination risk:** reasoning traces may confidently cite non-existent facts; verify external claims
- **Opus-style bias:** inherits tendencies of the teacher (e.g., verbosity, occasional over-hedging)
- **Language:** primarily English training data
- **Not verified for safety-critical use** — research and learning only
- **Base model license constraints:** follow Qwen3 upstream license in commercial settings
## Acknowledgments
- [Qwen Team](https://huggingface.co/Qwen) — Qwen3-8B base model
- [Unsloth](https://github.com/unslothai/unsloth) — efficient fine-tuning kernels
- [Anthropic](https://www.anthropic.com/) — Claude Opus reasoning (teacher)
- Dataset authors: [Crownelius](https://huggingface.co/Crownelius), [Jackrong](https://huggingface.co/Jackrong)
## Citation
```bibtex
@misc{qwen3-8b-opusreasoning,
  title = {Qwen3-8B-OpusReasoning},
  author = {Vo Nhat Cuong},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NhatCuong22/Qwen3-8B-OpusReasoning}}
}
```
## License
Apache 2.0 (inherits from Qwen3-8B upstream).