ModelHub XC 119198d6e6 Initialize project; model provided by the ModelHub XC community
Model: NhatCuong22/Qwen3-8B-OpusReasoning
Source: Original Platform
2026-05-12 10:45:21 +08:00


language: en
license: apache-2.0
base_model: unsloth/Qwen3-8B
tags: reasoning, qwen3, lora, unsloth, distillation, claude-opus, chain-of-thought
datasets: Crownelius/Opus-4.6-Reasoning-3300x, Jackrong/Qwen3.5-reasoning-700x
model-index: name: Qwen3-8B-OpusReasoning, results: (none listed)
pipeline_tag: text-generation

Qwen3-8B-OpusReasoning

Model Overview

A reasoning-enhanced version of Qwen3-8B, fine-tuned via supervised knowledge distillation from Claude Opus 4.6 reasoning traces.

The goal is not token-level imitation of Opus output, but transfer of its reasoning structure and problem-solving style into a compact 8B model that can run locally. The model outputs structured chain-of-thought inside <think>...</think> tags before generating the final answer, following the Qwen3 thinking-mode convention.

  • Base model: unsloth/Qwen3-8B
  • Teacher model: Claude Opus 4.6 (reasoning traces, distilled)
  • Training type: Supervised Fine-Tuning (SFT) + LoRA → merged bf16
  • Framework: Unsloth 2026.4.5 + TRL SFTTrainer
  • Precision: bfloat16
  • Hardware: 1x NVIDIA A100-SXM4-80GB

Training Data

| Dataset | Samples | Role |
|---|---|---|
| Crownelius/Opus-4.6-Reasoning-3300x | 3,300 | Main distillation: Claude Opus 4.6 reasoning traces |
| Jackrong/Qwen3.5-reasoning-700x | 700 | Auxiliary: supporting reasoning diversity |
| Total | 4,000 | |

Data Characteristics

  • Long-form chain-of-thought supervision (<think>...</think>)
  • Diverse reasoning domains: math, logic, code, analytical QA
  • High-quality Opus 4.6 teacher traces — carefully curated, no noisy labels
  • Conversation format compatible with Qwen3 chat template
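
As a rough illustration of the format described above, the sketch below renders one sample into a ChatML-style conversation with the reasoning wrapped in <think> tags. The record keys (`question`, `reasoning`, `answer`) and the helper name `format_sample` are hypothetical, since the dataset columns are not documented here, and the exact whitespace emitted by the tokenizer's chat template may differ. In practice `tokenizer.apply_chat_template` is the safer route; this only shows the intended shape.

```python
def format_sample(record: dict) -> str:
    """Render one sample as a ChatML conversation (record keys are hypothetical)."""
    question = record["question"].strip()
    reasoning = record["reasoning"].strip()
    answer = record["answer"].strip()
    # Scratch work goes inside <think>...</think>; the final answer follows it.
    assistant = f"<think>\n{reasoning}\n</think>\n\n{answer}"
    return (
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )


sample = {
    "question": "What is 17 * 6?",
    "reasoning": "17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102.",
    "answer": "102",
}
formatted = format_sample(sample)
print(formatted)
```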

Training Pipeline

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
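
To give a sense of scale for this configuration, the sketch below estimates how many adapter parameters it adds, using the layer shapes from the Model Architecture table. It assumes standard LoRA (scaling = alpha / r, not rsLoRA), bias-free adapters, and a per-head dimension of 128; the head dimension is not stated in this card and is an assumption.

```python
# Shapes from the Model Architecture table; HEAD_DIM = 128 is an assumption.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 12288, 36
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
RANK, ALPHA = 64, 128

# (in_features, out_features) of each targeted projection in one decoder layer
shapes = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters.
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in shapes.values())
lora_params = per_layer * LAYERS
scaling = ALPHA / RANK  # multiplier applied to the low-rank update B @ A

print(f"{lora_params:,} adapter parameters, scaling = {scaling}")
```

Under these assumptions the adapters amount to roughly 2% of the base model's parameter count, and the update is scaled by a factor of 2.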

Hyperparameters

| Parameter | Value |
|---|---|
| Effective batch size | 1 × 16 = 16 |
| Learning rate | 5e-5 |
| LR scheduler | Cosine |
| Epochs | 3 |
| Max sequence length | 16,384 |
| Optimizer | AdamW 8-bit |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth |
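
The learning-rate settings above can be pictured with a small sketch of linear warmup followed by cosine decay to zero, which is the usual shape of a "cosine" scheduler with a warmup ratio. The step count is an assumption: 750 is the upper bound for 4,000 samples at batch size 16 over 3 epochs, and because packing merges short samples into fewer sequences, the real run likely used fewer steps. The function name `lr_at` is hypothetical.

```python
import math

BASE_LR = 5e-5
WARMUP_RATIO = 0.03
# Upper bound without packing: 4,000 samples / 16 per step, over 3 epochs.
TOTAL_STEPS = math.ceil(4000 / 16) * 3


def lr_at(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 23, 375, TOTAL_STEPS):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```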

Training Results

  • Final train loss: 0.6953
  • Runtime: ~153 minutes on A100-80GB

Distillation Philosophy

We distill reasoning structure, not surface tokens. Specifically, the model is encouraged to acquire:

  • Explicit problem decomposition — break complex questions into sub-goals
  • Assumption checking — state what's given, what's unknown, and verify constraints
  • Step-by-step derivation — one logical step per line, no skipped algebra
  • Reflection & backtracking — recognize dead-ends and revise rather than plow forward
  • Clean answer construction — separate <think> scratch work from the final user-facing answer

This follows the "Claude Opus style" of reasoning — deliberative, self-critical, and structurally transparent.

Reasoning Scaffold (Learned Pattern)

After fine-tuning, the model tends to produce reasoning traces with this shape:

  1. Restate and parse the task — identify exactly what is being asked
  2. Plan — list the approach or sub-problems
  3. Work through each step — show algebra, logic, or code reasoning explicitly
  4. Verify — sanity-check the intermediate results before committing
  5. Construct the final answer — separate, clean, user-facing summary

Expected Improvements

In practice, the gain is not a dramatic capability jump over the base Qwen3-8B, but rather:

  • Improved stability in multi-step reasoning
  • Structured, readable traces instead of rambling CoT
  • Better instruction adherence when a problem has constraints
  • Fewer hallucinated intermediate steps thanks to Opus-style self-verification

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NhatCuong22/Qwen3-8B-OpusReasoning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "If a train travels 120km in 2 hours, stops for 30 minutes, then travels 90km in 1.5 hours, what is the average speed for the entire journey?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
text += "<think>\n"  # Pre-fill the opening tag so generation starts inside the reasoning block

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)
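
Because the snippet above pre-fills the opening tag and decodes with special tokens kept, the printed response starts inside the reasoning block and ends with the end-of-turn token. A minimal sketch of separating the scratch work from the final answer follows; `split_reasoning` is a hypothetical helper, not part of any library, and the demo string is hand-written rather than real model output.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Split a decoded response into (reasoning, final_answer)."""
    text = response.replace("<|im_end|>", "").strip()
    # The opening tag may be missing when it was pre-filled in the prompt.
    if text.startswith("<think>"):
        text = text[len("<think>"):]
    reasoning, closing_tag, answer = text.partition("</think>")
    if not closing_tag:
        # No closing tag: the trace was probably cut off by max_new_tokens.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


demo = (
    "Distance: 120 + 90 = 210 km. Time: 2 + 0.5 + 1.5 = 4 h. "
    "Speed: 210 / 4 = 52.5 km/h.\n</think>\n\n"
    "The average speed is 52.5 km/h.<|im_end|>"
)
reasoning, answer = split_reasoning(demo)
print(answer)
```

An empty answer string signals that the reasoning never closed, which is a hint to raise max_new_tokens.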

Unsloth (faster inference)

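from unsloth import FastLanguageModel
import torch

# Load the merged checkpoint; max_seq_length matches the training sequence length
model, tokenizer = FastLanguageModel.from_pretrained(
    "NhatCuong22/Qwen3-8B-OpusReasoning",
    max_seq_length=16384,
    dtype=torch.bfloat16,  # pass the torch dtype object rather than a string
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path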

Recommended Generation Parameters

| Parameter | Reasoning tasks | Creative tasks |
|---|---|---|
| temperature | 0.6 | 0.8 |
| top_p | 0.95 | 0.95 |
| max_new_tokens | 2048-4096 | 1024-2048 |
| repetition_penalty | 1.0 | 1.05 |
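
One way to keep these settings in one place is a small preset table that expands into keyword arguments for `model.generate`. This is only a sketch: the names `GENERATION_PRESETS` and `generation_kwargs` are hypothetical, and the upper end of each max_new_tokens range was picked arbitrarily.

```python
GENERATION_PRESETS = {
    "reasoning": {
        "temperature": 0.6,
        "top_p": 0.95,
        "max_new_tokens": 4096,  # upper end of the 2048-4096 range
        "repetition_penalty": 1.0,
    },
    "creative": {
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": 2048,  # upper end of the 1024-2048 range
        "repetition_penalty": 1.05,
    },
}


def generation_kwargs(task: str, **overrides) -> dict:
    """Return sampling arguments for a task type, with optional overrides."""
    if task not in GENERATION_PRESETS:
        raise ValueError(f"unknown task type: {task!r}")
    # do_sample=True is needed for temperature and top_p to take effect.
    return {**GENERATION_PRESETS[task], "do_sample": True, **overrides}


print(generation_kwargs("reasoning", max_new_tokens=2048))
```

With the Transformers example, this would be used as `model.generate(**inputs, **generation_kwargs("reasoning"))`.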

Model Architecture

| Parameter | Value |
|---|---|
| Parameters | ~8B |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| Intermediate size | 12,288 |
| Max position embeddings | 40,960 |
| Vocabulary size | 151,936 |
| Precision | bfloat16 |
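
The grouped-query attention layout matters mostly for memory at long context lengths. As a back-of-the-envelope sketch, the KV cache size can be estimated from the numbers above. The per-head dimension of 128 is an assumption (it is not listed in this card), and real runtimes add their own overhead on top.

```python
LAYERS = 36
KV_HEADS = 8
HEAD_DIM = 128       # assumption: not listed in the architecture table
BYTES_PER_VALUE = 2  # bfloat16


def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every layer."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 2 = K and V
    return per_token * seq_len * batch_size


for tokens in (2048, 16384, 40960):
    print(f"{tokens:>6} tokens: {kv_cache_bytes(tokens) / 2**30:.2f} GiB")
```

With 8 KV heads instead of 32, the cache is a quarter of what full multi-head attention would need at the same head dimension.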

Evaluation

Evaluated with lm-evaluation-harness (5-shot, bf16, A100-80GB).

Results vs base Qwen3-8B

| Benchmark | Metric | Qwen3-8B-OpusReasoning | Base Qwen3-8B | Δ (points) |
|---|---|---|---|---|
| MMLU | accuracy | 75.40% | 72.93% | +2.47 |
| ARC-Challenge | acc_norm | 65.87% | 56.74% | +9.13 |
| ARC-Challenge | accuracy | 64.42% | n/a | n/a |
| HellaSwag | acc_norm | 76.98% | 74.91% | +2.07 |
| HellaSwag | accuracy | 58.32% | n/a | n/a |
| GSM8K | exact_match (strict) | 86.20% | 88.60% | -2.40 |
| GSM8K | exact_match (flexible) | 86.66% | n/a | n/a |

Analysis

  • MMLU (+2.47), ARC-Challenge (+9.13), and HellaSwag (+2.07) all improve, measured in percentage points: the model retains general knowledge and commonsense reasoning after reasoning-focused fine-tuning, with no sign of catastrophic forgetting on these benchmarks.
  • ARC-Challenge +9.13 is the largest gain and suggests that Opus-style structured reasoning transfers well to scientific reasoning tasks.
  • GSM8K -2.40 is a minor regression, likely because longer <think> traces are occasionally truncated by the default max_gen_toks; the model still scores 86.20% (strict match) on grade-school math.
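
The truncation explanation for GSM8K is a hypothesis, and one way to probe it is to count how many logged generations open a reasoning block without ever closing it. The sketch below assumes the raw generations are available as a list of strings (for example, from logged evaluation samples); `truncation_rate` is a hypothetical helper and the strings shown are made up for illustration.

```python
def truncation_rate(generations: list[str]) -> float:
    """Fraction of generations that open a <think> block but never close it."""
    if not generations:
        return 0.0
    truncated = sum(
        1 for text in generations
        if "<think>" in text and "</think>" not in text
    )
    return truncated / len(generations)


# Made-up examples: the second one stops before the closing tag.
logged = [
    "<think>\n3 + 4 = 7\n</think>\n\nThe answer is 7.",
    "<think>\nLet me set up the equations and work through each case",
    "<think>\n12 / 4 = 3\n</think>\n\nThe answer is 3.",
    "The answer is 5.",
]
print(f"{truncation_rate(logged):.0%} of generations look truncated")
```

If the rate is non-trivial, re-running GSM8K with a larger generation budget would show whether the gap to the base model closes.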

More rigorous reasoning benchmarks (MMLU-Pro, MATH-Hard, AIME, IFEval, MuSR) are being evaluated and will be added here.

Best Suited For

  • Mathematical problem solving (arithmetic, algebra, word problems)
  • Logical reasoning and deduction
  • Code generation with explanation
  • Multi-step analytical question answering
  • Instruction-following tasks with constraints
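  • Offline / on-prem reasoning assistants (the bf16 weights alone take roughly 16 GB, so a 16 GB GPU leaves little room for the KV cache; a 24 GB GPU or quantized weights is the safer choice)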
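
The memory figure in the last bullet can be sanity-checked with simple arithmetic on the weight footprint. The sketch below assumes about 8.2 billion parameters (the card only says ~8B, so this count is an assumption) and ignores quantization metadata, activations, and the KV cache, all of which add to the total.

```python
PARAMS = 8.2e9  # assumption: approximate total parameter count

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_gib(precision: str, params: float = PARAMS) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params * BYTES_PER_PARAM[precision] / 2**30


for precision in BYTES_PER_PARAM:
    print(f"{precision:>5}: {weight_gib(precision):.1f} GiB")
```

Under these assumptions the bf16 weights come to a little over 15 GiB, which is why a 16 GB card is tight once context length grows.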

Limitations & Intended Use

  • Scale of supervision: fine-tuned on only ~4K samples — gains are stylistic and structural, not broad knowledge expansion
  • Hallucination risk: reasoning traces may confidently cite non-existent facts; verify external claims
  • Opus-style bias: inherits tendencies of the teacher (e.g., verbosity, occasional over-hedging)
  • Language: primarily English training data
  • Not verified for safety-critical use — research and learning only
  • Base model license constraints: follow Qwen3 upstream license in commercial settings

Acknowledgments

Citation

@misc{qwen3-8b-opusreasoning,
  title        = {Qwen3-8B-OpusReasoning},
  author       = {Vo Nhat Cuong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NhatCuong22/Qwen3-8B-OpusReasoning}}
}

License

Apache 2.0 (inherits from Qwen3-8B upstream).