base_model, datasets, language, license, library_name, pipeline_tag, tags
base_model datasets language license library_name pipeline_tag tags
Qwen/Qwen3-4B-Instruct-2507
u-10bei/dpo-dataset-qwen-cot
en
apache-2.0 transformers text-generation
dpo
unsloth
qwen
alignment

qwen3-4b-dpo-qwen-cot-merged

This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via the Unsloth library. This repository contains the full-merged 16-bit weights. No adapter loading is required.

Training Configuration

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Method: DPO
  • Epochs: 1
  • Learning rate: 1e-07
  • Beta: 0.1
  • Max sequence length: 1024
  • LoRA Config: r=8, alpha=16 (merged into base)
Description
Model synced from source: Yurori/qwen3-4b-dpo-qwen-cot-merged
Readme 13 MiB
Languages
Jinja 100%