---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- unsloth
- qwen
- alignment
---

# qwen3-4b-dpo-qwen-cot-merged

This model is a fine-tuned version of **Qwen/Qwen3-4B-Instruct-2507** using **Direct Preference Optimization (DPO)** via the **Unsloth** library.

This repository contains the **full-merged 16-bit weights**. No adapter loading is required.

## Training Configuration

- **Base model**: Qwen/Qwen3-4B-Instruct-2507
- **Method**: DPO
- **Epochs**: 1
- **Learning rate**: 1e-07
- **Beta**: 0.1
- **Max sequence length**: 1024
- **LoRA Config**: r=8, alpha=16 (merged into base)
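
To make the role of **Beta** in the configuration above concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair. It assumes the common formulation in which the inputs are summed sequence log-probabilities under the trained policy and the frozen reference model. The function name `dpo_loss` and the example numbers are illustrative and are not taken from this repository's training code.

```python
import math

BETA = 0.1  # value listed in the Training Configuration


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=BETA):
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    All inputs are sequence-level log-probabilities (assumed to be
    summed over response tokens).
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Policy identical to the reference: the margin is 0.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy prefers the chosen response more than the reference does.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. A smaller beta shrinks the margin for the same log-probability shift, so the policy is penalized less for staying close to the reference model.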