---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- unsloth
- sft
- dpo
- cot
- alignment
---

# dpo-qwen-cot-merged

This repository provides a multi-stage fine-tuned version of **Qwen3-4B-Instruct-2507**.

The training pipeline consists of:

1. Supervised Fine-Tuning (SFT)
2. Stage-2 Hard SFT refinement
3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model.
This repository contains the **final merged full-precision weights**.

---

# Training Pipeline

## Stage 1: Supervised Fine-Tuning (SFT)

Base model: `Qwen/Qwen3-4B-Instruct-2507`
Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`

Configuration:

- Method: QLoRA (4-bit, Unsloth)
- LoRA: r=64, alpha=128
- Max sequence length: 512
- Epochs: 2
- Learning rate: 1e-4
- Batch size: 2
- Gradient accumulation: 8
- Warmup ratio: 0.05
- Weight decay: 0.0
- Seed: 3407
- CoT masking: Enabled (loss applied only to final outputs)

---

## Stage 2: Hard Data Refinement

Dataset: `daichira/structured-hard-sft-4k`

Configuration:

- Epochs: 1
- Learning rate: 3e-5
- Same LoRA configuration as Stage 1

This stage improves robustness on difficult structured transformation tasks.

---

## Stage 3: Direct Preference Optimization (DPO)

Dataset: `u-10bei/dpo-dataset-qwen-cot`

Configuration:

- Method: DPO via TRL + Unsloth
- LoRA: r=8, alpha=16
- Learning rate: 1e-7
- Beta: 0.1
- Max sequence length: 1024
- Max prompt length: 512
- Epochs: 1
- Optimizer: adamw_8bit
- Batch size: 2
- Gradient accumulation: 4
- Warmup ratio: 0.1
- Weight decay: 0.01
- Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.

---

# Merge Status

All LoRA adapters have been merged into the base model.
No PEFT loading is required.

---

## Intended Use

This model is designed for:

- Structured transformation tasks
- Chain-of-Thought reasoning
- Preference-aligned generation
- Academic research experiments
- Competition submission

## Research Notes

This work explores multi-stage fine-tuning combining:

- Structured SFT with CoT masking
- Hard data refinement
- Preference-based alignment via DPO

Training was performed with the Unsloth library for memory-efficient 4-bit fine-tuning.

## License

This model follows the license of the base model, `Qwen/Qwen3-4B-Instruct-2507` (declared as `apache-2.0` in this card's metadata).
Users must comply with the original base model license.

# Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

# The adapters are already merged, so the model loads like any causal LM.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."

# Move the tokenized inputs to the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature to take effect
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
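
For readers who want a feel for the Stage 3 objective, the snippet below is a minimal plain-Python sketch of the standard per-example DPO loss with `beta=0.1`, the value listed in the Stage 3 configuration. It is illustrative only and is not the training code used for this model, which relied on TRL and Unsloth. The function names `log_sigmoid` and `dpo_loss` and the example log-probabilities are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # Stage 3 value from this card
) -> float:
    """Per-example DPO loss from summed sequence log-probabilities.

    Hypothetical helper for illustration; inputs are log p(response | prompt)
    under the trained policy and the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the chosen response more
    # strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Made-up log-probabilities: policy identical to the reference.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Made-up log-probabilities: policy shifted toward the chosen response.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(loss_at_init, loss_after_shift)
```

When the policy equals the reference, the margin is zero and the loss is `log(2)`; any shift toward the chosen response lowers it. A small `beta` such as 0.1 scales the margin down, so the loss changes slowly as the policy moves away from the reference.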