---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- unsloth
- sft
- dpo
- cot
- alignment
---

# dpo-qwen-cot-merged

This repository provides a multi-stage fine-tuned version of **Qwen3-4B-Instruct-2507**.

The training pipeline consists of:

1. Supervised Fine-Tuning (SFT)
2. Stage-2 Hard SFT refinement
3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model.
This repository contains the **final merged full-precision weights**.

---

# Training Pipeline

## Stage 1 — Supervised Fine-Tuning (SFT)

Base model: `Qwen/Qwen3-4B-Instruct-2507`  
Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`

Configuration:

- Method: QLoRA (4-bit, Unsloth)
- LoRA: r=64, alpha=128
- Max sequence length: 512
- Epochs: 2
- Learning rate: 1e-4
- Batch size: 2
- Gradient accumulation: 8
- Warmup ratio: 0.05
- Weight decay: 0.0
- Seed: 3407
- CoT masking: Enabled (loss applied only to final outputs)

---

## Stage 2 — Hard Data Refinement

Dataset: `daichira/structured-hard-sft-4k`

Configuration:

- Epochs: 1
- Learning rate: 3e-5
- Same LoRA configuration as Stage 1

This stage improves robustness on difficult structured transformation tasks.

---

## Stage 3 — Direct Preference Optimization (DPO)

Dataset: `u-10bei/dpo-dataset-qwen-cot`

Configuration:

- Method: DPO via TRL + Unsloth
- LoRA: r=8, alpha=16
- Learning rate: 1e-7
- Beta: 0.1
- Max sequence length: 1024
- Max prompt length: 512
- Epochs: 1
- Optimizer: adamw_8bit
- Batch size: 2
- Gradient accumulation: 4
- Warmup ratio: 0.1
- Weight decay: 0.01
- Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.

---

# Merge Status

All LoRA adapters have been merged into the base model.

No PEFT loading is required.

---

## Intended Use

This model is designed for:

- Structured transformation tasks  
- Chain-of-Thought reasoning  
- Preference-aligned generation  
- Academic research experiments  
- Competition submission  


## Research Notes

This work explores multi-stage fine-tuning combining:

- Structured SFT with CoT masking  
- Hard data refinement  
- Preference-based alignment via DPO  

The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.


## License

This model follows the license of the base model:

Qwen/Qwen3-4B-Instruct-2507

Users must comply with the original base model license.

# Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))