dpo-qwen-cot-merged/README.md

---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- unsloth
- sft
- dpo
- cot
- alignment
---

# dpo-qwen-cot-merged

This repository provides a multi-stage fine-tuned version of **Qwen3-4B-Instruct-2507**.

The training pipeline consists of:

1. Supervised Fine-Tuning (SFT)
2. Stage-2 Hard SFT refinement
3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model.
This repository contains the **final merged full-precision weights**.

---

# Training Pipeline

## Stage 1 — Supervised Fine-Tuning (SFT)

Base model: `Qwen/Qwen3-4B-Instruct-2507`  
Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`

Configuration:

- Method: QLoRA (4-bit, Unsloth)
- LoRA: r=64, alpha=128
- Max sequence length: 512
- Epochs: 2
- Learning rate: 1e-4
- Batch size: 2
- Gradient accumulation: 8
- Warmup ratio: 0.05
- Weight decay: 0.0
- Seed: 3407
- CoT masking: Enabled (loss applied only to final outputs)

---

## Stage 2 — Hard Data Refinement

Dataset: `daichira/structured-hard-sft-4k`

Configuration:

- Epochs: 1
- Learning rate: 3e-5
- Same LoRA configuration as Stage 1

This stage improves robustness on difficult structured transformation tasks.

---

## Stage 3 — Direct Preference Optimization (DPO)

Dataset: `u-10bei/dpo-dataset-qwen-cot`

Configuration:

- Method: DPO via TRL + Unsloth
- LoRA: r=8, alpha=16
- Learning rate: 1e-7
- Beta: 0.1
- Max sequence length: 1024
- Max prompt length: 512
- Epochs: 1
- Optimizer: adamw_8bit
- Batch size: 2
- Gradient accumulation: 4
- Warmup ratio: 0.1
- Weight decay: 0.01
- Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.

---

# Merge Status

All LoRA adapters have been merged into the base model.

No PEFT loading is required.

---

## Intended Use

This model is designed for:

- Structured transformation tasks  
- Chain-of-Thought reasoning  
- Preference-aligned generation  
- Academic research experiments  
- Competition submission  


## Research Notes

This work explores multi-stage fine-tuning combining:

- Structured SFT with CoT masking  
- Hard data refinement  
- Preference-based alignment via DPO  

The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.


## License

This model follows the license of the base model:

Qwen/Qwen3-4B-Instruct-2507

Users must comply with the original base model license.

# Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
初始化项目，由ModelHub XC社区提供模型 Model: HuiyuWang/dpo-qwen-cot-merged Source: Original Platform 2026-04-29 10:50:12 +08:00			`---`
			`base_model: Qwen/Qwen3-4B-Instruct-2507`
			`datasets:`
			`- u-10bei/structured_data_with_cot_dataset_512_v5`
			`- daichira/structured-hard-sft-4k`
			`- u-10bei/dpo-dataset-qwen-cot`
			`language:`
			`- en`
			`license: apache-2.0`
			`library_name: transformers`
			`pipeline_tag: text-generation`
			`tags:`
			`- qwen`
			`- unsloth`
			`- sft`
			`- dpo`
			`- cot`
			`- alignment`
			`---`

			`# dpo-qwen-cot-merged`

			`This repository provides a multi-stage fine-tuned version of Qwen3-4B-Instruct-2507.`

			`The training pipeline consists of:`

			`1. Supervised Fine-Tuning (SFT)`
			`2. Stage-2 Hard SFT refinement`
			`3. Direct Preference Optimization (DPO)`

			`The LoRA adapters have been merged into the base model.`
			`This repository contains the final merged full-precision weights.`

			`---`

			`# Training Pipeline`

			`## Stage 1 — Supervised Fine-Tuning (SFT)`

			Base model: `Qwen/Qwen3-4B-Instruct-2507`
			Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`

			`Configuration:`

			`- Method: QLoRA (4-bit, Unsloth)`
			`- LoRA: r=64, alpha=128`
			`- Max sequence length: 512`
			`- Epochs: 2`
			`- Learning rate: 1e-4`
			`- Batch size: 2`
			`- Gradient accumulation: 8`
			`- Warmup ratio: 0.05`
			`- Weight decay: 0.0`
			`- Seed: 3407`
			`- CoT masking: Enabled (loss applied only to final outputs)`

			`---`

			`## Stage 2 — Hard Data Refinement`

			Dataset: `daichira/structured-hard-sft-4k`

			`Configuration:`

			`- Epochs: 1`
			`- Learning rate: 3e-5`
			`- Same LoRA configuration as Stage 1`

			`This stage improves robustness on difficult structured transformation tasks.`

			`---`

			`## Stage 3 — Direct Preference Optimization (DPO)`

			Dataset: `u-10bei/dpo-dataset-qwen-cot`

			`Configuration:`

			`- Method: DPO via TRL + Unsloth`
			`- LoRA: r=8, alpha=16`
			`- Learning rate: 1e-7`
			`- Beta: 0.1`
			`- Max sequence length: 1024`
			`- Max prompt length: 512`
			`- Epochs: 1`
			`- Optimizer: adamw_8bit`
			`- Batch size: 2`
			`- Gradient accumulation: 4`
			`- Warmup ratio: 0.1`
			`- Weight decay: 0.01`
			`- Seed: 42`

			`The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.`

			`---`

			`# Merge Status`

			`All LoRA adapters have been merged into the base model.`

			`No PEFT loading is required.`

			`---`

			`## Intended Use`

			`This model is designed for:`

			`- Structured transformation tasks`
			`- Chain-of-Thought reasoning`
			`- Preference-aligned generation`
			`- Academic research experiments`
			`- Competition submission`


			`## Research Notes`

			`This work explores multi-stage fine-tuning combining:`

			`- Structured SFT with CoT masking`
			`- Hard data refinement`
			`- Preference-based alignment via DPO`

			`The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.`


			`## License`

			`This model follows the license of the base model:`

			`Qwen/Qwen3-4B-Instruct-2507`

			`Users must comply with the original base model license.`

			`# Usage`

			```python
			`from transformers import AutoTokenizer, AutoModelForCausalLM`
			`import torch`

			`model_id = "HuiyuWang/dpo-qwen-cot-merged"`

			`tokenizer = AutoTokenizer.from_pretrained(model_id)`
			`model = AutoModelForCausalLM.from_pretrained(`
			`model_id,`
			`torch_dtype=torch.bfloat16,`
			`device_map="auto",`
			`)`

			`prompt = "Solve the following problem step by step: ..."`
			`inputs = tokenizer(prompt, return_tensors="pt")`

			`outputs = model.generate(`
			`**inputs,`
			`max_new_tokens=512,`
			`temperature=0.7,`
			`)`

			`print(tokenizer.decode(outputs[0], skip_special_tokens=True))`