---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- unsloth
- sft
- dpo
- cot
- alignment
---

# dpo-qwen-cot-merged

This repository provides a multi-stage fine-tuned version of **Qwen3-4B-Instruct-2507**.

The training pipeline consists of:

1. Supervised Fine-Tuning (SFT)
2. Stage-2 Hard SFT refinement
3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model.
This repository contains the **final merged full-precision weights**.

---

# Training Pipeline

## Stage 1 — Supervised Fine-Tuning (SFT)

Base model: `Qwen/Qwen3-4B-Instruct-2507`

Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`

Configuration:

- Method: QLoRA (4-bit, Unsloth)
- LoRA: r=64, alpha=128
- Max sequence length: 512
- Epochs: 2
- Learning rate: 1e-4
- Batch size: 2
- Gradient accumulation: 8
- Warmup ratio: 0.05
- Weight decay: 0.0
- Seed: 3407
- CoT masking: Enabled (loss applied only to final outputs)

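
The CoT masking setting restricts the loss to the final output. A minimal sketch of the idea follows, assuming labels use `-100` (the default `ignore_index` of PyTorch's cross-entropy loss) for positions that should not contribute to the loss. The helper `mask_labels_before` and the `answer_start` index are hypothetical illustrations; this card does not say how the boundary between reasoning and final output was located during training.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels_before(input_ids, answer_start):
    """Return labels that keep the loss only on tokens from answer_start onward.

    input_ids: list of token ids for prompt + reasoning + final output.
    answer_start: index of the first final-output token (hypothetical; how
    this boundary is found depends on the data format).
    """
    if not 0 <= answer_start <= len(input_ids):
        raise ValueError("answer_start must lie within the sequence")
    return [IGNORE_INDEX] * answer_start + list(input_ids[answer_start:])


# Toy example: 3 prompt tokens, 3 reasoning tokens, 2 final-output tokens.
input_ids = [11, 12, 13, 21, 22, 23, 31, 32]
labels = mask_labels_before(input_ids, answer_start=6)
print(labels)  # [-100, -100, -100, -100, -100, -100, 31, 32]
```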
---
## Stage 2 — Hard Data Refinement

Dataset: `daichira/structured-hard-sft-4k`

Configuration:

- Epochs: 1
- Learning rate: 3e-5
- Same LoRA configuration as Stage 1

This stage targets robustness on difficult structured transformation tasks.

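
Stage 2 reuses the Stage 1 LoRA configuration, so a few derived quantities carry over and can be checked with simple arithmetic. The sketch below computes the LoRA scaling factor `alpha / r` and the effective batch size per optimizer step for the Stage 1 values. It assumes standard LoRA scaling (not rsLoRA), that the listed batch size is per device, and that training ran on a single device; none of these is stated in this card, and the function names are illustrative only.

```python
def lora_scale(r, alpha):
    """Standard LoRA scaling applied to the low-rank update (assumes no rsLoRA)."""
    return alpha / r


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


# Stage 1 values from the configuration list (single device is an assumption).
stage1_scale = lora_scale(r=64, alpha=128)
stage1_batch = effective_batch_size(per_device_batch=2, grad_accum_steps=8)
print(stage1_scale, stage1_batch)  # 2.0 16
```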
---
## Stage 3 — Direct Preference Optimization (DPO)

Dataset: `u-10bei/dpo-dataset-qwen-cot`

Configuration:

- Method: DPO via TRL + Unsloth
- LoRA: r=8, alpha=16
- Learning rate: 1e-7
- Beta: 0.1
- Max sequence length: 1024
- Max prompt length: 512
- Epochs: 1
- Optimizer: adamw_8bit
- Batch size: 2
- Gradient accumulation: 4
- Warmup ratio: 0.1
- Weight decay: 0.01
- Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.

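
For a single (prompt, chosen, rejected) triple, the standard DPO loss is `-log(sigmoid(beta * (chosen_logratio - rejected_logratio)))`, where each log-ratio is the policy's sequence log-probability minus the reference model's. The sketch below evaluates it with beta = 0.1, as in the configuration list. The log-probability values are made-up numbers for illustration, not measurements from this model, and TRL's actual implementation works on batched tensors rather than Python floats.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard sigmoid DPO loss for one preference pair (sequence log-probs)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Illustrative values: the policy now prefers the chosen response more
# strongly than the reference does, so the loss drops.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(round(loss_at_init, 4))        # 0.6931
print(loss_improved < loss_at_init)  # True
```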
---
# Merge Status

All LoRA adapters have been merged into the base model.
No PEFT loading is required.

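
Merging folds each low-rank update into the corresponding base weight, which is why no adapter has to be loaded at inference time. A minimal pure-Python sketch of that arithmetic follows, assuming the standard LoRA formulation `W' = W + (alpha / r) * B @ A`. The tiny matrices are made up for illustration; real merges (for example PEFT's `merge_and_unload`) operate on full tensors, and this card does not state which tool performed the merge.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, r, alpha):
    """Return W + (alpha / r) * B @ A for W (out x in), B (out x r), A (r x in)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes: out=2, in=3, r=1 (values are illustrative only).
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # r x in
lora_b = [[0.5], [-1.0]]     # out x r
merged = merge_lora(w, lora_a, lora_b, r=1, alpha=2)

# The merged weight gives the same output as the base path plus the scaled
# adapter path (scale = alpha / r = 2.0), so the adapter can be dropped.
x = [[1.0], [1.0], [1.0]]    # in x 1 column vector
base_plus_adapter = [
    [b[0] + 2.0 * a[0]]
    for b, a in zip(matmul(w, x), matmul(lora_b, matmul(lora_a, x)))
]
assert matmul(merged, x) == base_plus_adapter
print(merged)  # [[2.0, 2.0, 5.0], [-2.0, -3.0, -6.0]]
```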
---
## Intended Use

This model is designed for:

- Structured transformation tasks
- Chain-of-Thought reasoning
- Preference-aligned generation
- Academic research experiments
- Competition submission

## Research Notes

This work explores multi-stage fine-tuning combining:

- Structured SFT with CoT masking
- Hard data refinement
- Preference-based alignment via DPO

The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.

## License

This model follows the license of the base model, `Qwen/Qwen3-4B-Instruct-2507` (declared as `apache-2.0` in this card's metadata).

Users must comply with the original base model license.

# Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

# Load the merged weights directly; no PEFT adapter is needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
# Move the tokenized inputs to the device the model was placed on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.7,
)

# The decoded text includes the prompt followed by the generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```