Initialize project; model provided by the ModelHub XC community

Model: HuiyuWang/dpo-qwen-cot-merged
Source: Original Platform
ModelHub XC
2026-04-29 10:50:12 +08:00
commit 8a184ece4d
13 changed files with 152423 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,160 @@
---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- unsloth
- sft
- dpo
- cot
- alignment
---
# dpo-qwen-cot-merged
This repository provides a multi-stage fine-tuned version of **Qwen3-4B-Instruct-2507**.

The training pipeline consists of:

1. Supervised Fine-Tuning (SFT)
2. Stage-2 Hard SFT refinement
3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model.
This repository contains the **final merged full-precision weights**.

---
# Training Pipeline
## Stage 1 — Supervised Fine-Tuning (SFT)
Base model: `Qwen/Qwen3-4B-Instruct-2507`

Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`

Configuration:
- Method: QLoRA (4-bit, Unsloth)
- LoRA: r=64, alpha=128
- Max sequence length: 512
- Epochs: 2
- Learning rate: 1e-4
- Batch size: 2
- Gradient accumulation: 8
- Warmup ratio: 0.05
- Weight decay: 0.0
- Seed: 3407
- CoT masking: Enabled (loss applied only to final outputs)
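
The CoT masking setting above can be illustrated with a minimal sketch. It assumes the common convention that label positions set to `-100` are skipped by the cross-entropy loss (the PyTorch default `ignore_index`). The helper name `mask_cot_labels` and the `output_start` index are hypothetical; the actual training script may locate the final output differently.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default

def mask_cot_labels(input_ids, output_start):
    """Return labels where every token before `output_start` is ignored.

    `output_start` is the (hypothetical) index of the first final-output
    token; prompt and chain-of-thought tokens come before it.
    """
    if not 0 <= output_start <= len(input_ids):
        raise ValueError("output_start must lie within the sequence")
    return [IGNORE_INDEX] * output_start + list(input_ids[output_start:])

# Toy sequence: 3 prompt tokens, 4 CoT tokens, 2 final-output tokens.
token_ids = [11, 12, 13, 21, 22, 23, 24, 31, 32]
labels = mask_cot_labels(token_ids, output_start=7)
print(labels)
```

Only the last two positions keep their token ids as labels, so only the final output contributes to the loss in this toy case.
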
---
## Stage 2 — Hard Data Refinement
Dataset: `daichira/structured-hard-sft-4k`

Configuration:
- Epochs: 1
- Learning rate: 3e-5
- Same LoRA configuration as Stage 1 (r=64, alpha=128)

This stage targets robustness on difficult structured transformation tasks.

---
## Stage 3 — Direct Preference Optimization (DPO)
Dataset: `u-10bei/dpo-dataset-qwen-cot`

Configuration:
- Method: DPO via TRL + Unsloth
- LoRA: r=8, alpha=16
- Learning rate: 1e-7
- Beta: 0.1
- Max sequence length: 1024
- Max prompt length: 512
- Epochs: 1
- Optimizer: adamw_8bit
- Batch size: 2
- Gradient accumulation: 4
- Warmup ratio: 0.1
- Weight decay: 0.01
- Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.

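
To make the role of `Beta: 0.1` and the (prompt, chosen, rejected) triples concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It works on summed sequence log-probabilities and is not the TRL implementation used for training; the function name `dpo_loss` and the example numbers are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss equals log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: lower loss.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(loss_at_init, loss_after_shift)
```

A small `beta` such as 0.1 scales the margin down, so the policy can drift further from the reference before the loss saturates.
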
---
# Merge Status
All LoRA adapters have been merged into the base model.
No PEFT loading is required.

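
As a rough illustration of what merging means, the sketch below folds a LoRA update into a weight matrix using the standard LoRA formula `W + (alpha / r) * B @ A`. It is a toy example in plain Python, not the merge code used for this repository, and it assumes the standard `alpha / r` scaling; the helper names `matmul` and `merge_lora` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def merge_lora(weight, lora_b, lora_a, r, alpha):
    """Return W + (alpha / r) * (B @ A) as a new nested list."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 weight with a rank-1 adapter. r=1, alpha=2 gives the same scale
# of 2.0 as r=64/alpha=128 and r=8/alpha=16 in the configurations above.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (2, r)
A = [[0.5, -0.5]]    # shape (r, 2)
merged = merge_lora(W, B, A, r=1, alpha=2)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the weights load like any ordinary checkpoint.
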
---
## Intended Use
This model is designed for:
- Structured transformation tasks
- Chain-of-Thought reasoning
- Preference-aligned generation
- Academic research experiments
- Competition submission
## Research Notes
This work explores multi-stage fine-tuning combining:
- Structured SFT with CoT masking
- Hard data refinement
- Preference-based alignment via DPO

Training was performed with the Unsloth library for memory-efficient 4-bit fine-tuning.
## License
This model follows the license of the base model, `Qwen/Qwen3-4B-Instruct-2507` (declared as `apache-2.0` in the metadata above).
Users must comply with the original base model license.
# Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

# Load the merged weights directly; no PEFT adapter loading is needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
# Move the input tensors to the device that holds the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature to take effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```