Initialize project; model provided by the ModelHub XC community
Model: HuiyuWang/dpo-qwen-cot-merged Source: Original Platform
160
README.md
Normal file
@@ -0,0 +1,160 @@
---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- unsloth
- sft
- dpo
- cot
- alignment
---

# dpo-qwen-cot-merged

This repository provides a multi-stage fine-tuned version of **Qwen3-4B-Instruct-2507**.

The training pipeline consists of:

1. Supervised Fine-Tuning (SFT)
2. Stage-2 Hard SFT refinement
3. Direct Preference Optimization (DPO)

The LoRA adapters have been merged into the base model.
This repository contains the **final merged full-precision weights**.

---

# Training Pipeline

## Stage 1 — Supervised Fine-Tuning (SFT)

Base model: `Qwen/Qwen3-4B-Instruct-2507`
Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`

Configuration:

- Method: QLoRA (4-bit, Unsloth)
- LoRA: r=64, alpha=128
- Max sequence length: 512
- Epochs: 2
- Learning rate: 1e-4
- Batch size: 2
- Gradient accumulation: 8
- Warmup ratio: 0.05
- Weight decay: 0.0
- Seed: 3407
- CoT masking: Enabled (loss applied only to final outputs)
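
The CoT masking setting above can be illustrated with a minimal sketch. It assumes the common convention of setting masked label positions to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that only the final-output tokens contribute to the loss. The token ids, the `mask_labels` helper, and the way the answer boundary is located are hypothetical; the README does not describe how the actual training code finds the start of the final output.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_labels(input_ids, answer_start):
    """Copy input_ids into labels, hiding every position before answer_start."""
    if not 0 <= answer_start <= len(input_ids):
        raise ValueError("answer_start must lie within the sequence")
    return [IGNORE_INDEX] * answer_start + list(input_ids[answer_start:])


# Hypothetical token ids: 3 prompt tokens, 4 reasoning (CoT) tokens, 2 final-output tokens.
prompt_ids = [101, 102, 103]
cot_ids = [201, 202, 203, 204]
final_ids = [301, 302]

input_ids = prompt_ids + cot_ids + final_ids
labels = mask_labels(input_ids, answer_start=len(prompt_ids) + len(cot_ids))

# Prompt and CoT positions are masked; only the final output keeps its labels.
print(labels)  # [-100, -100, -100, -100, -100, -100, -100, 301, 302]
```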

---

## Stage 2 — Hard Data Refinement

Dataset: `daichira/structured-hard-sft-4k`

Configuration:

- Epochs: 1
- Learning rate: 3e-5
- Same LoRA configuration as Stage 1

This stage improves robustness on difficult structured transformation tasks.

---

## Stage 3 — Direct Preference Optimization (DPO)

Dataset: `u-10bei/dpo-dataset-qwen-cot`

Configuration:

- Method: DPO via TRL + Unsloth
- LoRA: r=8, alpha=16
- Learning rate: 1e-7
- Beta: 0.1
- Max sequence length: 1024
- Max prompt length: 512
- Epochs: 1
- Optimizer: adamw_8bit
- Batch size: 2
- Gradient accumulation: 4
- Warmup ratio: 0.1
- Weight decay: 0.01
- Seed: 42

The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.
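
To make the role of `beta` concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single (prompt, chosen, rejected) triple. It assumes the standard formulation; the README does not say which loss variant was used in training. The `dpo_loss` helper and the log-probability values are hypothetical, and each argument stands for the summed log-probability of a full response under the policy or the frozen reference model. When the policy equals the reference the margin is zero and the loss is log 2 (about 0.693); it falls as the policy favors the chosen response more strongly than the reference does.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one preference pair, from summed sequence log-probs."""
    # How much more the policy favors each response than the reference model does.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities, for illustration only.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # policy now prefers the chosen answer

print(f"loss at initialization: {at_init:.4f}")
print(f"loss after improvement: {improved:.4f}")
```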

---

# Merge Status

All LoRA adapters have been merged into the base model.

No PEFT loading is required.
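
As a rough illustration of what merging means, the sketch below folds a LoRA update into a weight matrix using the usual rule W' = W + (alpha / r) * B A. It uses plain Python lists and a toy rank-1 adapter rather than the real tensors. The `matmul` and `merge_lora` helpers and all matrix values are hypothetical, and the README does not say how or in what order the stage adapters were actually merged. Note that both adapter configurations listed above (r=64, alpha=128 and r=8, alpha=16) give the same scale, alpha / r = 2.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a) as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 weight with a rank-1 adapter (r=1, alpha=2, so scale = 2.0).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r, in_features)
lora_b = [[0.5], [0.0]]    # shape (out_features, r)

merged = merge_lora(weight, lora_a, lora_b, r=1, alpha=2)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```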

---

## Intended Use

This model is designed for:

- Structured transformation tasks
- Chain-of-Thought reasoning
- Preference-aligned generation
- Academic research experiments
- Competition submission

## Research Notes

This work explores multi-stage fine-tuning combining:

- Structured SFT with CoT masking
- Hard data refinement
- Preference-based alignment via DPO

The training was performed using the Unsloth library for memory-efficient 4-bit fine-tuning.
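
A back-of-the-envelope calculation shows why 4-bit loading matters for a model of this size. The sketch assumes a nominal 4 billion parameters, taken from the "4B" in the base model name rather than an exact count, and it covers only the storage for the base weights. Quantization constants, LoRA parameters, activations, and optimizer state are ignored, so real memory use is higher. The `weight_memory_gib` helper is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 4_000_000_000  # assumption: nominal "4B" from the model name

for label, bits in [("bf16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
# bf16: 7.45 GiB
# 4-bit: 1.86 GiB
```

Under these assumptions the quantized base weights take a quarter of the bf16 footprint, which is what leaves room for adapters and activations on a smaller GPU.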

## License

This model follows the license of the base model:

Qwen/Qwen3-4B-Instruct-2507

Users must comply with the original base model license.

# Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuiyuWang/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve the following problem step by step: ..."
# Move the tokenized inputs to the device the model was placed on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```