Initialize project; model provided by the ModelHub XC community

Model: HuiyuWang/dpo-qwen-cot-merged Source: Original Platform
2026-04-29 10:50:12 +08:00
commit 8a184ece4d
13 changed files with 152423 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,160 @@
+---
+base_model: Qwen/Qwen3-4B-Instruct-2507
+datasets:
+- u-10bei/structured_data_with_cot_dataset_512_v5
+- daichira/structured-hard-sft-4k
+- u-10bei/dpo-dataset-qwen-cot
+language:
+- en
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- qwen
+- unsloth
+- sft
+- dpo
+- cot
+- alignment
+---
+
+# dpo-qwen-cot-merged
+
+This repository provides a multi-stage fine-tuned version of **Qwen3-4B-Instruct-2507**.
+
+The training pipeline consists of:
+
+1. Supervised Fine-Tuning (SFT)
+2. Stage-2 Hard SFT refinement
+3. Direct Preference Optimization (DPO)
+
+The LoRA adapters have been merged into the base model.
+This repository contains the **final merged full-precision weights**.
+
+---
+
+# Training Pipeline
+
+## Stage 1 — Supervised Fine-Tuning (SFT)
+
+Base model: `Qwen/Qwen3-4B-Instruct-2507`  
+Dataset: `u-10bei/structured_data_with_cot_dataset_512_v5`
+
+Configuration:
+
+- Method: QLoRA (4-bit, Unsloth)
+- LoRA: r=64, alpha=128
+- Max sequence length: 512
+- Epochs: 2
+- Learning rate: 1e-4
+- Batch size: 2
+- Gradient accumulation: 8
+- Warmup ratio: 0.05
+- Weight decay: 0.0
+- Seed: 3407
+- CoT masking: Enabled (loss applied only to final outputs)
+
+---
+
+## Stage 2 — Hard Data Refinement
+
+Dataset: `daichira/structured-hard-sft-4k`
+
+Configuration:
+
+- Epochs: 1
+- Learning rate: 3e-5
+- Same LoRA configuration as Stage 1
+
+This stage is intended to improve robustness on difficult structured transformation tasks.
+
+---
+
+## Stage 3 — Direct Preference Optimization (DPO)
+
+Dataset: `u-10bei/dpo-dataset-qwen-cot`
+
+Configuration:
+
+- Method: DPO via TRL + Unsloth
+- LoRA: r=8, alpha=16
+- Learning rate: 1e-7
+- Beta: 0.1
+- Max sequence length: 1024
+- Max prompt length: 512
+- Epochs: 1
+- Optimizer: adamw_8bit
+- Batch size: 2
+- Gradient accumulation: 4
+- Warmup ratio: 0.1
+- Weight decay: 0.01
+- Seed: 42
+
+The objective is to align the model toward preferred Chain-of-Thought reasoning patterns using (prompt, chosen, rejected) preference data.
+
+---
+
+# Merge Status
+
+All LoRA adapters have been merged into the base model.
+
+The model loads directly with `transformers`; no PEFT adapter loading is required.
+
+---
+
+# Intended Use
+
+This model is designed for:
+
+- Structured transformation tasks  
+- Chain-of-Thought reasoning  
+- Preference-aligned generation  
+- Academic research experiments  
+- Competition submissions  
+
+
+# Research Notes
+
+This work explores multi-stage fine-tuning combining:
+
+- Structured SFT with CoT masking  
+- Hard data refinement  
+- Preference-based alignment via DPO  
+
+Training used the Unsloth library for memory-efficient 4-bit fine-tuning.
+
+
+# License
+
+This model follows the license of the base model, `Qwen/Qwen3-4B-Instruct-2507`, declared as `apache-2.0` in the metadata above.
+
+Users must comply with the original base model license.
+
+# Usage
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "HuiyuWang/dpo-qwen-cot-merged"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+
+prompt = "Solve the following problem step by step: ..."
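+# Move the input tensors to the device the model was placed on
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=512,
+    do_sample=True,  # sampling must be enabled for temperature to take effect
+    temperature=0.7,
+)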
+
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
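
The Stage 3 section of the README above describes aligning the model with (prompt, chosen, rejected) preference data at `Beta: 0.1`. As a rough illustration of what that objective computes, here is a minimal sketch of the standard per-example DPO loss in plain Python. The function name `dpo_loss` and the log-probability values are hypothetical and chosen only for illustration; this is not the TRL code path the author used, and it assumes the sequence log-probabilities have already been summed over tokens.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model (assumed inputs).
    """
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy equals reference
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0))   # policy favors chosen more
    print(dpo_loss(-11.0, -11.0, -10.0, -12.0))  # policy favors chosen less
```

When the policy matches the reference the margin is zero and the loss equals log 2; the loss falls below that as the policy widens its preference for the chosen response, and rises above it in the opposite case. A small beta such as 0.1 scales the margin down, so the policy is pushed away from the reference only gently.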