Model: Yurori/qwen3-4b-dpo-qwen-cot-merged Source: Original Platform
base_model, datasets, language, license, library_name, pipeline_tag, tags
| base_model | datasets | language | license | library_name | pipeline_tag | tags | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3-4B-Instruct-2507 |
|
|
apache-2.0 | transformers | text-generation |
|
qwen3-4b-dpo-qwen-cot-merged
This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via the Unsloth library. This repository contains the full-merged 16-bit weights. No adapter loading is required.
Training Configuration
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Method: DPO
- Epochs: 1
- Learning rate: 1e-07
- Beta: 0.1
- Max sequence length: 1024
- LoRA Config: r=8, alpha=16 (merged into base)
Description
Languages
Jinja
100%