---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- dpo_train_brushed_v4_balanced.json
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- unsloth
- qwen
- alignment
---
# The University of Tokyo Matsuo-Iwasawa Lab: Large Language Model Advanced Course 2025-2026
## Author and Acknowledgments
- **Author:** Toshiki Demizu (出水 利樹) — GitHub/Hugging Face ID: [@demimomi](https://huggingface.co/demimomi)
- **Affiliation:** SoftBank Corp., MONET Technologies Inc.
- **Course:** Large Language Model Development Advanced Course (Winter 2025-2026)
- **Participants:** 3,800
## Main Competition (February 2 to March 2, 2026)
- **Status:**
  - As of February 8, 2026: ranked 293rd (497 submitters so far), score 0.70044
  - As of February 11, 2026: ranked 261st (646 submitters so far), score 0.73407
  - https://huggingface.co/demimomi/demimomi44taomax-qwen3-4b-structured-output-lora
  - Training and inference were run on an A100 GPU, because T4/TPU runtimes quickly hit the daily usage limit.
- **Rules:**
  - Baseline score: 0.7 (cannot be cleared just by mindlessly rerunning the provided code)
  - The model and implementation must be runnable on Google Colab
  - Evaluation uses StructEval (Text) only
  - Submissions consist of the inference-result JSON and the model URL on Hugging Face
  - Only models and data designated by the organizers may be used
  - Submitting to Omnicampus triggers automatic scoring and ranking
# demimomi-max44-qwen3-4b-dpo-qwen-cot-merged (0.70044-point version)
## Model
This model is a fine-tuned version of **Qwen/Qwen3-4B-Instruct-2507** using **Direct Preference Optimization (DPO)** via the **Unsloth** library.
This repository contains the **full-merged 16-bit weights**. No adapter loading is required.
## Training Objective
This model has been optimized using DPO to align its responses with preferred outputs, focusing on improving reasoning (Chain-of-Thought) and structured response quality based on the provided preference dataset.
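For reference, DPO trains directly on preference pairs by minimizing the standard pairwise logistic loss (Rafailov et al., 2023), with $\beta = 0.05$ here (see the configuration below):

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen base model, and $y_w$ / $y_l$ are the chosen and rejected responses from the preference dataset. A larger $\beta$ keeps the policy closer to the reference; the small value used here permits relatively large shifts toward preferred outputs.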
## Training Configuration
- **Base model**: Qwen/Qwen3-4B-Instruct-2507
- **Method**: DPO (Direct Preference Optimization)
- **Epochs**: 2
- **Learning rate**: 1e-06
- **Beta**: 0.05
- **Max sequence length**: 1536
- **LoRA Config**: r=8, alpha=16 (merged into base)
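The training script itself is not included in this repository; the following is a minimal sketch of what the configuration above corresponds to in an Unsloth + TRL `DPOTrainer` workflow. The dataset column names, LoRA target modules, and save path are assumptions.
```python
# Illustrative sketch only, not the author's actual script.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

max_seq_length = 1536

# Load the base model with Unsloth and attach LoRA adapters (r=8, alpha=16).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=max_seq_length,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    # Assumption: the usual attention/MLP projection targets.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumption: the JSON file holds "prompt" / "chosen" / "rejected" records,
# the format DPOTrainer expects for preference data.
dataset = load_dataset("json",
                       data_files="dpo_train_brushed_v4_balanced.json",
                       split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT model, TRL uses the adapter-disabled base as reference
    args=DPOConfig(
        num_train_epochs=2,
        learning_rate=1e-6,
        beta=0.05,
        max_length=max_seq_length,
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL versions
)
trainer.train()

# Merge the LoRA weights into the base model and save full 16-bit weights,
# matching the merged checkpoint published here (Unsloth helper).
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```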
## Usage
Since this is a merged model, you can use it directly with `transformers`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "demimomi/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Test inference
prompt = "Your question here"
# return_dict=True makes apply_chat_template return a mapping with input_ids
# and attention_mask, so it can be unpacked into generate(); without it the
# call returns a bare tensor and **inputs would fail.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Sources & License (IMPORTANT)
* **Training Data**: [dpo_train_brushed_v4_balanced.json]
* **Dataset License**: MIT License (per the dataset's terms); the model itself is released under Apache-2.0.
* **Compliance**: Users must follow the original base model's license terms.