Model: Raghav-Singhal/dpo-tulu3-lr5e-7-tulu3sft-100B-normal-fixed-off-policy-if Source: Original Platform
| library_name | pipeline_tag | model_name | base_model | tags | license |
|---|---|---|---|---|---|
| transformers | text-generation | dpo-tulu3-lr5e-7-tulu3sft-100B-normal-fixed-off-policy-if | tulu3-normal-fixed-smollm-1p7b-100B-20n-2048sl-960gbsz-4n-gbs128 | | other |
Model Card for dpo-tulu3-lr5e-7-tulu3sft-100B-normal-fixed-off-policy-if
This repository contains a DPO fine-tune of the local SFT checkpoint
tulu3-normal-fixed-smollm-1p7b-100B-20n-2048sl-960gbsz-4n-gbs128.
The final model weights are stored at the repository root. Intermediate
training checkpoints are also included under checkpoint-500, checkpoint-1000,
and checkpoint-1270.
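As a rough sketch of how one of the intermediate checkpoints might be loaded, the snippet below passes the checkpoint folder name through the `subfolder` argument of `from_pretrained`. The helper names `checkpoint_subfolder` and `load_checkpoint` are hypothetical, and the snippet assumes each checkpoint folder contains a complete set of model weights, which this card does not state explicitly.

```python
REPO_ID = "Raghav-Singhal/dpo-tulu3-lr5e-7-tulu3sft-100B-normal-fixed-off-policy-if"
CHECKPOINT_STEPS = (500, 1000, 1270)  # steps named in this model card


def checkpoint_subfolder(step: int) -> str:
    """Return the repository subfolder for a listed training step (hypothetical helper)."""
    if step not in CHECKPOINT_STEPS:
        raise ValueError(
            f"unknown checkpoint step {step}; expected one of {CHECKPOINT_STEPS}"
        )
    return f"checkpoint-{step}"


def load_checkpoint(step: int):
    """Load the model weights saved at an intermediate training step."""
    # Imported here so checkpoint_subfolder works without transformers installed.
    from transformers import AutoModelForCausalLM

    # Assumption: the checkpoint folder holds full model weights, not only
    # optimizer or trainer state.
    return AutoModelForCausalLM.from_pretrained(
        REPO_ID, subfolder=checkpoint_subfolder(step)
    )
```

Calling `load_checkpoint(500)` would then fetch the weights under `checkpoint-500`, while omitting `subfolder` loads the final model from the repository root.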
Quick start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline(
    "text-generation",
    model="Raghav-Singhal/dpo-tulu3-lr5e-7-tulu3sft-100B-normal-fixed-off-policy-if",
    device="cuda",
)
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=128,
    return_full_text=False,
)[0]
print(output["generated_text"])
```
Training procedure
This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
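For readers unfamiliar with the objective, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It is an illustration of the published formula, not the training code used for this model. The function name `dpo_loss` is hypothetical, and `beta=0.1` is an assumed value, since this card does not report the one used. Each argument is the summed log-probability of a full response under either the policy being trained or the frozen reference (SFT) model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair; beta=0.1 is an assumed default."""
    # Implicit rewards: scaled log-ratio of policy to reference for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.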
Framework versions
- TRL: 1.0.0
- Transformers: 4.57.6
- PyTorch: 2.10.0a0+b4e4ee81d3.nv25.12
- Datasets: 4.8.4
- Tokenizers: 0.22.1