ModelHub XC 2c77b1ba39 Initialize project; model provided by the ModelHub XC community
Model: W-61/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312
Source: Original Platform
2026-05-29 00:58:20 +08:00

  • library_name: transformers
  • base_model: llama-3-8b-base-sft-hh-helpful-4xh200-batch-64
  • tags: alignment-handbook, margin-dpo, generated_from_trainer
  • datasets: Anthropic/hh-rlhf
  • model-index name: llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312

llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312

This model is a fine-tuned version of llama-3-8b-base-sft-hh-helpful-4xh200-batch-64 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4055
  • Margin Dpo/beta: 0.1000
  • Margin Dpo/loss Margin Mean: 21.7395
  • Margin Dpo/beta Margin Mean: 2.1740
  • Margin Dpo/beta Margin Std: 2.6342
  • Margin Dpo/beta Margin Grad Mean: -0.2573
  • Margin Dpo/beta Margin Grad Std: 0.2541
  • Margin Dpo/margin Mean: 21.7395
  • Margin Dpo/margin Std: 26.3422
  • Logps/chosen: -105.8801
  • Logps/rejected: -135.3665
  • Logps/ref Chosen: -79.0510
  • Logps/ref Rejected: -86.7979
  • Logits/chosen: -0.6200
  • Logits/rejected: -0.5940
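
The logged margin figures can be cross-checked against the log-probabilities listed above. The sketch below assumes the margin is the usual DPO-style implicit reward gap, (policy minus reference on the chosen response) minus (policy minus reference on the rejected response). The card does not state this definition, and the helper name `implicit_margin` is hypothetical. Because the definition is linear, applying it to the reported means gives the mean margin.

```python
# Final evaluation values copied from the metric list above.
logps_chosen = -105.8801
logps_rejected = -135.3665
ref_logps_chosen = -79.0510
ref_logps_rejected = -86.7979
beta = 0.1


def implicit_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
    """ASSUMED definition: DPO-style log-ratio gap between chosen and rejected."""
    chosen_log_ratio = policy_chosen - ref_chosen
    rejected_log_ratio = policy_rejected - ref_rejected
    return chosen_log_ratio - rejected_log_ratio


margin = implicit_margin(
    logps_chosen, logps_rejected, ref_logps_chosen, ref_logps_rejected
)
scaled_margin = beta * margin

print(f"margin mean        ~ {margin:.4f}")   # card reports 21.7395
print(f"beta * margin mean ~ {scaled_margin:.3f}")  # card reports 2.1740
```

Under that assumption the recomputed values agree with the reported Margin Dpo/margin Mean and Margin Dpo/beta Margin Mean. The reported loss cannot be recovered the same way, since a mean of per-example losses is not a function of the mean margin alone.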

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
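
The batch-size totals and the learning-rate schedule in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: the total step count of 681 is an estimate inferred from the Epoch and Step values in the training results, the warmup length is taken as the ceiling of ratio times total steps, and `lr_at` is a hypothetical helper that reproduces the general shape of linear warmup followed by half-cosine decay.

```python
import math

# Values copied from the hyperparameter list.
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
num_devices = 4
gradient_accumulation_steps = 2

# Effective batch sizes: gradient accumulation applies to training only.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
total_eval_batch_size = per_device_eval_batch_size * num_devices

PEAK_LR = 5e-7
WARMUP_RATIO = 0.1
TOTAL_STEPS = 681  # ASSUMPTION: estimated from the results table, not stated in the card


def lr_at(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then half-cosine decay to zero (illustrative)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size, total_eval_batch_size)
for step in (0, 100, 300, 600):
    print(step, f"{lr_at(step):.3e}")
```

The two products match the reported total_train_batch_size of 64 and total_eval_batch_size of 32.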

Training results

| Training Loss | Epoch | Step | Validation Loss | Margin Dpo/beta | Margin Dpo/loss Margin Mean | Margin Dpo/beta Margin Mean | Margin Dpo/beta Margin Std | Margin Dpo/beta Margin Grad Mean | Margin Dpo/beta Margin Grad Std | Margin Dpo/margin Mean | Margin Dpo/margin Std | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected |
| 0.9045 | 0.1468 | 100 | 0.5612 | 0.1000 | 8.4438 | 0.8444 | 1.5441 | -0.3672 | 0.2312 | 8.4438 | 15.4407 | -87.2143 | -103.4049 | -79.0510 | -86.7979 | -0.6444 | -0.6169 |
| 0.6573 | 0.2937 | 200 | 0.4777 | 0.1000 | 14.6731 | 1.4673 | 2.1103 | -0.3106 | 0.2466 | 14.6731 | 21.1032 | -92.9744 | -115.3944 | -79.0510 | -86.7979 | -0.6438 | -0.6210 |
| 0.7096 | 0.4405 | 300 | 0.4405 | 0.1000 | 18.1127 | 1.8113 | 2.3747 | -0.2825 | 0.2514 | 18.1127 | 23.7469 | -100.1293 | -125.9889 | -79.0510 | -86.7979 | -0.6160 | -0.5900 |
| 0.4494 | 0.5874 | 400 | 0.4219 | 0.1000 | 20.1798 | 2.0180 | 2.5367 | -0.2694 | 0.2538 | 20.1798 | 25.3668 | -101.8411 | -129.7678 | -79.0510 | -86.7979 | -0.6053 | -0.5765 |
| 0.3799 | 0.7342 | 500 | 0.4100 | 0.1000 | 21.6333 | 2.1633 | 2.6378 | -0.2586 | 0.2554 | 21.6333 | 26.3782 | -106.3940 | -135.7742 | -79.0510 | -86.7979 | -0.6186 | -0.5922 |
| 0.4868 | 0.8811 | 600 | 0.4055 | 0.1000 | 21.7395 | 2.1740 | 2.6342 | -0.2573 | 0.2541 | 21.7395 | 26.3422 | -105.8801 | -135.3665 | -79.0510 | -86.7979 | -0.6200 | -0.5940 |
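
The Epoch and Step columns above allow a rough estimate of the number of optimizer steps in the single training epoch, and from that the approximate number of preference pairs seen. This is a back-of-the-envelope sketch: the result depends on how the trainer rounds the final partial batch, and neither figure is stated in the card.

```python
# (epoch fraction, optimizer step) pairs copied from the results table.
checkpoints = [
    (0.1468, 100),
    (0.2937, 200),
    (0.4405, 300),
    (0.5874, 400),
    (0.7342, 500),
    (0.8811, 600),
]
total_train_batch_size = 64  # from the hyperparameter list

# Each row gives an independent estimate of steps per epoch.
estimates = [step / epoch for epoch, step in checkpoints]
steps_per_epoch = round(sum(estimates) / len(estimates))

# Approximate count; the last step of the epoch may hold a partial batch.
approx_training_pairs = steps_per_epoch * total_train_batch_size

print(f"estimated steps per epoch: {steps_per_epoch}")
print(f"approximate training pairs: {approx_training_pairs}")
```

All six rows agree on about 681 steps per epoch, so the last logged evaluation at step 600 covers roughly 88% of the run.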

Framework versions

  • Transformers 4.51.0
  • PyTorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4