ModelHub XC 3b8f28089f Initialize project; model provided by the ModelHub XC community
Model: jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306
Source: Original Platform
2026-05-14 01:00:02 +08:00

  • library_name: transformers
  • base_model: jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452
  • tags: alignment-handbook, epsilon-dpo, generated_from_trainer
  • datasets: Anthropic/hh-rlhf
  • model-index: name: qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306; results: none listed

qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306

This model is a fine-tuned version of jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5845
  • Rewards/chosen: -0.6029
  • Rewards/rejected: -0.8720
  • Rewards/accuracies: 0.7568
  • Rewards/margins: 0.2691
  • Logps/chosen: -250.9089
  • Logps/rejected: -312.3157
  • Logps/ref Chosen: -100.4936
  • Logps/ref Rejected: -94.0678
  • Logits/chosen: -7.6179
  • Logits/rejected: -7.2013
  • Kl/p Epsilon Steps: 0.7466
  • Kl/n Epsilon Steps: 0.2530
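
The metrics above are related by simple arithmetic: the reported margin is the chosen reward minus the rejected reward. The following is a minimal sketch that checks this using only the numbers listed above. The variable names are illustrative, and reading the two Kl/... Epsilon Steps values as complementary fractions of steps is an assumption, not something the card states.

```python
# Evaluation metrics copied from the list above.
rewards_chosen = -0.6029
rewards_rejected = -0.8720
reported_margin = 0.2691

# Margin = chosen reward minus rejected reward (rounded to the card's precision).
margin = round(rewards_chosen - rewards_rejected, 4)

# Assumption: the two epsilon-step values are fractions of steps,
# so together they should come close to 1.
p_epsilon_steps = 0.7466
n_epsilon_steps = 0.2530
step_total = round(p_epsilon_steps + n_epsilon_steps, 4)

print(f"margin={margin} (reported {reported_margin})")
print(f"p + n epsilon steps = {step_total}")
```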

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
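
The batch-size totals follow from the per-device values, and the schedule settings imply a short warmup followed by cosine decay. The sketch below reproduces the totals and illustrates the learning-rate curve, assuming the usual linear-warmup-then-cosine-decay shape. The total step count is a hypothetical estimate derived from the Training results table (step 100 is reported at epoch 0.1468); the card does not state it, and `lr_at` is an illustrative helper, not the trainer's own code.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 8
eval_batch_size = 8
num_devices = 4
gradient_accumulation_steps = 2
peak_lr = 5e-07
warmup_ratio = 0.1

# Effective batch sizes across devices (and accumulation, for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# Hypothetical: steps per epoch estimated from "step 100 at epoch 0.1468".
total_steps = round(100 / 0.1468)
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size, total_eval_batch_size)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```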

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected | Kl/p Epsilon Steps | Kl/n Epsilon Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1171 | 0.1468 | 100 | 0.6201 | -0.3131 | -0.5532 | 0.6841 | 0.2401 | -104.8254 | -101.7906 | -100.4936 | -94.0678 | -3.1086 | -2.4245 | 0.6134 | 0.3857 |
| 0.9014 | 0.2937 | 200 | 0.5126 | -0.9392 | -1.5651 | 0.7620 | 0.6259 | -122.0199 | -130.1003 | -100.4936 | -94.0678 | -5.9108 | -5.5656 | 0.6905 | 0.3069 |
| 0.912 | 0.4405 | 300 | 0.4875 | -1.6289 | -2.3480 | 0.7787 | 0.7190 | -167.7714 | -191.3218 | -100.4936 | -94.0678 | -7.0070 | -6.7219 | 0.7509 | 0.2470 |
| 0.7868 | 0.5874 | 400 | 0.4938 | -1.5799 | -2.2534 | 0.7791 | 0.6735 | -222.8989 | -269.1398 | -100.4936 | -94.0678 | -7.4780 | -7.1188 | 0.7646 | 0.2342 |
| 1.0563 | 0.7342 | 500 | 0.5370 | -1.0135 | -1.4612 | 0.7667 | 0.4477 | -242.7305 | -299.7766 | -100.4936 | -94.0678 | -7.5872 | -7.1850 | 0.7491 | 0.2504 |
| 1.1296 | 0.8811 | 600 | 0.5845 | -0.6029 | -0.8720 | 0.7568 | 0.2691 | -250.9089 | -312.3157 | -100.4936 | -94.0678 | -7.6179 | -7.2013 | 0.7466 | 0.2530 |
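
In standard DPO logging, the reward is a scale factor β times the gap between the policy and reference log-probabilities. Assuming the same convention applies to these columns (the card does not confirm it), the table lets one back out an implied β at each evaluation step. The helper name `implied_beta` and the row layout below are illustrative only.

```python
# (step, rewards_chosen, rewards_rejected, logps_chosen, logps_rejected),
# copied from the results table above.
rows = [
    (100, -0.3131, -0.5532, -104.8254, -101.7906),
    (200, -0.9392, -1.5651, -122.0199, -130.1003),
    (300, -1.6289, -2.3480, -167.7714, -191.3218),
    (400, -1.5799, -2.2534, -222.8989, -269.1398),
    (500, -1.0135, -1.4612, -242.7305, -299.7766),
    (600, -0.6029, -0.8720, -250.9089, -312.3157),
]

# Reference log-probabilities are constant across all rows of the table.
REF_CHOSEN = -100.4936
REF_REJECTED = -94.0678


def implied_beta(reward: float, logp: float, ref_logp: float) -> float:
    """Solve reward = beta * (logp - ref_logp) for beta (assumed DPO convention)."""
    return reward / (logp - ref_logp)


betas = {}
for step, r_c, r_r, lp_c, lp_r in rows:
    b_c = implied_beta(r_c, lp_c, REF_CHOSEN)
    b_r = implied_beta(r_r, lp_r, REF_REJECTED)
    betas[step] = (b_c, b_r)
    print(f"step {step}: beta from chosen={b_c:.4f}, from rejected={b_r:.4f}")
```

If the assumption holds, the chosen and rejected columns yield nearly the same β at each checkpoint, and that value shrinks as training proceeds. This would explain why the reward magnitudes fall after step 300 even though the policy log-probabilities keep moving away from the reference.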

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4