llama-3-8b-base-margin-dpo-…/train.log

[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
2026-04-10 17:20:29 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8')
2026-04-10 17:20:29 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/scratch/feng.yulu/dynamic-dpo-v4/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, preprocessing_log_samples=0, preprocessing_log_dir=None)
2026-04-10 17:20:29 - INFO - __main__ - Training/evaluation parameters MarginDPOConfig(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
beta=0.1,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=True,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
dataset_num_proc=12,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_dropout=True,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=100,
eval_strategy=IntervalStrategy.STEPS,
eval_use_gather_object=False,
f_alpha_divergence_coef=1.0,
f_divergence_type=reverse_kl,
force_use_ref_model=False,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generate_during_eval=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_margin_dataset_id=W-61/llama-3-8b-base-margin-dpo-hh-helpful-margin-log,
hub_model_id=W-61/llama-3-8b-base-margin-dpo-hh-helpful,
hub_model_revision=main,
hub_private_repo=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
is_encoder_decoder=None,
jit_mode_eval=False,
label_names=None,
label_pad_token_id=-100,
label_smoothing=0.0,
label_smoothing_factor=0.0,
learning_rate=5e-07,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=outputs/llama-3-8b-base-margin-dpo-hh-helpful/runs/Apr10_17-20-28_d4054,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=5,
logging_strategy=IntervalStrategy.STEPS,
loss_type=sigmoid,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
margin_dataset_private=None,
margin_dataset_split=train,
margin_log_path=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/margin_logs,
margin_log_steps=1,
margin_save_full=True,
max_grad_norm=1.0,
max_length=512,
max_prompt_length=256,
max_steps=-1,
max_target_length=None,
metric_for_best_model=None,
model_adapter_name=None,
model_init_kwargs=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
non_finite_logits_handling=error,
num_train_epochs=1,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009,
overwrite_output_dir=False,
padding_value=None,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=16,
post_tokenization_log_dir=None,
post_tokenization_log_samples=0,
precompute_ref_batch_size=None,
precompute_ref_eval_batch_size=None,
precompute_ref_log_probs=False,
prediction_loss_only=False,
push_margin_dataset=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
ref_adapter_name=None,
ref_model_init_kwargs=None,
ref_model_mixup_alpha=0.9,
ref_model_sync_steps=64,
reference_free=False,
remove_unused_columns=False,
report_to=['wandb'],
require_explicit_ref_model=True,
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
reuse_tokenized_dataset=True,
rpo_alpha=None,
run_name=llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=SaveStrategy.STEPS,
save_total_limit=2,
seed=42,
sft_weight=0.0,
skip_memory_metrics=True,
sync_ref_model=False,
tf32=None,
tokenization_batch_size=128,
tokenization_mode=online,
tokenized_dataset_cache_dir=/scratch/feng.yulu/dynamic-dpo-v4/tokenized_preferences,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tp_size=0,
tpu_metrics_debug=False,
tpu_num_cores=None,
trainer_type=margin_dpo,
truncation_mode=keep_end,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
2026-04-10 17:20:29 - INFO - __main__ - Margin-DPO parameters: beta=0.1, f_divergence_type=reverse_kl, margin_log_steps=1
2026-04-10 17:20:29 - INFO - __main__ - Using persistent HF datasets cache at /scratch/feng.yulu/dynamic-dpo-v4/hf/datasets
2026-04-10 17:20:32 - WARNING - __main__ - Dropped 237 non-canonical HH preference examples from split `train` before normalization (126 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 111 x HH chosen/rejected transcripts must each contain a divergent assistant response.).

Normalizing raw HH preferences (train): 100%|██████████| 43598/43598 [00:04<00:00, 10649.37 examples/s]

2026-04-10 17:20:37 - WARNING - __main__ - Dropped 15 non-canonical HH preference examples from split `test` before normalization (9 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 6 x HH chosen/rejected transcripts must each contain a divergent assistant response.).
Normalizing raw HH preferences (test): 100%|██████████| 2339/2339 [00:00<00:00, 10833.69 examples/s]
2026-04-10 17:20:38 - INFO - __main__ - Training on the following splits: ['train : 43598', 'test : 2339']
[INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file chat_template.jinja

[INFO|tokenization_utils_base.py:2323] 2026-04-10 17:20:38,599 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs53bc9c1b4b841a8900001c1e'

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 43598/43598 [00:06<00:00, 6716.54 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs7b84796a9fd69ee900001c25'

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfsb37b9e4888f2c94c00001c31'

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs3f877014c1ec9e2300001c32'

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs68e3b5ca8fddee3300001c38'

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 43598/43598 [00:07<00:00, 6125.00 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs53c351a77cd48be800001c39'

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 43598/43598 [00:07<00:00, 8489.03 examples/s] 
Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 43598/43598 [00:07<00:00, 6064.07 examples/s]

Formatting comparisons with prompt template (num_proc=12):  98%|█████████▊| 42719/43598 [00:06<00:00, 10914.69 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfse2d4363274eb081700001c3b'

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 43598/43598 [00:07<00:00, 6120.16 examples/s]

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs3a3afdae81875d9a00001c3d'

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 43598/43598 [00:07<00:00, 6150.20 examples/s] 

Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Formatting comparisons with prompt template (num_proc=12):   8%|▊         | 195/2339 [00:00<00:08, 249.99 examples/s]
Formatting comparisons with prompt template (num_proc=12):   7%|▋         | 155/2339 [00:00<00:13, 161.72 examples/s]
Formatting comparisons with prompt template (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Formatting comparisons with prompt template (num_proc=12):  14%|█▍        | 333/2339 [00:01<00:06, 331.56 examples/s]
Formatting comparisons with prompt template (num_proc=12):  14%|█▍        | 333/2339 [00:01<00:06, 299.06 examples/s]
Formatting comparisons with prompt template (num_proc=12):   4%|▍         | 104/2339 [00:00<00:19, 117.59 examples/s]
Formatting comparisons with prompt template (num_proc=12):  19%|█▊        | 434/2339 [00:01<00:05, 371.39 examples/s]
Formatting comparisons with prompt template (num_proc=12):  19%|█▊        | 433/2339 [00:01<00:05, 345.35 examples/s]
Formatting comparisons with prompt template (num_proc=12):   8%|▊         | 195/2339 [00:01<00:10, 201.45 examples/s]
Formatting comparisons with prompt template (num_proc=12):   2%|▏         | 53/2339 [00:01<00:47, 48.16 examples/s]
Formatting comparisons with prompt template (num_proc=12):   3%|▎         | 72/2339 [00:00<00:28, 78.31 examples/s]
Formatting comparisons with prompt template (num_proc=12):  27%|██▋       | 629/2339 [00:01<00:03, 523.85 examples/s]
Formatting comparisons with prompt template (num_proc=12):  14%|█▍        | 324/2339 [00:01<00:06, 296.55 examples/s]
Formatting comparisons with prompt template (num_proc=12):  26%|██▌       | 613/2339 [00:01<00:04, 419.27 examples/s]
Formatting comparisons with prompt template (num_proc=12):  19%|█▊        | 436/2339 [00:01<00:04, 419.81 examples/s]
Formatting comparisons with prompt template (num_proc=12):   8%|▊         | 179/2339 [00:01<00:12, 171.74 examples/s]
Formatting comparisons with prompt template (num_proc=12):   2%|▏         | 51/2339 [00:01<00:50, 45.00 examples/s]
Formatting comparisons with prompt template (num_proc=12):   2%|▏         | 48/2339 [00:01<00:49, 45.93 examples/s]
Formatting comparisons with prompt template (num_proc=12):  42%|████▏     | 975/2339 [00:01<00:01, 778.67 examples/s]
Formatting comparisons with prompt template (num_proc=12):  25%|██▌       | 585/2339 [00:01<00:03, 546.03 examples/s]
Formatting comparisons with prompt template (num_proc=12):  48%|████▊     | 1128/2339 [00:01<00:01, 899.97 examples/s]
Formatting comparisons with prompt template (num_proc=12):  24%|██▍       | 561/2339 [00:01<00:03, 451.08 examples/s]
Formatting comparisons with prompt template (num_proc=12):  10%|█         | 240/2339 [00:01<00:09, 215.55 examples/s]
Formatting comparisons with prompt template (num_proc=12):  10%|▉         | 228/2339 [00:01<00:09, 225.68 examples/s]
Formatting comparisons with prompt template (num_proc=12):  33%|███▎      | 780/2339 [00:01<00:02, 630.78 examples/s]
Formatting comparisons with prompt template (num_proc=12):  56%|█████▌    | 1302/2339 [00:02<00:01, 809.69 examples/s]
Formatting comparisons with prompt template (num_proc=12):  11%|█         | 254/2339 [00:01<00:12, 163.78 examples/s]
Formatting comparisons with prompt template (num_proc=12):  52%|█████▏    | 1225/2339 [00:02<00:01, 693.84 examples/s]
Formatting comparisons with prompt template (num_proc=12):  17%|█▋        | 390/2339 [00:01<00:05, 347.75 examples/s]
Formatting comparisons with prompt template (num_proc=12):  19%|█▊        | 435/2339 [00:01<00:05, 358.05 examples/s]
Formatting comparisons with prompt template (num_proc=12):  35%|███▌      | 827/2339 [00:01<00:02, 530.69 examples/s]
Formatting comparisons with prompt template (num_proc=12):  68%|██████▊   | 1591/233
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs926b8a223a5cf1a600001c71'

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 737.85 examples/s] 

Formatting comparisons with prompt template (num_proc=12):  56%|█████▌    | 1301/2339 [00:02<00:01, 666.04 examples/s]
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[WARNING|logging.py:328] 2026-04-10 17:20:49,022 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.

Formatting comparisons with prompt template (num_proc=12):  97%|█████████▋| 2258/2339 [00:03<00:00, 845.34 examples/s] 
Formatting comparisons with prompt template (num_proc=12):  56%|█████▌    | 1304/2339 [00:02<00:01, 761.90 examples/s]
Formatting comparisons with prompt template (num_proc=12):  39%|███▉      | 918/2339 [00:02<00:02, 626.02 examples/s]
Formatting comparisons with prompt template (num_proc=12):  54%|█████▍    | 1274/2339 [00:02<00:01, 734.16 examples/s]
Formatting comparisons with prompt template (num_proc=12):  71%|███████   | 1663/2339 [00:02<00:00, 727.41 examples/s]
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 558.00it/s]

Formatting comparisons with prompt template (num_proc=12):  83%|████████▎ | 1950/2339 [00:03<00:00, 935.72 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfse8192e985f4610ed00001c7a'

Formatting comparisons with prompt template (num_proc=12):  64%|██████▍   | 1496/2339 [00:02<00:01, 720.45 examples/s]
Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 682.72 examples/s]

Formatting comparisons with prompt template (num_proc=12):  83%|████████▎ | 1950/2339 [00:03<00:00, 997.67 examples/s]
Formatting comparisons with prompt template (num_proc=12):  48%|████▊     | 1132/2339 [00:02<00:01, 728.77 examples/s]
Formatting comparisons with prompt template (num_proc=12):  67%|██████▋   | 1561/2339 [00:02<00:00, 916.84 examples/s]
2026-04-10 17:20:49 - INFO - __main__ - Processed train sample 41905:

Prompt:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is the best way to clean my refrigerator?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|eot_id|><|start_header_id|>user<|end_header_id|>

What are your ideas?<|eot_id|>

Chosen:
<|start_header_id|>assistant<|end_header_id|>

Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|eot_id|>

Rejected:
<|start_header_id|>assistant<|end_header_id|>

I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap.  I was also thinking that you could try using different tools, like a toothbrush, to clean the inside.  Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel.  I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect.<|eot_id|>
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[INFO|configuration_utils.py:691] 2026-04-10 17:20:49,264 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/config.json
[INFO|configuration_utils.py:765] 2026-04-10 17:20:49,265 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": false,
  "vocab_size": 128256
}


Formatting comparisons with prompt template (num_proc=12):  67%|██████▋   | 1560/2339 [00:02<00:00, 898.19 examples/s]
[INFO|modeling_utils.py:1121] 2026-04-10 17:20:49,278 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/model.safetensors.index.json
[INFO|modeling_utils.py:2167] 2026-04-10 17:20:49,279 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[WARNING|logging.py:328] 2026-04-10 17:20:49,281 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1142] 2026-04-10 17:20:49,283 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "use_cache": false
}


Formatting comparisons with prompt template (num_proc=12):  72%|███████▏  | 1673/2339 [00:03<00:00, 789.71 examples/s]
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Formatting comparisons with prompt template (num_proc=12):  91%|█████████▏| 2140/2339 [00:03<00:00, 1044.83 examples/s]
Formatting comparisons with prompt template (num_proc=12):  90%|████████▉ | 2096/2339 [00:03<00:00, 820.27 examples/s]
Formatting comparisons with prompt template (num_proc=12):  75%|███████▌  | 1755/2339 [00:03<00:00, 1039.35 examples/s]
Formatting comparisons with prompt template (num_proc=12):  57%|█████▋    | 1339/2339 [00:02<00:01, 808.95 examples/s]
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 640.90it/s]
[WARNING|trainer.py:821] 2026-04-10 17:20:49,444 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.

Formatting comparisons with prompt template (num_proc=12):  81%|████████  | 1894/2339 [00:02<00:00, 1093.28 examples/s]
Formatting comparisons with prompt template (num_proc=12): 100%|█████████▉| 2331/2339 [00:03<00:00, 1038.87 examples/s]
Formatting comparisons with prompt template (num_proc=12):  81%|████████  | 1883/2339 [00:03<00:00, 894.47 examples/s]
Formatting comparisons with prompt template (num_proc=12):  99%|█████████▊| 2308/2339 [00:03<00:00, 1057.74 examples/s]
Formatting comparisons with prompt template (num_proc=12):  67%|██████▋   | 1560/2339 [00:02<00:00, 993.35 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs4ee5c347154bfc4e00001c88'

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 667.93 examples/s] 

Formatting comparisons with prompt template (num_proc=12):  90%|█████████ | 2116/2339 [00:03<00:00, 1069.99 examples/s]
Formatting comparisons with prompt template (num_proc=12):  83%|████████▎ | 1950/2339 [00:03<00:00, 968.28 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs88b0aa9233adc5a400001c8a'
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 665.59 examples/s] 
[WARNING|logging.py:328] 2026-04-10 17:20:49,650 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.

Formatting comparisons with prompt template (num_proc=12):  75%|███████▌  | 1755/2339 [00:02<00:00, 1052.98 examples/s]
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Formatting comparisons with prompt template (num_proc=12):  92%|█████████▏| 2145/2339 [00:03<00:00, 1067.72 examples/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 828.47it/s]
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[WARNING|logging.py:328] 2026-04-10 17:20:49,728 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Formatting comparisons with prompt template (num_proc=12):  92%|█████████▏| 2145/2339 [00:03<00:00, 1041.56 examples/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 702.31it/s]

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 1113.18 examples/s]
Formatting comparisons with prompt template (num_proc=12):  83%|████████▎ | 1950/2339 [00:03<00:00, 1108.25 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs40aab1e2a121061e00001c92'

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs2152ef03de76500b00001c93'

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 654.40 examples/s] 

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 691.77 examples/s] 

Formatting comparisons with prompt template (num_proc=12):  99%|█████████▊| 2306/2339 [00:03<00:00, 1041.41 examples/s]
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 231.56it/s]
[WARNING|trainer.py:821] 2026-04-10 17:20:49,977 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[WARNING|logging.py:328] 2026-04-10 17:20:50,019 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[WARNING|logging.py:328] 2026-04-10 17:20:50,019 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs78cc63d12c66250200001c95'

Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 656.99it/s]
[WARNING|trainer.py:821] 2026-04-10 17:20:50,040 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 637.51 examples/s] 

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 701.64it/s]

Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 568.94it/s]

Formatting comparisons with prompt template (num_proc=12):  92%|█████████▏| 2144/2339 [00:03<00:00, 978.77 examples/s]
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[WARNING|logging.py:328] 2026-04-10 17:20:50,131 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 887.17it/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs2b8fe5e9b3e8c5be00001c97'

Formatting comparisons with prompt template (num_proc=12):  92%|█████████▏| 2144/2339 [00:03<00:00, 638.84 examples/s]
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[WARNING|logging.py:328] 2026-04-10 17:20:50,273 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 849.96it/s]
[WARNING|trainer.py:821] 2026-04-10 17:20:50,315 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.

Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 539.02it/s]

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 962.94it/s]
[WARNING|trainer.py:821] 2026-04-10 17:20:50,363 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 959.76it/s]
[WARNING|trainer.py:821] 2026-04-10 17:20:50,427 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 993.71it/s]
[WARNING|trainer.py:821] 2026-04-10 17:20:50,542 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.

Loading checkpoint shards:  14%|█▍        | 1/7 [00:01<00:09,  1.65s/it]
Loading checkpoint shards:  29%|██▊       | 2/7 [00:03<00:07,  1.53s/it]
Loading checkpoint shards:  43%|████▎     | 3/7 [00:04<00:06,  1.54s/it]
Loading checkpoint shards:  57%|█████▋    | 4/7 [00:06<00:04,  1.53s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:07<00:03,  1.52s/it]
Loading checkpoint shards:  86%|████████▌ | 6/7 [00:09<00:01,  1.50s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00,  1.27s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00,  1.42s/it]
[INFO|modeling_utils.py:4926] 2026-04-10 17:20:59,246 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4934] 2026-04-10 17:20:59,246 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2026-04-10 17:20:59,248 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/generation_config.json
[INFO|configuration_utils.py:1142] 2026-04-10 17:20:59,248 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

[INFO|configuration_utils.py:691] 2026-04-10 17:20:59,250 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/config.json
[INFO|configuration_utils.py:765] 2026-04-10 17:20:59,250 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:1121] 2026-04-10 17:20:59,251 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/model.safetensors.index.json
[INFO|modeling_utils.py:2167] 2026-04-10 17:20:59,252 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1142] 2026-04-10 17:20:59,254 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "use_cache": false
}


Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|█▍        | 1/7 [00:01<00:09,  1.53s/it]
Loading checkpoint shards:  29%|██▊       | 2/7 [00:02<00:07,  1.47s/it]
Loading checkpoint shards:  43%|████▎     | 3/7 [00:04<00:05,  1.48s/it]
Loading checkpoint shards:  57%|█████▋    | 4/7 [00:05<00:04,  1.48s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:07<00:02,  1.47s/it]
Loading checkpoint shards:  86%|████████▌ | 6/7 [00:08<00:01,  1.47s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00,  1.24s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00,  1.38s/it]
[INFO|modeling_utils.py:4926] 2026-04-10 17:21:09,066 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4934] 2026-04-10 17:21:09,066 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2026-04-10 17:21:09,069 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/generation_config.json
[INFO|configuration_utils.py:1142] 2026-04-10 17:21:09,070 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

[WARNING|trainer.py:821] 2026-04-10 17:21:09,071 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.
[WARNING|trainer.py:816] 2026-04-10 17:21:09,074 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

Tokenizing train (num_proc=12):   0%|          | 0/43598 [00:00<?, ? examples/s]
Tokenizing train (num_proc=12):   0%|          | 128/43598 [00:36<3:26:34,  3.51 examples/s]
Tokenizing train (num_proc=12):   1%|          | 256/43598 [00:36<1:25:10,  8.48 examples/s]
Tokenizing train (num_proc=12):   1%|          | 384/43598 [00:36<46:25, 15.51 examples/s]  
Tokenizing train (num_proc=12):   1%|          | 512/43598 [00:36<28:16, 25.40 examples/s]
Tokenizing train (num_proc=12):   1%|▏         | 640/43598 [00:36<18:17, 39.13 examples/s]
Tokenizing train (num_proc=12):   2%|▏         | 768/43598 [00:37<12:15, 58.20 examples/s]
Tokenizing train (num_proc=12):   2%|▏         | 896/43598 [00:37<08:26, 84.23 examples/s]
Tokenizing train (num_proc=12):   2%|▏         | 1024/43598 [00:37<05:57, 119.20 examples/s]
Tokenizing train (num_proc=12):   3%|▎         | 1152/43598 [00:37<04:16, 165.21 examples/s]
Tokenizing train (num_proc=12):   3%|▎         | 1280/43598 [00:37<03:09, 223.51 examples/s]
Tokenizing train (num_proc=12):   3%|▎         | 1408/43598 [00:37<02:22, 296.37 examples/s]
Tokenizing train (num_proc=12):   4%|▎         | 1536/43598 [00:37<01:49, 382.93 examples/s]
Tokenizing train (num_proc=12):   4%|▍         | 1664/43598 [00:37<01:27, 477.75 examples/s]
Tokenizing train (num_proc=12):   4%|▍         | 1792/43598 [00:38<01:12, 574.18 examples/s]
Tokenizing train (num_proc=12):   4%|▍         | 1920/43598 [00:38<01:02, 662.25 examples/s]
Tokenizing train (num_proc=12):   5%|▍         | 2048/43598 [00:38<00:55, 746.77 examples/s]
Tokenizing train (num_proc=12):   5%|▍         | 2176/43598 [00:38<00:50, 815.26 examples/s]
Tokenizing train (num_proc=12):   5%|▌         | 2304/43598 [00:38<00:47, 870.37 examples/s]
Tokenizing train (num_proc=12):   6%|▌         | 2432/43598 [00:38<00:44, 923.25 examples/s]
Tokenizing train (num_proc=12):   6%|▌         | 2560/43598 [00:38<00:43, 948.19 examples/s]
Tokenizing train (num_proc=12):   6%|▌         | 2688/43598 [00:38<00:40, 1004.27 examples/s]
Tokenizing train (num_proc=12):   6%|▋         | 2816/43598 [00:38<00:40, 1013.07 examples/s]
Tokenizing train (num_proc=12):   7%|▋         | 2944/43598 [00:39<00:38, 1062.44 examples/s]
Tokenizing train (num_proc=12):   7%|▋         | 3072/43598 [00:39<00:38, 1050.02 examples/s]
Tokenizing train (num_proc=12):   7%|▋         | 3200/43598 [00:39<00:38, 1056.12 examples/s]
Tokenizing train (num_proc=12):   8%|▊         | 3328/43598 [00:39<00:38, 1041.71 examples/s]
Tokenizing train (num_proc=12):   8%|▊         | 3456/43598 [00:39<00:37, 1068.82 examples/s]
Tokenizing train (num_proc=12):   8%|▊         | 3584/43598 [00:39<00:36, 1095.95 examples/s]
Tokenizing train (num_proc=12):   8%|▊         | 3634/43598 [00:51<00:36, 1095.95 examples/s]
Tokenizing train (num_proc=12):   9%|▊         | 3762/43598 [01:05<36:28, 18.20 examples/s]  
Tokenizing train (num_proc=12):   9%|▉         | 3890/43598 [01:05<26:26, 25.03 examples/s]
Tokenizing train (num_proc=12):   9%|▉         | 4018/43598 [01:05<19:01, 34.66 examples/s]
Tokenizing train (num_proc=12):  10%|▉         | 4146/43598 [01:05<13:39, 48.15 examples/s]
Tokenizing train (num_proc=12):  10%|▉         | 4274/43598 [01:06<09:47, 66.90 examples/s]
Tokenizing train (num_proc=12):  10%|█         | 4402/43598 [01:06<07:03, 92.59 examples/s]
Tokenizing train (num_proc=12):  10%|█         | 4530/43598 [01:06<05:07, 127.22 examples/s]
Tokenizing train (num_proc=12):  11%|█         | 4658/43598 [01:06<03:45, 172.84 examples/s]
Tokenizing train (num_proc=12):  11%|█         | 4786/43598 [01:06<02:48, 230.00 examples/s]
Tokenizing train (num_proc=12):  11%|█▏        | 4914/43598 [01:06<02:09, 297.88 examples/s]
Tokenizing train (num_proc=12):  12%|█▏        | 5042/43598 [01:06<01:41, 379.18 examples/s]
Tokenizing train (num_proc=12):  12%|█▏        | 5170/43598 [01:06<01:21, 472.68 examples/s]
Tokenizing train (num_proc=12):  12%|█▏        | 5298/43598 [01:06<01:07, 570.15 examples/s]
Tokenizing train (num_proc
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfsda5f551da15d52da00001c98'

Tokenizing train (num_proc=12): 100%|██████████| 43598/43598 [05:59<00:00, 121.11 examples/s] 
[WARNING|trainer.py:816] 2026-04-10 17:27:57,834 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

Saving the dataset (0/2 shards):   0%|          | 0/43598 [00:00<?, ? examples/s]
Saving the dataset (0/2 shards):  18%|█▊        | 8000/43598 [00:00<00:00, 63762.12 examples/s]
Saving the dataset (0/2 shards):  39%|███▉      | 17000/43598 [00:00<00:00, 69974.98 examples/s]
Saving the dataset (1/2 shards):  50%|█████     | 21799/43598 [00:00<00:00, 69974.98 examples/s]
Saving the dataset (1/2 shards):  57%|█████▋    | 24799/43598 [00:00<00:00, 34603.85 examples/s]
Saving the dataset (1/2 shards):  78%|███████▊  | 33799/43598 [00:00<00:00, 44932.83 examples/s]
Saving the dataset (1/2 shards):  91%|█████████▏| 39799/43598 [00:00<00:00, 47763.02 examples/s]
Saving the dataset (2/2 shards): 100%|██████████| 43598/43598 [00:01<00:00, 47763.02 examples/s]
Saving the dataset (2/2 shards): 100%|██████████| 43598/43598 [00:01<00:00, 39385.00 examples/s]
[WARNING|trainer.py:816] 2026-04-10 17:27:59,772 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

Tokenizing test (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Tokenizing test (num_proc=12):   5%|▌         | 128/2339 [00:39<11:27,  3.21 examples/s]
Tokenizing test (num_proc=12):  14%|█▍        | 323/2339 [01:12<07:10,  4.69 examples/s]
Tokenizing test (num_proc=12):  22%|██▏       | 518/2339 [01:45<05:46,  5.26 examples/s]
Tokenizing test (num_proc=12):  30%|███       | 713/2339 [02:18<04:56,  5.49 examples/s]
Tokenizing test (num_proc=12):  39%|███▉      | 908/2339 [02:51<04:12,  5.66 examples/s]
Tokenizing test (num_proc=12):  47%|████▋     | 1103/2339 [03:23<03:34,  5.76 examples/s]
Tokenizing test (num_proc=12):  55%|█████▌    | 1298/2339 [03:56<02:58,  5.83 examples/s]
Tokenizing test (num_proc=12):  64%|██████▍   | 1493/2339 [04:29<02:24,  5.86 examples/s]
Tokenizing test (num_proc=12):  72%|███████▏  | 1688/2339 [05:02<01:51,  5.85 examples/s]
Tokenizing test (num_proc=12):  81%|████████  | 1883/2339 [05:36<01:17,  5.86 examples/s]
Tokenizing test (num_proc=12):  89%|████████▉ | 2078/2339 [06:09<00:44,  5.87 examples/s]
Tokenizing test (num_proc=12):  97%|█████████▋| 2273/2339 [06:38<00:10,  6.09 examples/s]
Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server
    server.serve_forever()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs6c9aa8faaa30bdd000001c99'

Tokenizing test (num_proc=12): 100%|██████████| 2339/2339 [06:38<00:00,  5.87 examples/s]
[WARNING|trainer.py:816] 2026-04-10 17:35:25,272 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

Saving the dataset (0/1 shards):   0%|          | 0/2339 [00:00<?, ? examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 2339/2339 [00:00<00:00, 25844.59 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 2339/2339 [00:00<00:00, 25802.38 examples/s]
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:518: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MarginDPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
[WARNING|trainer.py:816] 2026-04-10 17:35:28,156 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[INFO|trainer.py:748] 2026-04-10 17:35:28,762 >> Using auto half precision backend
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight.
  warnings.warn(
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaDecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight.
  warnings.warn(
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
  warnings.warn(
[INFO|trainer.py:2414] 2026-04-10 17:35:33,058 >> ***** Running training *****
[INFO|trainer.py:2415] 2026-04-10 17:35:33,058 >>   Num examples = 43,598
[INFO|trainer.py:2416] 2026-04-10 17:35:33,058 >>   Num Epochs = 1
[INFO|trainer.py:2417] 2026-04-10 17:35:33,058 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:2420] 2026-04-10 17:35:33,058 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2421] 2026-04-10 17:35:33,058 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:2422] 2026-04-10 17:35:33,058 >>   Total optimization steps = 340
[INFO|trainer.py:2423] 2026-04-10 17:35:33,058 >>   Number of trainable parameters = 1,003,782,656
[INFO|integration_utils.py:831] 2026-04-10 17:35:33,059 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: can-not-fand (can-not-fand-northeastern-university). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.25.1 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in /scratch/feng.yulu/dynamic-dpo-v4/wandb/wandb/run-20260410_173535-wep2te2x
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009
wandb: ⭐️ View project at https://wandb.ai/can-not-fand-northeastern-university/huggingface
wandb: 🚀 View run at https://wandb.ai/can-not-fand-northeastern-university/huggingface/runs/wep2te2x

  0%|          | 0/340 [00:00<?, ?it/s]
[WARNING|modeling_utils.py:1713] 2026-04-10 17:35:41,810 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed

  0%|          | 1/340 [00:03<18:07,  3.21s/it]
                                               
{'loss': 0.6938, 'grad_norm': 23.717201232910156, 'learning_rate': 0.0, 'margin_dpo/margin_mean': -0.0843656063079834, 'margin_dpo/margin_std': 0.20181308686733246, 'logps/chosen': -72.44038391113281, 'logps/rejected': -70.95858764648438, 'logps/ref_chosen': -72.42105865478516, 'logps/ref_rejected': -71.02362823486328, 'logits/chosen': -0.4739703834056854, 'logits/rejected': -0.44689586758613586, 'epoch': 0.0}

  1%|          | 2/340 [00:06<16:53,  3.00s/it]
  1%|          | 3/340 [00:08<16:07,  2.87s/it]
  1%|          | 4/340 [00:11<15:08,  2.70s/it]
  1%|▏         | 5/340 [00:13<14:57,  2.68s/it]
                                               
{'loss': 0.6943, 'grad_norm': 24.15522003173828, 'learning_rate': 5.88235294117647e-08, 'margin_dpo/margin_mean': -0.0912436842918396, 'margin_dpo/margin_std': 0.36911237239837646, 'logps/chosen': -76.55665588378906, 'logps/rejected': -71.69610595703125, 'logps/ref_chosen': -76.4837875366211, 'logps/ref_rejected': -71.7144775390625, 'logits/chosen': -0.5054930448532104, 'logits/rejected': -0.4999650716781616, 'epoch': 0.01}

  2%|▏         | 6/340 [00:16<14:51,  2.67s/it]
  2%|▏         | 7/340 [00:19<14:46,  2.66s/it]
  2%|▏         | 8/340 [00:21<14:29,  2.62s/it]
  3%|▎         | 9/340 [00:24<14:27,  2.62s/it]
  3%|▎         | 10/340 [00:26<14:24,  2.62s/it]
                                                
{'loss': 0.6933, 'grad_norm': 23.068735122680664, 'learning_rate': 1.3235294117647057e-07, 'margin_dpo/margin_mean': 0.0031534195877611637, 'margin_dpo/margin_std': 0.3234597444534302, 'logps/chosen': -76.17481994628906, 'logps/rejected': -73.90404510498047, 'logps/ref_chosen': -76.15269470214844, 'logps/ref_rejected': -73.87877655029297, 'logits/chosen': -0.5124594569206238, 'logits/rejected': -0.49317699670791626, 'epoch': 0.03}

  3%|▎         | 11/340 [00:29<14:27,  2.64s/it]
  4%|▎         | 12/340 [00:32<14:26,  2.64s/it]
  4%|▍         | 13/340 [00:34<14:25,  2.65s/it]
  4%|▍         | 14/340 [00:37<14:16,  2.63s/it]
  4%|▍         | 15/340 [00:40<14:10,  2.62s/it]
                                                
{'loss': 0.6898, 'grad_norm': 28.796030044555664, 'learning_rate': 2.0588235294117645e-07, 'margin_dpo/margin_mean': 0.09566803276538849, 'margin_dpo/margin_std': 0.3500857353210449, 'logps/chosen': -67.05145263671875, 'logps/rejected': -73.06277465820312, 'logps/ref_chosen': -67.0902099609375, 'logps/ref_rejected': -73.005859375, 'logits/chosen': -0.5413268208503723, 'logits/rejected': -0.5226410031318665, 'epoch': 0.04}

  5%|▍         | 16/340 [00:42<14:14,  2.64s/it]
  5%|▌         | 17/340 [00:45<14:03,  2.61s/it]
  5%|▌         | 18/340 [00:47<13:58,  2.60s/it]
  6%|▌         | 19/340 [00:50<13:48,  2.58s/it]
  6%|▌         | 20/340 [00:53<13:52,  2.60s/it]
                                                
{'loss': 0.6824, 'grad_norm': 20.94307518005371, 'learning_rate': 2.7941176470588235e-07, 'margin_dpo/margin_mean': 0.19857604801654816, 'margin_dpo/margin_std': 0.378338098526001, 'logps/chosen': -73.87080383300781, 'logps/rejected': -80.62101745605469, 'logps/ref_chosen': -73.9133071899414, 'logps/ref_rejected': -80.46495056152344, 'logits/chosen': -0.5276651382446289, 'logits/rejected': -0.5001177787780762, 'epoch': 0.06}

  6%|▌         | 21/340 [00:55<14:02,  2.64s/it]
  6%|▋         | 22/340 [00:58<14:01,  2.65s/it]
  7%|▋         | 23/340 [01:01<13:57,  2.64s/it]
  7%|▋         | 24/340 [01:03<14:09,  2.69s/it]
  7%|▋         | 25/340 [01:06<13:57,  2.66s/it]
                                                
{'loss': 0.6642, 'grad_norm': 24.610126495361328, 'learning_rate': 3.529411764705882e-07, 'margin_dpo/margin_mean': 0.44518008828163147, 'margin_dpo/margin_std': 0.6063351631164551, 'logps/chosen': -60.977256774902344, 'logps/rejected': -74.73905181884766, 'logps/ref_chosen': -61.014869689941406, 'logps/ref_rejected': -74.33148193359375, 'logits/chosen': -0.5061219930648804, 'logits/rejected': -0.5009726285934448, 'epoch': 0.07}

  8%|▊         | 26/340 [01:09<13:46,  2.63s/it]
  8%|▊         | 27/340 [01:11<13:27,  2.58s/it]
  8%|▊         | 28/340 [01:14<13:25,  2.58s/it]
  9%|▊         | 29/340 [01:16<13:26,  2.59s/it]
  9%|▉         | 30/340 [01:19<13:29,  2.61s/it]
                                                
{'loss': 0.6294, 'grad_norm': 21.515533447265625, 'learning_rate': 4.264705882352941e-07, 'margin_dpo/margin_mean': 1.5730347633361816, 'margin_dpo/margin_std': 1.7553781270980835, 'logps/chosen': -78.83164978027344, 'logps/rejected': -83.10078430175781, 'logps/ref_chosen': -78.80770111083984, 'logps/ref_rejected': -81.50379943847656, 'logits/chosen': -0.5904145240783691, 'logits/rejected': -0.5685775279998779, 'epoch': 0.09}

  9%|▉         | 31/340 [01:22<13:27,  2.61s/it]
  9%|▉         | 32/340 [01:24<13:32,  2.64s/it]
 10%|▉         | 33/340 [01:27<13:27,  2.63s/it]
 10%|█         | 34/340 [01:29<13:11,  2.59s/it]
 10%|█         | 35/340 [01:32<13:14,  2.60s/it]
                                                
{'loss': 0.6028, 'grad_norm': 19.351747512817383, 'learning_rate': 5e-07, 'margin_dpo/margin_mean': 2.158336877822876, 'margin_dpo/margin_std': 2.8764147758483887, 'logps/chosen': -86.93069458007812, 'logps/rejected': -88.55570220947266, 'logps/ref_chosen': -86.67269134521484, 'logps/ref_rejected': -86.13935852050781, 'logits/chosen': -0.5566071271896362, 'logits/rejected': -0.5428273677825928, 'epoch': 0.1}

 11%|█         | 36/340 [01:35<13:13,  2.61s/it]
 11%|█         | 37/340 [01:37<13:09,  2.61s/it]
 11%|█         | 38/340 [01:40<13:05,  2.60s/it]
 11%|█▏        | 39/340 [01:42<12:59,  2.59s/it]
 12%|█▏        | 40/340 [01:45<13:05,  2.62s/it]
                                                
{'loss': 0.5446, 'grad_norm': 18.829681396484375, 'learning_rate': 4.996706849759452e-07, 'margin_dpo/margin_mean': 4.941764831542969, 'margin_dpo/margin_std': 8.191742897033691, 'logps/chosen': -71.7585220336914, 'logps/rejected': -91.31529235839844, 'logps/ref_chosen': -69.31690216064453, 'logps/ref_rejected': -83.9319076538086, 'logits/chosen': -0.6493271589279175, 'logits/rejected': -0.6133594512939453, 'epoch': 0.12}

 12%|█▏        | 41/340 [01:48<13:08,  2.64s/it]
 12%|█▏        | 42/340 [01:50<13:01,  2.62s/it]
 13%|█▎        | 43/340 [01:53<12:56,  2.62s/it]
 13%|█▎        | 44/340 [01:55<12:47,  2.59s/it]
 13%|█▎        | 45/340 [01:58<12:39,  2.58s/it]
                                                
{'loss': 0.553, 'grad_norm': 23.498613357543945, 'learning_rate': 4.986836074908615e-07, 'margin_dpo/margin_mean': 5.294968128204346, 'margin_dpo/margin_std': 6.769883632659912, 'logps/chosen': -73.5013427734375, 'logps/rejected': -108.92988586425781, 'logps/ref_chosen': -69.97550964355469, 'logps/ref_rejected': -100.10908508300781, 'logits/chosen': -0.6821354627609253, 'logits/rejected': -0.6494560837745667, 'epoch': 0.13}

 14%|█▎        | 46/340 [02:01<12:46,  2.61s/it]
 14%|█▍        | 47/340 [02:03<13:03,  2.67s/it]
 14%|█▍        | 48/340 [02:06<12:55,  2.66s/it]
 14%|█▍        | 49/340 [02:09<12:45,  2.63s/it]
 15%|█▍        | 50/340 [02:11<12:30,  2.59s/it]
                                                
{'loss': 0.5518, 'grad_norm': 30.29952621459961, 'learning_rate': 4.970413680203148e-07, 'margin_dpo/margin_mean': 4.282275199890137, 'margin_dpo/margin_std': 7.439302921295166, 'logps/chosen': -78.32559967041016, 'logps/rejected': -95.23252868652344, 'logps/ref_chosen': -72.90187072753906, 'logps/ref_rejected': -85.52653503417969, 'logits/chosen': -0.6595835089683533, 'logits/rejected': -0.6233135461807251, 'epoch': 0.15}

 15%|█▌        | 51/340 [02:14<12:30,  2.60s/it]
 15%|█▌        | 52/340 [02:16<12:15,  2.55s/it]
 16%|█▌        | 53/340 [02:19<12:12,  2.55s/it]
 16%|█▌        | 54/340 [02:21<12:21,  2.59s/it]
 16%|█▌        | 55/340 [02:24<12:17,  2.59s/it]
                                                
{'loss': 0.5112, 'grad_norm': 23.780656814575195, 'learning_rate': 4.947482930773511e-07, 'margin_dpo/margin_mean': 7.125207424163818, 'margin_dpo/margin_std': 9.734245300292969, 'logps/chosen': -91.6336898803711, 'logps/rejected': -109.0378646850586, 'logps/ref_chosen': -87.45826721191406, 'logps/ref_rejected': -97.73722076416016, 'logits/chosen': -0.7151781916618347, 'logits/rejected': -0.6897321939468384, 'epoch': 0.16}

 16%|█▋        | 56/340 [02:27<12:20,  2.61s/it]
 17%|█▋        | 57/340 [02:29<12:23,  2.63s/it]
 17%|█▋        | 58/340 [02:32<12:20,  2.63s/it]
 17%|█▋        | 59/340 [02:35<12:15,  2.62s/it]
 18%|█▊        | 60/340 [02:37<12:20,  2.64s/it]
                                                
{'loss': 0.5286, 'grad_norm': 20.72915267944336, 'learning_rate': 4.918104238142103e-07, 'margin_dpo/margin_mean': 6.065438747406006, 'margin_dpo/margin_std': 10.341069221496582, 'logps/chosen': -110.2301254272461, 'logps/rejected': -99.53703308105469, 'logps/ref_chosen': -106.60343933105469, 'logps/ref_rejected': -89.84490203857422, 'logits/chosen': -0.6631725430488586, 'logits/rejected': -0.6214786767959595, 'epoch': 0.18}

 18%|█▊        | 61/340 [02:40<12:03,  2.59s/it]
 18%|█▊        | 62/340 [02:42<12:06,  2.61s/it]
 19%|█▊        | 63/340 [02:45<12:05,  2.62s/it]
 19%|█▉        | 64/340 [02:48<12:03,  2.62s/it]
 19%|█▉        | 65/340 [02:50<11:53,  2.59s/it]
                                                
{'loss': 0.4746, 'grad_norm': 16.05661392211914, 'learning_rate': 4.882355001067891e-07, 'margin_dpo/margin_mean': 5.947785377502441, 'margin_dpo/margin_std': 7.2523908615112305, 'logps/chosen': -79.79920959472656, 'logps/rejected': -93.5802001953125, 'logps/ref_chosen': -76.7091064453125, 'logps/ref_rejected': -84.54231262207031, 'logits/chosen': -0.6507592797279358, 'logits/rejected': -0.6253207921981812, 'epoch': 0.19}

 19%|█▉        | 66/340 [02:53<11:43,  2.57s/it]
 20%|█▉        | 67/340 [02:55<11:41,  2.57s/it]
 20%|██        | 68/340 [02:58<11:40,  2.57s/it]
 20%|██        | 69/340 [03:00<11:27,  2.54s/it]
 21%|██        | 70/340 [03:03<11:36,  2.58s/it]
                                                
{'loss': 0.4662, 'grad_norm': 16.453359603881836, 'learning_rate': 4.840329401637809e-07, 'margin_dpo/margin_mean': 8.28502082824707, 'margin_dpo/margin_std': 8.248537063598633, 'logps/chosen': -74.00252532958984, 'logps/rejected': -103.95845031738281, 'logps/ref_chosen': -70.0877914428711, 'logps/ref_rejected': -91.75868225097656, 'logits/chosen': -0.698811411857605, 'logits/rejected': -0.6621960401535034, 'epoch': 0.21}

 21%|██        | 71/340 [03:06<11:36,  2.59s/it]
 21%|██        | 72/340 [03:08<11:49,  2.65s/it]
 21%|██▏       | 73/340 [03:11<11:42,  2.63s/it]
 22%|██▏       | 74/340 [03:14<11:34,  2.61s/it]
 22%|██▏       | 75/340 [03:16<11:29,  2.60s/it]
                                                
{'loss': 0.4863, 'grad_norm': 17.00535011291504, 'learning_rate': 4.792138157142157e-07, 'margin_dpo/margin_mean': 8.173115730285645, 'margin_dpo/margin_std': 8.817681312561035, 'logps/chosen': -78.68012237548828, 'logps/rejected': -97.5809555053711, 'logps/ref_chosen': -74.91792297363281, 'logps/ref_rejected': -85.64566802978516, 'logits/chosen': -0.6827956438064575, 'logits/rejected': -0.6566829681396484, 'epoch': 0.22}

 22%|██▏       | 76/340 [03:19<11:28,  2.61s/it]
 23%|██▎       | 77/340 [03:21<11:27,  2.62s/it]
 23%|██▎       | 78/340 [03:24<11:23,  2.61s/it]
 23%|██▎       | 79/340 [03:27<11:19,  2.60s/it]
 24%|██▎       | 80/340 [03:29<11:12,  2.59s/it]
                                                
{'loss': 0.451, 'grad_norm': 21.13958168029785, 'learning_rate': 4.737908228387656e-07, 'margin_dpo/margin_mean': 7.951646327972412, 'margin_dpo/margin_std': 8.248537063598633, 'logps/chosen': -102.5855941772461, 'logps/rejected': -105.6670150756836, 'logps/ref_chosen': -97.75636291503906, 'logps/ref_rejected': -92.88613891601562, 'logits/chosen': -0.7372442483901978, 'logits/rejected': -0.689995288848877, 'epoch': 0.24}

 24%|██▍       | 81/340 [03:32<11:13,  2.60s/it]
 24%|██▍       | 82/340 [03:34<11:00,  2.56s/it]
 24%|██▍       | 83/340 [03:37<10:48,  2.53s/it]
 25%|██▍       | 84/340 [03:39<10:53,  2.55s/it]
 25%|██▌       | 85/340 [03:42<10:59,  2.59s/it]
                                                
{'loss': 0.4569, 'grad_norm': 18.165218353271484, 'learning_rate': 4.6777824852166437e-07, 'margin_dpo/margin_mean': 7.221736907958984, 'margin_dpo/margin_std': 8.439001083374023, 'logps/chosen': -85.70280456542969, 'logps/rejected': -101.9955825805664, 'logps/ref_chosen': -78.9326171875, 'logps/ref_rejected': -88.00363159179688, 'logits/chosen': -0.6671745777130127, 'logits/rejected': -0.6385531425476074, 'epoch': 0.25}

 25%|██▌       | 86/340 [03:45<10:57,  2.59s/it]
 26%|██▌       | 87/340 [03:47<10:51,  2.57s/it]
 26%|██▌       | 88/340 [03:50<10:45,  2.56s/it]
 26%|██▌       | 89/340 [03:52<10:47,  2.58s/it]
 26%|██▋       | 90/340 [03:55<10:44,  2.58s/it]
                                                
{'loss': 0.4419, 'grad_norm': 20.739215850830078, 'learning_rate': 4.611919330113591e-07, 'margin_dpo/margin_mean': 9.419827461242676, 'margin_dpo/margin_std': 9.238184928894043, 'logps/chosen': -84.86643981933594, 'logps/rejected': -105.78071594238281, 'logps/ref_chosen': -78.78388214111328, 'logps/ref_rejected': -90.2783203125, 'logits/chosen': -0.6510001420974731, 'logits/rejected': -0.629525899887085, 'epoch': 0.26}

 27%|██▋       | 91/340 [03:57<10:44,  2.59s/it]
 27%|██▋       | 92/340 [04:00<10:53,  2.63s/it]
 27%|██▋       | 93/340 [04:03<10:48,  2.62s/it]
 28%|██▊       | 94/340 [04:05<10:40,  2.60s/it]
 28%|██▊       | 95/340 [04:08<10:39,  2.61s/it]
                                                
{'loss': 0.4514, 'grad_norm': 17.511486053466797, 'learning_rate': 4.5404922808905543e-07, 'margin_dpo/margin_mean': 7.360299587249756, 'margin_dpo/margin_std': 11.319549560546875, 'logps/chosen': -74.32402038574219, 'logps/rejected': -78.22425842285156, 'logps/ref_chosen': -65.91403198242188, 'logps/ref_rejected': -62.45396041870117, 'logits/chosen': -0.6517031788825989, 'logits/rejected': -0.6104840040206909, 'epoch': 0.28}

 28%|██▊       | 96/340 [04:11<10:49,  2.66s/it]
 29%|██▊       | 97/340 [04:13<10:33,  2.61s/it]
 29%|██▉       | 98/340 [04:16<10:33,  2.62s/it]
 29%|██▉       | 99/340 [04:18<10:26,  2.60s/it]
 29%|██▉       | 100/340 [04:21<10:27,  2.61s/it]
                                                 
{'loss': 0.4265, 'grad_norm': 18.769145965576172, 'learning_rate': 4.4636895135509966e-07, 'margin_dpo/margin_mean': 9.642545700073242, 'margin_dpo/margin_std': 11.237717628479004, 'logps/chosen': -84.81422424316406, 'logps/rejected': -110.46153259277344, 'logps/ref_chosen': -77.24075317382812, 'logps/ref_rejected': -93.24552917480469, 'logits/chosen': -0.6338332295417786, 'logits/rejected': -0.6123248338699341, 'epoch': 0.29}

[INFO|trainer.py:4307] 2026-04-10 17:40:00,228 >> 
***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-10 17:40:00,228 >>   Num examples = 2339
[INFO|trainer.py:4312] 2026-04-10 17:40:00,228 >>   Batch size = 16


  0%|          | 0/18 [00:00<?, ?it/s]
 11%|█         | 2/18 [00:01<00:08,  1.89it/s]
 17%|█▋        | 3/18 [00:02<00:11,  1.32it/s]
 22%|██▏       | 4/18 [00:03<00:12,  1.13it/s]
 28%|██▊       | 5/18 [00:04<00:13,  1.00s/it]
 33%|███▎      | 6/18 [00:05<00:12,  1.05s/it]
 39%|███▉      | 7/18 [00:06<00:11,  1.08s/it]
 44%|████▍     | 8/18 [00:07<00:10,  1.08s/it]
 50%|█████     | 9/18 [00:08<00:09,  1.09s/it]
 56%|█████▌    | 10/18 [00:10<00:09,  1.13s/it]
 61%|██████    | 11/18 [00:11<00:07,  1.09s/it]
 67%|██████▋   | 12/18 [00:12<00:06,  1.15s/it]
 72%|███████▏  | 13/18 [00:13<00:05,  1.11s/it]
 78%|███████▊  | 14/18 [00:14<00:04,  1.16s/it]
 83%|████████▎ | 15/18 [00:15<00:03,  1.16s/it]
 89%|████████▉ | 16/18 [00:16<00:02,  1.10s/it]
 94%|█████████▍| 17/18 [00:18<00:01,  1.13s/it]
100%|██████████| 18/18 [00:18<00:00,  1.07s/it]
                                                 

{'eval_loss': 0.5427329540252686, 'eval_runtime': 20.4041, 'eval_samples_per_second': 114.634, 'eval_steps_per_second': 0.931, 'eval_margin_dpo/margin_mean': 5.341787338256836, 'eval_margin_dpo/margin_std': 10.061349868774414, 'eval_logps/chosen': -107.19888305664062, 'eval_logps/rejected': -95.6607437133789, 'eval_logps/ref_chosen': -97.0617446899414, 'eval_logps/ref_rejected': -80.18183135986328, 'eval_logits/chosen': -0.6361338496208191, 'eval_logits/rejected': -0.6085699200630188, 'epoch': 0.29}

 29%|██▉       | 100/340 [04:41<10:27,  2.61s/it]

 30%|██▉       | 101/340 [04:44<34:47,  8.74s/it]
 30%|███       | 102/340 [04:47<27:22,  6.90s/it]
 30%|███       | 103/340 [04:49<22:02,  5.58s/it]
 31%|███       | 104/340 [04:52<18:25,  4.68s/it]
 31%|███       | 105/340 [04:54<15:47,  4.03s/it]
                                                 
{'loss': 0.427, 'grad_norm': 17.255924224853516, 'learning_rate': 4.381713366536311e-07, 'margin_dpo/margin_mean': 12.32937240600586, 'margin_dpo/margin_std': 11.30049991607666, 'logps/chosen': -76.29129791259766, 'logps/rejected': -92.5668716430664, 'logps/ref_chosen': -70.76807403564453, 'logps/ref_rejected': -74.71427917480469, 'logits/chosen': -0.6774856448173523, 'logits/rejected': -0.6355584263801575, 'epoch': 0.31}

 31%|███       | 106/340 [04:57<13:57,  3.58s/it]
 31%|███▏      | 107/340 [04:59<12:44,  3.28s/it]
 32%|███▏      | 108/340 [05:02<11:53,  3.08s/it]
 32%|███▏      | 109/340 [05:05<11:20,  2.94s/it]
 32%|███▏      | 110/340 [05:07<10:52,  2.84s/it]
                                                 
{'loss': 0.4145, 'grad_norm': 18.394851684570312, 'learning_rate': 4.2947798076611047e-07, 'margin_dpo/margin_mean': 11.69237995147705, 'margin_dpo/margin_std': 12.493224143981934, 'logps/chosen': -89.50286102294922, 'logps/rejected': -109.15755462646484, 'logps/ref_chosen': -81.14533996582031, 'logps/ref_rejected': -89.10765838623047, 'logits/chosen': -0.6861704587936401, 'logits/rejected': -0.6574342846870422, 'epoch': 0.32}

 33%|███▎      | 111/340 [05:10<10:28,  2.75s/it]
 33%|███▎      | 112/340 [05:12<09:44,  2.56s/it]
 33%|███▎      | 113/340 [05:14<09:36,  2.54s/it]
 34%|███▎      | 114/340 [05:17<09:35,  2.55s/it]
 34%|███▍      | 115/340 [05:20<09:35,  2.56s/it]
                                                 
{'loss': 0.4011, 'grad_norm': 18.24220848083496, 'learning_rate': 4.203117865141635e-07, 'margin_dpo/margin_mean': 11.827718734741211, 'margin_dpo/margin_std': 12.786788940429688, 'logps/chosen': -76.1394271850586, 'logps/rejected': -122.98934173583984, 'logps/ref_chosen': -64.77717590332031, 'logps/ref_rejected': -99.79936218261719, 'logits/chosen': -0.6698350310325623, 'logits/rejected': -0.6532580256462097, 'epoch': 0.34}

 34%|███▍      | 116/340 [05:22<09:45,  2.61s/it]
 34%|███▍      | 117/340 [05:25<09:42,  2.61s/it]
 35%|███▍      | 118/340 [05:27<09:38,  2.60s/it]
 35%|███▌      | 119/340 [05:30<09:45,  2.65s/it]
 35%|███▌      | 120/340 [05:33<09:38,  2.63s/it]
                                                 
{'loss': 0.4154, 'grad_norm': 22.541603088378906, 'learning_rate': 4.106969024216348e-07, 'margin_dpo/margin_mean': 13.501307487487793, 'margin_dpo/margin_std': 12.104052543640137, 'logps/chosen': -86.44108581542969, 'logps/rejected': -104.9658203125, 'logps/ref_chosen': -77.35191345214844, 'logps/ref_rejected': -82.3753433227539, 'logits/chosen': -0.6674679517745972, 'logits/rejected': -0.6289718151092529, 'epoch': 0.35}

 36%|███▌      | 121/340 [05:35<09:34,  2.62s/it]
 36%|███▌      | 122/340 [05:38<09:29,  2.61s/it]
 36%|███▌      | 123/340 [05:41<09:30,  2.63s/it]
 36%|███▋      | 124/340 [05:43<09:19,  2.59s/it]
 37%|███▋      | 125/340 [05:45<08:59,  2.51s/it]
                                                 
{'loss': 0.3871, 'grad_norm': 18.34996223449707, 'learning_rate': 4.006586590948141e-07, 'margin_dpo/margin_mean': 12.104842185974121, 'margin_dpo/margin_std': 12.706830978393555, 'logps/chosen': -84.34068298339844, 'logps/rejected': -109.58891296386719, 'logps/ref_chosen': -74.56766510009766, 'logps/ref_rejected': -87.71104431152344, 'logits/chosen': -0.6953171491622925, 'logits/rejected': -0.6653636693954468, 'epoch': 0.37}

 37%|███▋      | 126/340 [05:48<08:53,  2.49s/it]
 37%|███▋      | 127/340 [05:51<08:57,  2.52s/it]
 38%|███▊      | 128/340 [05:53<08:59,  2.54s/it]
 38%|███▊      | 129/340 [05:56<08:59,  2.56s/it]
 38%|███▊      | 130/340 [05:58<09:00,  2.57s/it]
                                                 
{'loss': 0.4128, 'grad_norm': 19.588794708251953, 'learning_rate': 3.9022350248844246e-07, 'margin_dpo/margin_mean': 13.186391830444336, 'margin_dpo/margin_std': 16.62637710571289, 'logps/chosen': -92.37910461425781, 'logps/rejected': -118.1786117553711, 'logps/ref_chosen': -79.86932373046875, 'logps/ref_rejected': -92.48243713378906, 'logits/chosen': -0.6832663416862488, 'logits/rejected': -0.6475099325180054, 'epoch': 0.38}

 39%|███▊      | 131/340 [06:01<08:43,  2.51s/it]
 39%|███▉      | 132/340 [06:03<08:50,  2.55s/it]
 39%|███▉      | 133/340 [06:06<08:41,  2.52s/it]
 39%|███▉      | 134/340 [06:08<08:39,  2.52s/it]
 40%|███▉      | 135/340 [06:11<08:41,  2.54s/it]
                                                 
{'loss': 0.3855, 'grad_norm': 19.383163452148438, 'learning_rate': 3.794189242333106e-07, 'margin_dpo/margin_mean': 12.800195693969727, 'margin_dpo/margin_std': 15.272809028625488, 'logps/chosen': -93.94104766845703, 'logps/rejected': -115.92558288574219, 'logps/ref_chosen': -82.55046081542969, 'logps/ref_rejected': -91.73478698730469, 'logits/chosen': -0.6862474679946899, 'logits/rejected': -0.6508306264877319, 'epoch': 0.4}

 40%|████      | 136/340 [06:13<08:39,  2.55s/it]
 40%|████      | 137/340 [06:16<08:27,  2.50s/it]
 41%|████      | 138/340 [06:18<08:27,  2.51s/it]
 41%|████      | 139/340 [06:21<08:23,  2.50s/it]
 41%|████      | 140/340 [06:24<08:33,  2.57s/it]
                                                 
{'loss': 0.3782, 'grad_norm': 15.547196388244629, 'learning_rate': 3.6827338920900253e-07, 'margin_dpo/margin_mean': 12.726763725280762, 'margin_dpo/margin_std': 12.062446594238281, 'logps/chosen': -86.93388366699219, 'logps/rejected': -111.50956726074219, 'logps/ref_chosen': -76.40785217285156, 'logps/ref_rejected': -88.25675964355469, 'logits/chosen': -0.6315192580223083, 'logits/rejected': -0.5951318740844727, 'epoch': 0.41}

 41%|████▏     | 141/340 [06:26<08:33,  2.58s/it]
 42%|████▏     | 142/340 [06:29<08:36,  2.61s/it]
 42%|████▏     | 143/340 [06:31<08:32,  2.60s/it]
 42%|████▏     | 144/340 [06:34<08:18,  2.54s/it]
 43%|████▎     | 145/340 [06:37<08:23,  2.58s/it]
                                                 
{'loss': 0.3722, 'grad_norm': 16.376129150390625, 'learning_rate': 3.568162605525952e-07, 'margin_dpo/margin_mean': 14.328218460083008, 'margin_dpo/margin_std': 13.251609802246094, 'logps/chosen': -90.28238677978516, 'logps/rejected': -105.7525863647461, 'logps/ref_chosen': -79.43595123291016, 'logps/ref_rejected': -80.57792663574219, 'logits/chosen': -0.694092869758606, 'logits/rejected': -0.6596013307571411, 'epoch': 0.43}

 43%|████▎     | 146/340 [06:39<08:23,  2.59s/it]
 43%|████▎     | 147/340 [06:42<08:24,  2.61s/it]
 44%|████▎     | 148/340 [06:44<08:18,  2.60s/it]
 44%|████▍     | 149/340 [06:47<08:13,  2.59s/it]
 44%|████▍     | 150/340 [06:50<08:13,  2.60s/it]
                                                 
{'loss': 0.4063, 'grad_norm': 19.48674201965332, 'learning_rate': 3.4507772230088147e-07, 'margin_dpo/margin_mean': 9.931692123413086, 'margin_dpo/margin_std': 11.23712158203125, 'logps/chosen': -82.92797088623047, 'logps/rejected': -99.82804870605469, 'logps/ref_chosen': -69.55223846435547, 'logps/ref_rejected': -76.5206298828125, 'logits/chosen': -0.6205201745033264, 'logits/rejected': -0.5989262461662292, 'epoch': 0.44}

 44%|████▍     | 151/340 [06:52<08:08,  2.59s/it]
 45%|████▍     | 152/340 [06:55<08:06,  2.59s/it]
 45%|████▌     | 153/340 [06:57<08:02,  2.58s/it]
 45%|████▌     | 154/340 [07:00<08:03,  2.60s/it]
 46%|████▌     | 155/340 [07:02<07:58,  2.59s/it]
                                                 
{'loss': 0.3818, 'grad_norm': 18.904706954956055, 'learning_rate': 3.3308869986991487e-07, 'margin_dpo/margin_mean': 12.618766784667969, 'margin_dpo/margin_std': 14.547628402709961, 'logps/chosen': -98.33650207519531, 'logps/rejected': -106.65342712402344, 'logps/ref_chosen': -83.78580474853516, 'logps/ref_rejected': -79.48396301269531, 'logits/chosen': -0.6716780662536621, 'logits/rejected': -0.6312578320503235, 'epoch': 0.46}

 46%|████▌     | 156/340 [07:05<07:46,  2.53s/it]
 46%|████▌     | 157/340 [07:07<07:44,  2.54s/it]
 46%|████▋     | 158/340 [07:10<07:40,  2.53s/it]
 47%|████▋     | 159/340 [07:12<07:37,  2.53s/it]
 47%|████▋     | 160/340 [07:15<07:39,  2.55s/it]
                                                 
{'loss': 0.3508, 'grad_norm': 16.047494888305664, 'learning_rate': 3.208807785813777e-07, 'margin_dpo/margin_mean': 13.903097152709961, 'margin_dpo/margin_std': 10.593317031860352, 'logps/chosen': -85.62313079833984, 'logps/rejected': -123.3752212524414, 'logps/ref_chosen': -71.89569091796875, 'logps/ref_rejected': -95.74468231201172, 'logits/chosen': -0.6216621994972229, 'logits/rejected': -0.5977298617362976, 'epoch': 0.47}

 47%|████▋     | 161/340 [07:18<07:37,  2.56s/it]
 48%|████▊     | 162/340 [07:20<07:40,  2.59s/it]
 48%|████▊     | 163/340 [07:23<07:41,  2.61s/it]
 48%|████▊     | 164/340 [07:26<07:39,  2.61s/it]
 49%|████▊     | 165/340 [07:28<07:37,  2.61s/it]
                                                 
{'loss': 0.3544, 'grad_norm': 17.789417266845703, 'learning_rate': 3.084861204504122e-07, 'margin_dpo/margin_mean': 17.097713470458984, 'margin_dpo/margin_std': 14.805742263793945, 'logps/chosen': -91.7447738647461, 'logps/rejected': -120.28157806396484, 'logps/ref_chosen': -77.03978729248047, 'logps/ref_rejected': -88.47887420654297, 'logits/chosen': -0.6328192949295044, 'logits/rejected': -0.5899003148078918, 'epoch': 0.49}

 49%|████▉     | 166/340 [07:31<07:22,  2.55s/it]
 49%|████▉     | 167/340 [07:33<07:12,  2.50s/it]
 49%|████▉     | 168/340 [07:36<07:13,  2.52s/it]
 50%|████▉     | 169/340 [07:38<07:05,  2.49s/it]
 50%|█████     | 170/340 [07:40<06:58,  2.46s/it]
                                                 
{'loss': 0.3454, 'grad_norm': 19.674264907836914, 'learning_rate': 2.959373794541426e-07, 'margin_dpo/margin_mean': 14.613912582397461, 'margin_dpo/margin_std': 12.491094589233398, 'logps/chosen': -88.34684753417969, 'logps/rejected': -119.37635803222656, 'logps/ref_chosen': -71.93138122558594, 'logps/ref_rejected': -88.34697723388672, 'logits/chosen': -0.6691595911979675, 'logits/rejected': -0.6374617218971252, 'epoch': 0.5}

 50%|█████     | 171/340 [07:43<07:15,  2.58s/it]
 51%|█████     | 172/340 [07:46<07:13,  2.58s/it]
 51%|█████     | 173/340 [07:48<07:10,  2.58s/it]
 51%|█████     | 174/340 [07:51<06:51,  2.48s/it]
 51%|█████▏    | 175/340 [07:53<06:52,  2.50s/it]
                                                 
{'loss': 0.3713, 'grad_norm': 20.303539276123047, 'learning_rate': 2.8326761550411346e-07, 'margin_dpo/margin_mean': 12.219950675964355, 'margin_dpo/margin_std': 13.678237915039062, 'logps/chosen': -86.78947448730469, 'logps/rejected': -123.58447265625, 'logps/ref_chosen': -68.0127182006836, 'logps/ref_rejected': -92.58775329589844, 'logits/chosen': -0.6473700404167175, 'logits/rejected': -0.6196728944778442, 'epoch': 0.51}

 52%|█████▏    | 176/340 [07:56<06:55,  2.53s/it]
 52%|█████▏    | 177/340 [07:58<06:56,  2.55s/it]
 52%|█████▏    | 178/340 [08:01<06:56,  2.57s/it]
 53%|█████▎    | 179/340 [08:04<06:55,  2.58s/it]
 53%|█████▎    | 180/340 [08:06<06:49,  2.56s/it]
                                                 
{'loss': 0.3585, 'grad_norm': 17.388011932373047, 'learning_rate': 2.7051020734928443e-07, 'margin_dpo/margin_mean': 17.15180778503418, 'margin_dpo/margin_std': 14.575396537780762, 'logps/chosen': -80.26910400390625, 'logps/rejected': -122.92547607421875, 'logps/ref_chosen': -61.942466735839844, 'logps/ref_rejected': -87.44703674316406, 'logits/chosen': -0.611466646194458, 'logits/rejected': -0.587906002998352, 'epoch': 0.53}

 53%|█████▎    | 181/340 [08:09<06:50,  2.58s/it]
 54%|█████▎    | 182/340 [08:11<06:47,  2.58s/it]
 54%|█████▍    | 183/340 [08:14<06:50,  2.61s/it]
 54%|█████▍    | 184/340 [08:17<06:55,  2.66s/it]
 54%|█████▍    | 185/340 [08:19<06:50,  2.65s/it]
                                                 
{'loss': 0.3501, 'grad_norm': 19.291353225708008, 'learning_rate': 2.5769876463904263e-07, 'margin_dpo/margin_mean': 14.570175170898438, 'margin_dpo/margin_std': 14.043818473815918, 'logps/chosen': -87.93196105957031, 'logps/rejected': -99.19012451171875, 'logps/ref_chosen': -72.35160064697266, 'logps/ref_rejected': -69.03958129882812, 'logits/chosen': -0.6199885606765747, 'logits/rejected': -0.5656689405441284, 'epoch': 0.54}

 55%|█████▍    | 186/340 [08:22<06:46,  2.64s/it]
 55%|█████▌    | 187/340 [08:25<06:43,  2.64s/it]
 55%|█████▌    | 188/340 [08:27<06:38,  2.62s/it]
 56%|█████▌    | 189/340 [08:30<06:26,  2.56s/it]
 56%|█████▌    | 190/340 [08:32<06:26,  2.58s/it]
                                                 
{'loss': 0.3605, 'grad_norm': 18.855066299438477, 'learning_rate': 2.4486703937790243e-07, 'margin_dpo/margin_mean': 14.287073135375977, 'margin_dpo/margin_std': 14.959236145019531, 'logps/chosen': -100.40862274169922, 'logps/rejected': -106.55586242675781, 'logps/ref_chosen': -79.45222473144531, 'logps/ref_rejected': -71.31239318847656, 'logits/chosen': -0.6612949967384338, 'logits/rejected': -0.6132839322090149, 'epoch': 0.56}

 56%|█████▌    | 191/340 [08:35<06:25,  2.59s/it]
 56%|█████▋    | 192/340 [08:37<06:16,  2.54s/it]
 57%|█████▋    | 193/340 [08:40<06:15,  2.56s/it]
 57%|█████▋    | 194/340 [08:43<06:16,  2.58s/it]
 57%|█████▋    | 195/340 [08:45<06:15,  2.59s/it]
                                                 
{'loss': 0.3429, 'grad_norm': 20.077083587646484, 'learning_rate': 2.320488370051681e-07, 'margin_dpo/margin_mean': 18.4035587310791, 'margin_dpo/margin_std': 15.252446174621582, 'logps/chosen': -89.4631118774414, 'logps/rejected': -121.50825500488281, 'logps/ref_chosen': -71.20511627197266, 'logps/ref_rejected': -84.8467025756836, 'logits/chosen': -0.667130172252655, 'logits/rejected': -0.6179927587509155, 'epoch': 0.57}

 57%|█████▋    | 195/340 [08:45<06:15,  2.59s/it]
 58%|█████▊    | 196/340 [08:48<06:08,  2.56s/it]
 58%|█████▊    | 197/340 [08:50<06:06,  2.56s/it]
 58%|█████▊    | 198/340 [08:53<05:58,  2.52s/it]
 59%|█████▊    | 199/340 [08:55<05:57,  2.54s/it]
 59%|█████▉    | 200/340 [08:58<05:50,  2.50s/it]
                                                 
{'loss': 0.3411, 'grad_norm': 17.699968338012695, 'learning_rate': 2.192779273338215e-07, 'margin_dpo/margin_mean': 16.283123016357422, 'margin_dpo/margin_std': 15.363842964172363, 'logps/chosen': -89.16279602050781, 'logps/rejected': -104.97953796386719, 'logps/ref_chosen': -71.31782531738281, 'logps/ref_rejected': -70.8514404296875, 'logits/chosen': -0.6087943911552429, 'logits/rejected': -0.5693117380142212, 'epoch': 0.59}

 59%|█████▉    | 200/340 [08:58<05:50,  2.50s/it]
[INFO|trainer.py:4307] 2026-04-10 17:44:36,788 >> 
***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-10 17:44:36,788 >>   Num examples = 2339
[INFO|trainer.py:4312] 2026-04-10 17:44:36,788 >>   Batch size = 16


  0%|          | 0/18 [00:00<?, ?it/s]
 11%|█         | 2/18 [00:01<00:08,  1.90it/s]
 17%|█▋        | 3/18 [00:02<00:11,  1.32it/s]
 22%|██▏       | 4/18 [00:03<00:12,  1.13it/s]
 28%|██▊       | 5/18 [00:04<00:13,  1.01s/it]
 33%|███▎      | 6/18 [00:05<00:12,  1.04s/it]
 39%|███▉      | 7/18 [00:06<00:11,  1.08s/it]
 44%|████▍     | 8/18 [00:07<00:10,  1.07s/it]
 50%|█████     | 9/18 [00:08<00:09,  1.08s/it]
 56%|█████▌    | 10/18 [00:10<00:08,  1.12s/it]
 61%|██████    | 11/18 [00:11<00:07,  1.09s/it]
 67%|██████▋   | 12/18 [00:12<00:06,  1.14s/it]
 72%|███████▏  | 13/18 [00:13<00:05,  1.11s/it]
 78%|███████▊  | 14/18 [00:14<00:04,  1.15s/it]
 83%|████████▎ | 15/18 [00:15<00:03,  1.15s/it]
 89%|████████▉ | 16/18 [00:16<00:02,  1.10s/it]
 94%|█████████▍| 17/18 [00:17<00:01,  1.12s/it]
100%|██████████| 18/18 [00:18<00:00,  1.06s/it]
                                                 

{'eval_loss': 0.4754122495651245, 'eval_runtime': 20.3073, 'eval_samples_per_second': 115.18, 'eval_steps_per_second': 0.936, 'eval_margin_dpo/margin_mean': 10.299551010131836, 'eval_margin_dpo/margin_std': 14.652626991271973, 'eval_logps/chosen': -119.31637573242188, 'eval_logps/rejected': -112.73600769042969, 'eval_logps/ref_chosen': -97.0617446899414, 'eval_logps/ref_rejected': -80.18183135986328, 'eval_logits/chosen': -0.602095365524292, 'eval_logits/rejected': -0.5640405416488647, 'epoch': 0.59}

 59%|█████▉    | 200/340 [09:18<05:50,  2.50s/it]

100%|██████████| 18/18 [00:19<00:00,  1.06s/it]
[INFO|trainer.py:3984] 2026-04-10 17:45:12,597 >> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200
[INFO|configuration_utils.py:419] 2026-04-10 17:45:12,605 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/config.json
[INFO|configuration_utils.py:911] 2026-04-10 17:45:12,610 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-10 17:45:56,024 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-10 17:45:56,031 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-10 17:45:56,034 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/special_tokens_map.json

 59%|█████▉    | 201/340 [13:21<3:07:16, 80.84s/it]
 59%|█████▉    | 202/340 [13:24<2:11:48, 57.31s/it]
 60%|█████▉    | 203/340 [13:26<1:33:12, 40.82s/it]
 60%|██████    | 204/340 [13:29<1:06:30, 29.34s/it]
 60%|██████    | 205/340 [13:31<47:58, 21.32s/it]  
                                                 
{'loss': 0.3792, 'grad_norm': 19.535417556762695, 'learning_rate': 2.065879555832674e-07, 'margin_dpo/margin_mean': 13.387273788452148, 'margin_dpo/margin_std': 14.807754516601562, 'logps/chosen': -104.2248764038086, 'logps/rejected': -104.95343017578125, 'logps/ref_chosen': -84.44103240966797, 'logps/ref_rejected': -71.78230285644531, 'logits/chosen': -0.5760528445243835, 'logits/rejected': -0.5279114842414856, 'epoch': 0.6}

 60%|██████    | 205/340 [13:31<47:58, 21.32s/it]
 61%|██████    | 206/340 [13:33<34:50, 15.60s/it]
 61%|██████    | 207/340 [13:36<25:57, 11.71s/it]
 61%|██████    | 208/340 [13:39<19:46,  8.99s/it]
 61%|██████▏   | 209/340 [13:41<15:25,  7.06s/it]
 62%|██████▏   | 210/340 [13:44<12:19,  5.69s/it]
                                                 
{'loss': 0.3251, 'grad_norm': 17.17575454711914, 'learning_rate': 1.9401235374032425e-07, 'margin_dpo/margin_mean': 14.702362060546875, 'margin_dpo/margin_std': 16.377933502197266, 'logps/chosen': -101.36656188964844, 'logps/rejected': -108.5728988647461, 'logps/ref_chosen': -83.94493103027344, 'logps/ref_rejected': -76.44892120361328, 'logits/chosen': -0.6245664358139038, 'logits/rejected': -0.5699684619903564, 'epoch': 0.62}

 62%|██████▏   | 210/340 [13:44<12:19,  5.69s/it]
 62%|██████▏   | 211/340 [13:46<10:16,  4.78s/it]
 62%|██████▏   | 212/340 [13:49<08:48,  4.13s/it]
 63%|██████▎   | 213/340 [13:52<07:47,  3.68s/it]
 63%|██████▎   | 214/340 [13:54<07:03,  3.36s/it]
 63%|██████▎   | 215/340 [13:57<06:33,  3.15s/it]
                                                 
{'loss': 0.3633, 'grad_norm': 20.044084548950195, 'learning_rate': 1.8158425248197928e-07, 'margin_dpo/margin_mean': 16.278963088989258, 'margin_dpo/margin_std': 19.206457138061523, 'logps/chosen': -102.8707275390625, 'logps/rejected': -122.053955078125, 'logps/ref_chosen': -82.23881530761719, 'logps/ref_rejected': -85.1430892944336, 'logits/chosen': -0.5605936050415039, 'logits/rejected': -0.5190353393554688, 'epoch': 0.63}

 63%|██████▎   | 215/340 [13:57<06:33,  3.15s/it]
 64%|██████▎   | 216/340 [14:00<06:11,  2.99s/it]
 64%|██████▍   | 217/340 [14:02<05:50,  2.85s/it]
 64%|██████▍   | 218/340 [14:05<05:45,  2.83s/it]
 64%|██████▍   | 219/340 [14:07<05:34,  2.77s/it]
 65%|██████▍   | 220/340 [14:10<05:24,  2.71s/it]
                                                 
{'loss': 0.3587, 'grad_norm': 21.036956787109375, 'learning_rate': 1.6933639389195134e-07, 'margin_dpo/margin_mean': 11.612079620361328, 'margin_dpo/margin_std': 14.565820693969727, 'logps/chosen': -97.38944244384766, 'logps/rejected': -117.23432922363281, 'logps/ref_chosen': -76.5594482421875, 'logps/ref_rejected': -84.79225158691406, 'logits/chosen': -0.621160626411438, 'logits/rejected': -0.585429310798645, 'epoch': 0.65}

 65%|██████▍   | 220/340 [14:10<05:24,  2.71s/it]
 65%|██████▌   | 221/340 [14:13<05:21,  2.70s/it]
 65%|██████▌   | 222/340 [14:15<05:17,  2.69s/it]
 66%|██████▌   | 223/340 [14:18<05:11,  2.66s/it]
 66%|██████▌   | 224/340 [14:20<05:01,  2.60s/it]
 66%|██████▌   | 225/340 [14:23<04:55,  2.57s/it]
                                                 
{'loss': 0.3385, 'grad_norm': 21.023571014404297, 'learning_rate': 1.573010452010098e-07, 'margin_dpo/margin_mean': 18.626880645751953, 'margin_dpo/margin_std': 18.950374603271484, 'logps/chosen': -87.20682525634766, 'logps/rejected': -132.78231811523438, 'logps/ref_chosen': -68.70957946777344, 'logps/ref_rejected': -95.65819549560547, 'logits/chosen': -0.6097210049629211, 'logits/rejected': -0.6041680574417114, 'epoch': 0.66}

 66%|██████▌   | 225/340 [14:23<04:55,  2.57s/it]
 66%|██████▋   | 226/340 [14:26<04:54,  2.58s/it]
 67%|██████▋   | 227/340 [14:28<04:50,  2.57s/it]
 67%|██████▋   | 228/340 [14:31<04:49,  2.58s/it]
 67%|██████▋   | 229/340 [14:33<04:48,  2.60s/it]
 68%|██████▊   | 230/340 [14:36<04:46,  2.61s/it]
                                                 
{'loss': 0.3269, 'grad_norm': 19.34729766845703, 'learning_rate': 1.4550991377830423e-07, 'margin_dpo/margin_mean': 14.579324722290039, 'margin_dpo/margin_std': 14.860456466674805, 'logps/chosen': -92.71955871582031, 'logps/rejected': -129.41712951660156, 'logps/ref_chosen': -76.04148864746094, 'logps/ref_rejected': -98.15973663330078, 'logits/chosen': -0.6367233395576477, 'logits/rejected': -0.5984948873519897, 'epoch': 0.68}

 68%|██████▊   | 230/340 [14:36<04:46,  2.61s/it]
 68%|██████▊   | 231/340 [14:39<04:43,  2.60s/it]
 68%|██████▊   | 232/340 [14:41<04:38,  2.58s/it]
 69%|██████▊   | 233/340 [14:44<04:36,  2.58s/it]
 69%|██████▉   | 234/340 [14:46<04:35,  2.59s/it]
 69%|██████▉   | 235/340 [14:49<04:34,  2.61s/it]
                                                 
{'loss': 0.3347, 'grad_norm': 18.263099670410156, 'learning_rate': 1.339940635976592e-07, 'margin_dpo/margin_mean': 19.314985275268555, 'margin_dpo/margin_std': 15.413273811340332, 'logps/chosen': -88.53390502929688, 'logps/rejected': -127.80912780761719, 'logps/ref_chosen': -70.64253997802734, 'logps/ref_rejected': -90.60277557373047, 'logits/chosen': -0.6155376434326172, 'logits/rejected': -0.5955866575241089, 'epoch': 0.69}

 69%|██████▉   | 235/340 [14:49<04:34,  2.61s/it]
 69%|██████▉   | 236/340 [14:52<04:33,  2.63s/it]
 70%|██████▉   | 237/340 [14:54<04:31,  2.63s/it]
 70%|███████   | 238/340 [14:57<04:27,  2.62s/it]
 70%|███████   | 239/340 [14:59<04:24,  2.62s/it]
 71%|███████   | 240/340 [15:02<04:17,  2.57s/it]
                                                 
{'loss': 0.3433, 'grad_norm': 21.18890380859375, 'learning_rate': 1.227838333989088e-07, 'margin_dpo/margin_mean': 17.56354331970215, 'margin_dpo/margin_std': 16.671550750732422, 'logps/chosen': -94.69210052490234, 'logps/rejected': -106.57359313964844, 'logps/ref_chosen': -75.90282440185547, 'logps/ref_rejected': -70.22077178955078, 'logits/chosen': -0.5532498955726624, 'logits/rejected': -0.5167180299758911, 'epoch': 0.71}

 71%|███████   | 240/340 [15:02<04:17,  2.57s/it]
 71%|███████   | 241/340 [15:05<04:15,  2.58s/it]
 71%|███████   | 242/340 [15:07<04:07,  2.53s/it]
 71%|███████▏  | 243/340 [15:10<04:08,  2.57s/it]
 72%|███████▏  | 244/340 [15:12<04:06,  2.56s/it]
 72%|███████▏  | 245/340 [15:15<04:05,  2.59s/it]
                                                 
{'loss': 0.3073, 'grad_norm': 19.42283058166504, 'learning_rate': 1.1190875675987355e-07, 'margin_dpo/margin_mean': 21.223926544189453, 'margin_dpo/margin_std': 16.53793716430664, 'logps/chosen': -87.87870788574219, 'logps/rejected': -142.7686767578125, 'logps/ref_chosen': -68.88108825683594, 'logps/ref_rejected': -102.547119140625, 'logits/chosen': -0.5711519122123718, 'logits/rejected': -0.5506427884101868, 'epoch': 0.72}

 72%|███████▏  | 245/340 [15:15<04:05,  2.59s/it]
 72%|███████▏  | 246/340 [15:17<04:03,  2.59s/it]
 73%|███████▎  | 247/340 [15:20<04:03,  2.62s/it]
 73%|███████▎  | 248/340 [15:23<03:59,  2.60s/it]
 73%|███████▎  | 249/340 [15:25<03:56,  2.60s/it]
 74%|███████▎  | 250/340 [15:28<03:52,  2.59s/it]
                                                 
{'loss': 0.4138, 'grad_norm': 21.975610733032227, 'learning_rate': 1.0139748428955333e-07, 'margin_dpo/margin_mean': 16.201473236083984, 'margin_dpo/margin_std': 15.055798530578613, 'logps/chosen': -104.53717041015625, 'logps/rejected': -118.47982025146484, 'logps/ref_chosen': -88.11860656738281, 'logps/ref_rejected': -85.85978698730469, 'logits/chosen': -0.63815838098526, 'logits/rejected': -0.5797184705734253, 'epoch': 0.74}

 74%|███████▎  | 250/340 [15:28<03:52,  2.59s/it]
 74%|███████▍  | 251/340 [15:30<03:50,  2.59s/it]
 74%|███████▍  | 252/340 [15:33<03:49,  2.61s/it]
 74%|███████▍  | 253/340 [15:36<03:45,  2.59s/it]
 75%|███████▍  | 254/340 [15:38<03:49,  2.67s/it]
 75%|███████▌  | 255/340 [15:41<03:46,  2.66s/it]
                                                 
{'loss': 0.3314, 'grad_norm': 21.86973762512207, 'learning_rate': 9.127770814751932e-08, 'margin_dpo/margin_mean': 16.87302017211914, 'margin_dpo/margin_std': 16.191524505615234, 'logps/chosen': -113.81512451171875, 'logps/rejected': -123.86918640136719, 'logps/ref_chosen': -93.02457427978516, 'logps/ref_rejected': -86.20562744140625, 'logits/chosen': -0.5965814590454102, 'logits/rejected': -0.5407648682594299, 'epoch': 0.75}

 75%|███████▌  | 255/340 [15:41<03:46,  2.66s/it]
 75%|███████▌  | 256/340 [15:44<03:43,  2.66s/it]
 76%|███████▌  | 257/340 [15:46<03:42,  2.69s/it]
 76%|███████▌  | 258/340 [15:49<03:39,  2.68s/it]
 76%|███████▌  | 259/340 [15:52<03:34,  2.65s/it]
 76%|███████▋  | 260/340 [15:54<03:32,  2.65s/it]
                                                 
{'loss': 0.3414, 'grad_norm': 20.748577117919922, 'learning_rate': 8.15760890883607e-08, 'margin_dpo/margin_mean': 20.42922592163086, 'margin_dpo/margin_std': 16.98196029663086, 'logps/chosen': -98.30900573730469, 'logps/rejected': -133.5509796142578, 'logps/ref_chosen': -79.27108001708984, 'logps/ref_rejected': -94.08381652832031, 'logits/chosen': -0.5860427618026733, 'logits/rejected': -0.5433794856071472, 'epoch': 0.76}

 76%|███████▋  | 260/340 [15:54<03:32,  2.65s/it]
 77%|███████▋  | 261/340 [15:57<03:26,  2.62s/it]
 77%|███████▋  | 262/340 [16:00<03:25,  2.63s/it]
 77%|███████▋  | 263/340 [16:02<03:21,  2.62s/it]
 78%|███████▊  | 264/340 [16:05<03:19,  2.63s/it]
 78%|███████▊  | 265/340 [16:07<03:16,  2.62s/it]
                                                 
{'loss': 0.3493, 'grad_norm': 20.377286911010742, 'learning_rate': 7.231818622338822e-08, 'margin_dpo/margin_mean': 15.021594047546387, 'margin_dpo/margin_std': 12.837465286254883, 'logps/chosen': -99.11347198486328, 'logps/rejected': -126.92435455322266, 'logps/ref_chosen': -79.24869537353516, 'logps/ref_rejected': -92.03797912597656, 'logits/chosen': -0.5678300857543945, 'logits/rejected': -0.5425071120262146, 'epoch': 0.78}

 78%|███████▊  | 265/340 [16:07<03:16,  2.62s/it]
 78%|███████▊  | 266/340 [16:10<03:15,  2.64s/it]
 79%|███████▊  | 267/340 [16:13<03:11,  2.62s/it]
 79%|███████▉  | 268/340 [16:15<03:10,  2.65s/it]
 79%|███████▉  | 269/340 [16:18<03:05,  2.61s/it]
 79%|███████▉  | 270/340 [16:21<03:03,  2.63s/it]
                                                 
{'loss': 0.332, 'grad_norm': 17.822444915771484, 'learning_rate': 6.352838968463919e-08, 'margin_dpo/margin_mean': 16.91426658630371, 'margin_dpo/margin_std': 14.53496265411377, 'logps/chosen': -97.48078918457031, 'logps/rejected': -116.37190246582031, 'logps/ref_chosen': -80.15914154052734, 'logps/ref_rejected': -82.13599395751953, 'logits/chosen': -0.606745719909668, 'logits/rejected': -0.5473134517669678, 'epoch': 0.79}

 79%|███████▉  | 270/340 [16:21<03:03,  2.63s/it]
 80%|███████▉  | 271/340 [16:23<03:00,  2.62s/it]
 80%|████████  | 272/340 [16:26<02:57,  2.60s/it]
 80%|████████  | 273/340 [16:28<02:54,  2.60s/it]
 81%|████████  | 274/340 [16:31<02:51,  2.61s/it]
 81%|████████  | 275/340 [16:34<02:49,  2.61s/it]
                                                 
{'loss': 0.3348, 'grad_norm': 20.570648193359375, 'learning_rate': 5.5229856368582376e-08, 'margin_dpo/margin_mean': 16.90357780456543, 'margin_dpo/margin_std': 20.21615219116211, 'logps/chosen': -99.41848754882812, 'logps/rejected': -122.4229965209961, 'logps/ref_chosen': -78.87225341796875, 'logps/ref_rejected': -84.97318267822266, 'logits/chosen': -0.6010477542877197, 'logits/rejected': -0.5661951899528503, 'epoch': 0.81}

 81%|████████  | 275/340 [16:34<02:49,  2.61s/it]
 81%|████████  | 276/340 [16:36<02:43,  2.55s/it]
 81%|████████▏ | 277/340 [16:39<02:41,  2.57s/it]
 82%|████████▏ | 278/340 [16:41<02:39,  2.57s/it]
 82%|████████▏ | 279/340 [16:44<02:35,  2.55s/it]
 82%|████████▏ | 280/340 [16:46<02:35,  2.59s/it]
                                                 
{'loss': 0.3329, 'grad_norm': 18.737754821777344, 'learning_rate': 4.7444448928806615e-08, 'margin_dpo/margin_mean': 20.195457458496094, 'margin_dpo/margin_std': 19.39859390258789, 'logps/chosen': -117.15876770019531, 'logps/rejected': -154.00479125976562, 'logps/ref_chosen': -96.47113800048828, 'logps/ref_rejected': -113.1217041015625, 'logits/chosen': -0.5662145018577576, 'logits/rejected': -0.525722324848175, 'epoch': 0.82}

 82%|████████▏ | 280/340 [16:46<02:35,  2.59s/it]
 83%|████████▎ | 281/340 [16:49<02:35,  2.64s/it]
 83%|████████▎ | 282/340 [16:52<02:32,  2.63s/it]
 83%|████████▎ | 283/340 [16:54<02:30,  2.64s/it]
 84%|████████▎ | 284/340 [16:57<02:26,  2.61s/it]
 84%|████████▍ | 285/340 [16:59<02:22,  2.58s/it]
                                                 
{'loss': 0.3382, 'grad_norm': 21.463726043701172, 'learning_rate': 4.019267817841834e-08, 'margin_dpo/margin_mean': 17.379127502441406, 'margin_dpo/margin_std': 17.829914093017578, 'logps/chosen': -111.90663146972656, 'logps/rejected': -114.01655578613281, 'logps/ref_chosen': -91.53522491455078, 'logps/ref_rejected': -76.2660140991211, 'logits/chosen': -0.630197286605835, 'logits/rejected': -0.5674210786819458, 'epoch': 0.84}

 84%|████████▍ | 285/340 [16:59<02:22,  2.58s/it]
 84%|████████▍ | 286/340 [17:02<02:19,  2.58s/it]
 84%|████████▍ | 287/340 [17:05<02:16,  2.57s/it]
 85%|████████▍ | 288/340 [17:07<02:16,  2.63s/it]
 85%|████████▌ | 289/340 [17:10<02:13,  2.61s/it]
 85%|████████▌ | 290/340 [17:13<02:11,  2.63s/it]
                                                 
{'loss': 0.3409, 'grad_norm': 18.62375831604004, 'learning_rate': 3.349364905389032e-08, 'margin_dpo/margin_mean': 18.841894149780273, 'margin_dpo/margin_std': 18.295745849609375, 'logps/chosen': -98.92496490478516, 'logps/rejected': -117.43675231933594, 'logps/ref_chosen': -78.96186828613281, 'logps/ref_rejected': -78.63177490234375, 'logits/chosen': -0.5863774418830872, 'logits/rejected': -0.5456980466842651, 'epoch': 0.85}

 85%|████████▌ | 290/340 [17:13<02:11,  2.63s/it]
 86%|████████▌ | 291/340 [17:15<02:08,  2.62s/it]
 86%|████████▌ | 292/340 [17:18<02:05,  2.61s/it]
 86%|████████▌ | 293/340 [17:20<02:02,  2.61s/it]
 86%|████████▋ | 294/340 [17:23<01:58,  2.57s/it]
 87%|████████▋ | 295/340 [17:25<01:55,  2.57s/it]
                                                 
{'loss': 0.3351, 'grad_norm': 16.586910247802734, 'learning_rate': 2.736501028272095e-08, 'margin_dpo/margin_mean': 15.721613883972168, 'margin_dpo/margin_std': 16.5610294342041, 'logps/chosen': -85.10719299316406, 'logps/rejected': -135.39389038085938, 'logps/ref_chosen': -64.14302825927734, 'logps/ref_rejected': -98.70811462402344, 'logits/chosen': -0.5259509086608887, 'logits/rejected': -0.5359938144683838, 'epoch': 0.87}

 87%|████████▋ | 295/340 [17:25<01:55,  2.57s/it]
 87%|████████▋ | 296/340 [17:28<01:53,  2.57s/it]
 87%|████████▋ | 297/340 [17:31<01:50,  2.58s/it]
 88%|████████▊ | 298/340 [17:33<01:48,  2.59s/it]
 88%|████████▊ | 299/340 [17:36<01:45,  2.57s/it]
 88%|████████▊ | 300/340 [17:38<01:43,  2.59s/it]
                                                 
{'loss': 0.3552, 'grad_norm': 19.39561653137207, 'learning_rate': 2.1822907887504932e-08, 'margin_dpo/margin_mean': 18.2686824798584, 'margin_dpo/margin_std': 16.341278076171875, 'logps/chosen': -80.19596099853516, 'logps/rejected': -130.80763244628906, 'logps/ref_chosen': -59.2784423828125, 'logps/ref_rejected': -91.62141418457031, 'logits/chosen': -0.5196036696434021, 'logits/rejected': -0.5250274538993835, 'epoch': 0.88}

 88%|████████▊ | 300/340 [17:38<01:43,  2.59s/it]
[INFO|trainer.py:4307] 2026-04-10 17:53:17,548 >> 
***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-10 17:53:17,548 >>   Num examples = 2339
[INFO|trainer.py:4312] 2026-04-10 17:53:17,548 >>   Batch size = 16


  0%|          | 0/18 [00:00<?, ?it/s]
 11%|█         | 2/18 [00:01<00:08,  1.90it/s]
 17%|█▋        | 3/18 [00:02<00:11,  1.32it/s]
 22%|██▏       | 4/18 [00:03<00:12,  1.13it/s]
 28%|██▊       | 5/18 [00:04<00:13,  1.00s/it]
 33%|███▎      | 6/18 [00:05<00:12,  1.04s/it]
 39%|███▉      | 7/18 [00:06<00:11,  1.07s/it]
 44%|████▍     | 8/18 [00:07<00:10,  1.07s/it]
 50%|█████     | 9/18 [00:08<00:09,  1.08s/it]
 56%|█████▌    | 10/18 [00:10<00:08,  1.12s/it]
 61%|██████    | 11/18 [00:11<00:07,  1.09s/it]
 67%|██████▋   | 12/18 [00:12<00:06,  1.15s/it]
 72%|███████▏  | 13/18 [00:13<00:05,  1.11s/it]
 78%|███████▊  | 14/18 [00:14<00:04,  1.15s/it]
 83%|████████▎ | 15/18 [00:15<00:03,  1.15s/it]
 89%|████████▉ | 16/18 [00:16<00:02,  1.10s/it]
 94%|█████████▍| 17/18 [00:17<00:01,  1.13s/it]
100%|██████████| 18/18 [00:18<00:00,  1.06s/it]
                                                 

{'eval_loss': 0.4588142931461334, 'eval_runtime': 20.3107, 'eval_samples_per_second': 115.161, 'eval_steps_per_second': 0.935, 'eval_margin_dpo/margin_mean': 11.118718147277832, 'eval_margin_dpo/margin_std': 15.069600105285645, 'eval_logps/chosen': -119.7147216796875, 'eval_logps/rejected': -113.95352935791016, 'eval_logps/ref_chosen': -97.0617446899414, 'eval_logps/ref_rejected': -80.18183135986328, 'eval_logits/chosen': -0.5876314640045166, 'eval_logits/rejected': -0.5494834184646606, 'epoch': 0.88}

 88%|████████▊ | 300/340 [17:59<01:43,  2.59s/it]

100%|██████████| 18/18 [00:19<00:00,  1.06s/it]
 89%|████████▊ | 301/340 [18:01<05:39,  8.70s/it]
 89%|████████▉ | 302/340 [18:04<04:20,  6.86s/it]
 89%|████████▉ | 303/340 [18:06<03:26,  5.57s/it]
 89%|████████▉ | 304/340 [18:09<02:47,  4.66s/it]
 90%|████████▉ | 305/340 [18:11<02:18,  3.95s/it]
                                                 
{'loss': 0.3241, 'grad_norm': 20.72559356689453, 'learning_rate': 1.6881942648911074e-08, 'margin_dpo/margin_mean': 20.912582397460938, 'margin_dpo/margin_std': 15.790578842163086, 'logps/chosen': -110.14324951171875, 'logps/rejected': -134.03268432617188, 'logps/ref_chosen': -90.05252838134766, 'logps/ref_rejected': -93.02938842773438, 'logits/chosen': -0.6059945821762085, 'logits/rejected': -0.5594589710235596, 'epoch': 0.9}

 90%|████████▉ | 305/340 [18:11<02:18,  3.95s/it]
 90%|█████████ | 306/340 [18:14<01:59,  3.52s/it]
 90%|█████████ | 307/340 [18:16<01:46,  3.24s/it]
 91%|█████████ | 308/340 [18:19<01:38,  3.07s/it]
 91%|█████████ | 309/340 [18:22<01:31,  2.94s/it]
 91%|█████████ | 310/340 [18:24<01:24,  2.82s/it]
                                                 
{'loss': 0.3336, 'grad_norm': 18.592208862304688, 'learning_rate': 1.2555131639630567e-08, 'margin_dpo/margin_mean': 17.357715606689453, 'margin_dpo/margin_std': 17.407108306884766, 'logps/chosen': -99.32337951660156, 'logps/rejected': -121.98432922363281, 'logps/ref_chosen': -76.26285552978516, 'logps/ref_rejected': -81.56607055664062, 'logits/chosen': -0.5199320316314697, 'logits/rejected': -0.48348456621170044, 'epoch': 0.91}

 91%|█████████ | 310/340 [18:24<01:24,  2.82s/it]
 91%|█████████▏| 311/340 [18:27<01:19,  2.75s/it]
 92%|█████████▏| 312/340 [18:30<01:16,  2.73s/it]
 92%|█████████▏| 313/340 [18:32<01:12,  2.69s/it]
 92%|█████████▏| 314/340 [18:35<01:09,  2.67s/it]
 93%|█████████▎| 315/340 [18:37<01:06,  2.65s/it]
                                                 
{'loss': 0.3444, 'grad_norm': 19.586881637573242, 'learning_rate': 8.85387393063622e-09, 'margin_dpo/margin_mean': 17.211929321289062, 'margin_dpo/margin_std': 18.306108474731445, 'logps/chosen': -108.92083740234375, 'logps/rejected': -129.36099243164062, 'logps/ref_chosen': -89.47105407714844, 'logps/ref_rejected': -92.69927215576172, 'logits/chosen': -0.5956140160560608, 'logits/rejected': -0.5609453916549683, 'epoch': 0.93}

 93%|█████████▎| 315/340 [18:37<01:06,  2.65s/it]
 93%|█████████▎| 316/340 [18:40<01:02,  2.60s/it]
 93%|█████████▎| 317/340 [18:42<00:59,  2.60s/it]
 94%|█████████▎| 318/340 [18:45<00:57,  2.60s/it]
 94%|█████████▍| 319/340 [18:48<00:54,  2.61s/it]
 94%|█████████▍| 320/340 [18:50<00:51,  2.60s/it]
                                                 
{'loss': 0.3732, 'grad_norm': 21.697298049926758, 'learning_rate': 5.7879205600998296e-09, 'margin_dpo/margin_mean': 15.386384963989258, 'margin_dpo/margin_std': 15.031097412109375, 'logps/chosen': -98.2002182006836, 'logps/rejected': -102.35930633544922, 'logps/ref_chosen': -76.45301818847656, 'logps/ref_rejected': -65.2257308959961, 'logits/chosen': -0.6173444986343384, 'logits/rejected': -0.5614223480224609, 'epoch': 0.94}

 94%|█████████▍| 320/340 [18:50<00:51,  2.60s/it]
 94%|█████████▍| 321/340 [18:53<00:49,  2.60s/it]
 95%|█████████▍| 322/340 [18:55<00:46,  2.60s/it]
 95%|█████████▌| 323/340 [18:58<00:44,  2.61s/it]
 95%|█████████▌| 324/340 [19:01<00:41,  2.61s/it]
 96%|█████████▌| 325/340 [19:03<00:39,  2.60s/it]
                                                 
{'loss': 0.3374, 'grad_norm': 22.468570709228516, 'learning_rate': 3.3653488440851253e-09, 'margin_dpo/margin_mean': 15.664227485656738, 'margin_dpo/margin_std': 12.905950546264648, 'logps/chosen': -89.80387878417969, 'logps/rejected': -102.19793701171875, 'logps/ref_chosen': -71.98212432861328, 'logps/ref_rejected': -68.71195983886719, 'logits/chosen': -0.5936331152915955, 'logits/rejected': -0.5392800569534302, 'epoch': 0.96}

 96%|█████████▌| 325/340 [19:03<00:39,  2.60s/it]
 96%|█████████▌| 326/340 [19:06<00:36,  2.58s/it]
 96%|█████████▌| 327/340 [19:08<00:32,  2.54s/it]
 96%|█████████▋| 328/340 [19:11<00:30,  2.55s/it]
 97%|█████████▋| 329/340 [19:13<00:27,  2.54s/it]
 97%|█████████▋| 330/340 [19:16<00:25,  2.56s/it]
                                                 
{'loss': 0.3212, 'grad_norm': 18.491226196289062, 'learning_rate': 1.592541096695571e-09, 'margin_dpo/margin_mean': 21.361230850219727, 'margin_dpo/margin_std': 19.999116897583008, 'logps/chosen': -95.72080993652344, 'logps/rejected': -132.95394897460938, 'logps/ref_chosen': -77.13968658447266, 'logps/ref_rejected': -93.0115737915039, 'logits/chosen': -0.5897213816642761, 'logits/rejected': -0.5493496656417847, 'epoch': 0.97}

 97%|█████████▋| 330/340 [19:16<00:25,  2.56s/it]
 97%|█████████▋| 331/340 [19:19<00:23,  2.57s/it]
 98%|█████████▊| 332/340 [19:21<00:21,  2.63s/it]
 98%|█████████▊| 333/340 [19:24<00:18,  2.62s/it]
 98%|█████████▊| 334/340 [19:27<00:15,  2.66s/it]
 99%|█████████▊| 335/340 [19:29<00:13,  2.64s/it]
                                                 
{'loss': 0.3225, 'grad_norm': 17.843168258666992, 'learning_rate': 4.741678157389739e-10, 'margin_dpo/margin_mean': 13.225196838378906, 'margin_dpo/margin_std': 12.341458320617676, 'logps/chosen': -97.75109100341797, 'logps/rejected': -106.00955963134766, 'logps/ref_chosen': -78.12508392333984, 'logps/ref_rejected': -73.1583480834961, 'logits/chosen': -0.5449101328849792, 'logits/rejected': -0.506639301776886, 'epoch': 0.99}

 99%|█████████▊| 335/340 [19:29<00:13,  2.64s/it]
 99%|█████████▉| 336/340 [19:32<00:10,  2.59s/it]
 99%|█████████▉| 337/340 [19:34<00:07,  2.57s/it]
 99%|█████████▉| 338/340 [19:37<00:05,  2.56s/it]
100%|█████████▉| 339/340 [19:39<00:02,  2.48s/it]
100%|██████████| 340/340 [19:42<00:00,  2.52s/it]
                                                 
{'loss': 0.3138, 'grad_norm': 19.59518051147461, 'learning_rate': 1.31753782067201e-11, 'margin_dpo/margin_mean': 18.689212799072266, 'margin_dpo/margin_std': 18.127058029174805, 'logps/chosen': -85.4710922241211, 'logps/rejected': -113.6316146850586, 'logps/ref_chosen': -64.36441802978516, 'logps/ref_rejected': -73.83573913574219, 'logits/chosen': -0.5786937475204468, 'logits/rejected': -0.544124186038971, 'epoch': 1.0}

100%|██████████| 340/340 [19:42<00:00,  2.52s/it]
[INFO|trainer.py:3984] 2026-04-10 17:55:36,220 >> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340
[INFO|configuration_utils.py:419] 2026-04-10 17:55:36,227 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/config.json
[INFO|configuration_utils.py:911] 2026-04-10 17:55:36,231 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-10 17:56:15,466 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-10 17:56:15,471 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-10 17:56:15,474 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/special_tokens_map.json
[INFO|trainer.py:2681] 2026-04-10 17:59:29,929 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 1436.8705, 'train_samples_per_second': 30.342, 'train_steps_per_second': 0.237, 'train_loss': 0.4133688477908864, 'epoch': 1.0}

100%|██████████| 340/340 [23:51<00:00,  2.52s/it]
100%|██████████| 340/340 [23:51<00:00,  4.21s/it]
***** train metrics *****
  epoch                    =        1.0
  total_flos               =        0GF
  train_loss               =     0.4134
  train_runtime            = 0:23:56.87
  train_samples            =      43598
  train_samples_per_second =     30.342
  train_steps_per_second   =      0.237
2026-04-10 17:59:29 - INFO - __main__ - *** Training complete ***
2026-04-10 17:59:29 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-10 17:59:47,763 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/config.json
[INFO|configuration_utils.py:911] 2026-04-10 17:59:47,771 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-10 18:00:39,415 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-10 18:00:39,450 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-10 18:00:39,459 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/special_tokens_map.json
2026-04-10 18:00:39 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009
[INFO|modelcard.py:450] 2026-04-10 18:00:39,763 >> Dropping the following result as it does not have all the necessary fields:
{'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}}
[INFO|configuration_utils.py:419] 2026-04-10 18:00:39,776 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/config.json
2026-04-10 18:00:39 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-10 18:00:39,777 >> 
***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-10 18:00:39,777 >>   Num examples = 2339
[INFO|trainer.py:4312] 2026-04-10 18:00:39,777 >>   Batch size = 16

100%|██████████| 18/18 [00:19<00:00,  1.06s/it]
***** eval metrics *****
  epoch                       =        1.0
  eval_logits/chosen          =    -0.5745
  eval_logits/rejected        =    -0.5353
  eval_logps/chosen           =   -119.713
  eval_logps/ref_chosen       =   -97.0617
  eval_logps/ref_rejected     =   -80.1818
  eval_logps/rejected         =  -113.9829
  eval_loss                   =     0.4583
  eval_margin_dpo/margin_mean =    11.1499
  eval_margin_dpo/margin_std  =    15.0412
  eval_runtime                = 0:00:20.26
  eval_samples                =       2339
  eval_samples_per_second     =    115.397
  eval_steps_per_second       =      0.937
2026-04-10 18:01:00 - INFO - __main__ - *** Training complete! ***
wandb: 
wandb: Run history:
wandb:           eval/logits/chosen ▁▅▇█
wandb:         eval/logits/rejected ▁▅▇█
wandb:            eval/logps/chosen █▁▁▁
wandb:        eval/logps/ref_chosen ▁▁▁▁
wandb:      eval/logps/ref_rejected ▁▁▁▁
wandb:          eval/logps/rejected █▁▁▁
wandb:                    eval/loss █▂▁▁
wandb:  eval/margin_dpo/margin_mean ▁▇██
wandb:   eval/margin_dpo/margin_std ▁▇██
wandb:                 eval/runtime █▃▃▁
wandb:      eval/samples_per_second ▁▆▆█
wandb:        eval/steps_per_second ▁▇▆█
wandb:                  train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
wandb:            train/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
wandb:              train/grad_norm ▅▅▇▅▄▂█▃▁▁▂▂▂▂▄▃▃▁▂▂▃▂▂▂▃▃▃▂▄▄▃▂▃▄▁▃▂▄▂▃
wandb:          train/learning_rate ▁▂▄▆▇██████▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁
wandb:          train/logits/chosen █▇▆▇▄▂▂▂▂▁▂▂▃▁▂▁▁▁▂▃▂▄▂▄▅▅▄▃▅▃▄▄▄▃▆▄▇▃▄▅
wandb:        train/logits/rejected █▆▆▆▄▃▂▂▂▁▂▃▃▁▂▁▁▁▂▃▂▃▃▄▅▆▃▃▆▄▅▅▄▄▅▄▇▄▅▅
wandb:           train/logps/chosen ▆▆▇█▆▇▆▁▅▆▅▆▅▄▅▄▃▄▃▄▄▅▃▄▂▂▄▄▃▂▃▃▃▁▅▁▃▃▃▅
wandb:       train/logps/ref_chosen ▆▆▇█▅▇▆▁▆▆▅▇▆▅▅▅▅▅▅▆▆█▅▆▄▅▇▇▆▄▅▅▅▃█▄▆▆▆▇
wandb:     train/logps/ref_rejected ▆▆▆▆▄▄▄▃▄▄▃█▂▃▄▂▂▅▅▃▃▃▆▆▆▄▂▃▇▃▂▄▄▅▁▂▄▇▂▆
wandb:         train/logps/rejected ████▇▆▅▅▆▅▅▇▄▄▄▃▃▄▄▃▃▂▄▄▄▂▁▂▄▃▁▃▂▃▁▁▂▅▁▃
wandb:                   train/loss ███▇▇▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▂▃▂▁▁▁▁▁▁▂▁▁
wandb: train/margin_dpo/margin_mean ▁▁▁▁▂▃▂▃▃▄▃▃▄▅▅▅▅▆▅▇▆▇▆▆▅▆▇▇▇▆█▇▇▇▆█▇▆█▇
wandb:  train/margin_dpo/margin_std ▁▁▁▁▂▄▄▅▃▄▄▅▅▅▅▇▆▆▆▆▅▆▆▆▆██▆▇▆▇▆█▇▇▆▇▆█▇
wandb: 
wandb: Run summary:
wandb:           eval/logits/chosen -0.57455
wandb:         eval/logits/rejected -0.53526
wandb:            eval/logps/chosen -119.71296
wandb:        eval/logps/ref_chosen -97.06174
wandb:      eval/logps/ref_rejected -80.18183
wandb:          eval/logps/rejected -113.98293
wandb:                    eval/loss 0.45825
wandb:  eval/margin_dpo/margin_mean 11.14989
wandb:   eval/margin_dpo/margin_std 15.04124
wandb:                 eval/runtime 20.2691
wandb:      eval/samples_per_second 115.397
wandb:        eval/steps_per_second 0.937
wandb:                   total_flos 0.0
wandb:                  train/epoch 1.0
wandb:            train/global_step 340
wandb:              train/grad_norm 19.59518
wandb:          train/learning_rate 0.0
wandb:          train/logits/chosen -0.57869
wandb:        train/logits/rejected -0.54412
wandb:           train/logps/chosen -85.47109
wandb:       train/logps/ref_chosen -64.36442
wandb:     train/logps/ref_rejected -73.83574
wandb:         train/logps/rejected -113.63161
wandb:                   train/loss 0.3138
wandb: train/margin_dpo/margin_mean 18.68921
wandb:  train/margin_dpo/margin_std 18.12706
wandb:                   train_loss 0.41337
wandb:                train_runtime 1436.8705
wandb:     train_samples_per_second 30.342
wandb:       train_steps_per_second 0.237
wandb: 
wandb: 🚀 View run llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009 at: https://wandb.ai/can-not-fand-northeastern-university/huggingface/runs/wep2te2x
wandb: ⭐️ View project at: https://wandb.ai/can-not-fand-northeastern-university/huggingface
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: /scratch/feng.yulu/dynamic-dpo-v4/wandb/wandb/run-20260410_173535-wep2te2x/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.