[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) 2026-04-10 17:20:29 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8') 2026-04-10 17:20:29 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/scratch/feng.yulu/dynamic-dpo-v4/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, preprocessing_log_samples=0, preprocessing_log_dir=None) 2026-04-10 
17:20:29 - INFO - __main__ - Training/evaluation parameters MarginDPOConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, beta=0.1, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_num_proc=12, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_dropout=True, disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=100, eval_strategy=IntervalStrategy.STEPS, eval_use_gather_object=False, f_alpha_divergence_coef=1.0, f_divergence_type=reverse_kl, force_use_ref_model=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generate_during_eval=False, gradient_accumulation_steps=1, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_margin_dataset_id=W-61/llama-3-8b-base-margin-dpo-hh-helpful-margin-log, hub_model_id=W-61/llama-3-8b-base-margin-dpo-hh-helpful, hub_model_revision=main, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], 
include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, is_encoder_decoder=None, jit_mode_eval=False, label_names=None, label_pad_token_id=-100, label_smoothing=0.0, label_smoothing_factor=0.0, learning_rate=5e-07, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=outputs/llama-3-8b-base-margin-dpo-hh-helpful/runs/Apr10_17-20-28_d4054, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=5, logging_strategy=IntervalStrategy.STEPS, loss_type=sigmoid, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, margin_dataset_private=None, margin_dataset_split=train, margin_log_path=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/margin_logs, margin_log_steps=1, margin_save_full=True, max_grad_norm=1.0, max_length=512, max_prompt_length=256, max_steps=-1, max_target_length=None, metric_for_best_model=None, model_adapter_name=None, model_init_kwargs=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, non_finite_logits_handling=error, num_train_epochs=1, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, output_dir=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009, overwrite_output_dir=False, padding_value=None, past_index=-1, per_device_eval_batch_size=16, per_device_train_batch_size=16, post_tokenization_log_dir=None, post_tokenization_log_samples=0, precompute_ref_batch_size=None, precompute_ref_eval_batch_size=None, precompute_ref_log_probs=False, prediction_loss_only=False, push_margin_dataset=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, ref_adapter_name=None, ref_model_init_kwargs=None, ref_model_mixup_alpha=0.9, ref_model_sync_steps=64, reference_free=False, 
remove_unused_columns=False, report_to=['wandb'], require_explicit_ref_model=True, restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, reuse_tokenized_dataset=True, rpo_alpha=None, run_name=llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=SaveStrategy.STEPS, save_total_limit=2, seed=42, sft_weight=0.0, skip_memory_metrics=True, sync_ref_model=False, tf32=None, tokenization_batch_size=128, tokenization_mode=online, tokenized_dataset_cache_dir=/scratch/feng.yulu/dynamic-dpo-v4/tokenized_preferences, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, trainer_type=margin_dpo, truncation_mode=keep_end, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, ) 2026-04-10 17:20:29 - INFO - __main__ - Margin-DPO parameters: beta=0.1, f_divergence_type=reverse_kl, margin_log_steps=1 2026-04-10 17:20:29 - INFO - __main__ - Using persistent HF datasets cache at /scratch/feng.yulu/dynamic-dpo-v4/hf/datasets 2026-04-10 17:20:32 - WARNING - __main__ - Dropped 237 non-canonical HH preference examples from split `train` before normalization (126 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 111 x HH chosen/rejected transcripts must each contain a divergent assistant response.). 
Normalizing raw HH preferences (train): 0%| | 0/43598 [00:00> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2058] 2026-04-10 17:20:38,197 >> loading file chat_template.jinja Normalizing raw HH preferences (test): 100%|██████████| 2339/2339 [00:00<00:00, 11420.15 examples/s] Normalizing raw HH preferences (test): 100%|██████████| 2339/2339 [00:00<00:00, 10759.00 examples/s] Normalizing raw HH preferences (test): 100%|█████████▉| 2334/2339 [00:00<00:00, 11771.19 examples/s] Normalizing raw HH preferences (test): 0%| | 0/2339 [00:00> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Formatting comparisons with prompt template (num_proc=12): 0%| | 0/43598 [00:00> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. 
Formatting comparisons with prompt template (num_proc=12): 97%|█████████▋| 2258/2339 [00:03<00:00, 845.34 examples/s] Formatting comparisons with prompt template (num_proc=12): 56%|█████▌ | 1304/2339 [00:02<00:01, 761.90 examples/s] Formatting comparisons with prompt template (num_proc=12): 39%|███▉ | 918/2339 [00:02<00:02, 626.02 examples/s] Formatting comparisons with prompt template (num_proc=12): 54%|█████▍ | 1274/2339 [00:02<00:01, 734.16 examples/s] Formatting comparisons with prompt template (num_proc=12): 71%|███████ | 1663/2339 [00:02<00:00, 727.41 examples/s] Loading checkpoint shards: 0%| | 0/7 [00:00<|start_header_id|>user<|end_header_id|> What is the best way to clean my refrigerator?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|eot_id|><|start_header_id|>user<|end_header_id|> What are your ideas?<|eot_id|> Chosen: <|start_header_id|>assistant<|end_header_id|> Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|eot_id|> Rejected: <|start_header_id|>assistant<|end_header_id|> I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap. I was also thinking that you could try using different tools, like a toothbrush, to clean the inside. Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel. 
I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect.<|eot_id|> /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( [INFO|configuration_utils.py:691] 2026-04-10 17:20:49,264 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/config.json [INFO|configuration_utils.py:765] 2026-04-10 17:20:49,265 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 } Formatting comparisons with prompt template (num_proc=12): 67%|██████▋ | 1560/2339 [00:02<00:00, 898.19 examples/s][INFO|modeling_utils.py:1121] 2026-04-10 17:20:49,278 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-10 17:20:49,279 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [WARNING|logging.py:328] 2026-04-10 17:20:49,281 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. 
[INFO|configuration_utils.py:1142] 2026-04-10 17:20:49,283 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false } Formatting comparisons with prompt template (num_proc=12): 72%|███████▏ | 1673/2339 [00:03<00:00, 789.71 examples/s] Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Formatting comparisons with prompt template (num_proc=12): 81%|████████ | 1894/2339 [00:02<00:00, 1093.28 examples/s] Formatting comparisons with prompt template (num_proc=12): 100%|█████████▉| 2331/2339 [00:03<00:00, 1038.87 examples/s] Formatting comparisons with prompt template (num_proc=12): 81%|████████ | 1883/2339 [00:03<00:00, 894.47 examples/s] Formatting comparisons with prompt template (num_proc=12): 99%|█████████▊| 2308/2339 [00:03<00:00, 1057.74 examples/s] Formatting comparisons with prompt template (num_proc=12): 67%|██████▋ | 1560/2339 [00:02<00:00, 993.35 examples/s]Traceback (most recent call last): File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap self.run() File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server server.serve_forever() File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever sys.exit(0) SystemExit: 0 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers finalizer() File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in 
__call__ res = self._callback(*self._args, **self._kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir rmtree(tempdir) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree _rmtree_safe_fd(fd, path, onerror) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd onerror(os.unlink, fullname, sys.exc_info()) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd os.unlink(entry.name, dir_fd=topfd) OSError: [Errno 16] Device or resource busy: '.nfs4ee5c347154bfc4e00001c88' Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 667.93 examples/s] Formatting comparisons with prompt template (num_proc=12): 90%|█████████ | 2116/2339 [00:03<00:00, 1069.99 examples/s] Formatting comparisons with prompt template (num_proc=12): 83%|████████▎ | 1950/2339 [00:03<00:00, 968.28 examples/s] Traceback (most recent call last): File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 314, in _bootstrap self.run() File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 600, in _run_server server.serve_forever() File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/managers.py", line 184, in serve_forever sys.exit(0) SystemExit: 0 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 300, in _run_finalizers finalizer() File 
"/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 224, in __call__ res = self._callback(*self._args, **self._kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/multiprocess/util.py", line 133, in _remove_temp_dir rmtree(tempdir) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 752, in rmtree _rmtree_safe_fd(fd, path, onerror) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd onerror(os.unlink, fullname, sys.exc_info()) File "/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd os.unlink(entry.name, dir_fd=topfd) OSError: [Errno 16] Device or resource busy: '.nfs88b0aa9233adc5a400001c8a' /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 665.59 examples/s] [WARNING|logging.py:328] 2026-04-10 17:20:49,650 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Formatting comparisons with prompt template (num_proc=12): 75%|███████▌ | 1755/2339 [00:02<00:00, 1052.98 examples/s] Loading checkpoint shards: 0%| | 0/7 [00:00> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. 
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( [WARNING|logging.py:328] 2026-04-10 17:20:50,019 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [WARNING|logging.py:328] 2026-04-10 17:20:50,019 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 2339/2339 [00:03<00:00, 637.51 examples/s] Loading checkpoint shards: 0%| | 0/7 [00:00> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/7 [00:00> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 539.02it/s] Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. 
You should use `Trainer.processing_class = processing_class` instead. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Loading checkpoint shards: 14%|█▍ | 1/7 [00:01<00:09, 1.65s/it] Loading checkpoint shards: 29%|██▊ | 2/7 [00:03<00:07, 1.53s/it] Loading checkpoint shards: 43%|████▎ | 3/7 [00:04<00:06, 1.54s/it] Loading checkpoint shards: 57%|█████▋ | 4/7 [00:06<00:04, 1.53s/it] Loading checkpoint shards: 71%|███████▏ | 5/7 [00:07<00:03, 1.52s/it] Loading checkpoint shards: 86%|████████▌ | 6/7 [00:09<00:01, 1.50s/it] Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00, 1.27s/it] Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00, 1.42s/it] [INFO|modeling_utils.py:4926] 2026-04-10 17:20:59,246 >> All model checkpoint weights were used when initializing LlamaForCausalLM. [INFO|modeling_utils.py:4934] 2026-04-10 17:20:59,246 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. 
[INFO|configuration_utils.py:1095] 2026-04-10 17:20:59,248 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-10 17:20:59,248 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 } [INFO|configuration_utils.py:691] 2026-04-10 17:20:59,250 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/config.json [INFO|configuration_utils.py:765] 2026-04-10 17:20:59,250 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 } [INFO|modeling_utils.py:1121] 2026-04-10 17:20:59,251 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-10 17:20:59,252 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1142] 2026-04-10 17:20:59,254 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false } Loading checkpoint shards: 0%| | 0/7 [00:00> All model checkpoint weights were used when initializing LlamaForCausalLM. 
[INFO|modeling_utils.py:4934] 2026-04-10 17:21:09,066 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. [INFO|configuration_utils.py:1095] 2026-04-10 17:21:09,069 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-10 17:21:09,070 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 } [WARNING|trainer.py:821] 2026-04-10 17:21:09,071 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:816] 2026-04-10 17:21:09,074 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Tokenizing train (num_proc=12): 0%| | 0/43598 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Saving the dataset (0/2 shards): 0%| | 0/43598 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Tokenizing test (num_proc=12): 0%| | 0/2339 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Saving the dataset (0/1 shards): 0%| | 0/2339 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-10 17:35:28,157 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-10 17:35:28,158 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. 
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:518: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MarginDPOTrainer.__init__`. Use `processing_class` instead. super().__init__( [INFO|trainer.py:748] 2026-04-10 17:35:28,762 >> Using auto half precision backend /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight. warnings.warn( /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaDecoderLayer because mixed precision turned on in FSDP. 
Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight. warnings.warn( /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints. warnings.warn( [INFO|trainer.py:2414] 2026-04-10 17:35:33,058 >> ***** Running training ***** [INFO|trainer.py:2415] 2026-04-10 17:35:33,058 >> Num examples = 43,598 [INFO|trainer.py:2416] 2026-04-10 17:35:33,058 >> Num Epochs = 1 [INFO|trainer.py:2417] 2026-04-10 17:35:33,058 >> Instantaneous batch size per device = 16 [INFO|trainer.py:2420] 2026-04-10 17:35:33,058 >> Total train batch size (w. parallel, distributed & accumulation) = 128 [INFO|trainer.py:2421] 2026-04-10 17:35:33,058 >> Gradient Accumulation steps = 1 [INFO|trainer.py:2422] 2026-04-10 17:35:33,058 >> Total optimization steps = 340 [INFO|trainer.py:2423] 2026-04-10 17:35:33,058 >> Number of trainable parameters = 1,003,782,656 [INFO|integration_utils.py:831] 2026-04-10 17:35:33,059 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: can-not-fand (can-not-fand-northeastern-university). Use `wandb login --relogin` to force relogin wandb: wandb version 0.25.1 is available! To upgrade, please run: wandb: $ pip install wandb --upgrade wandb: Tracking run with wandb version 0.17.5 wandb: Run data is saved locally in /scratch/feng.yulu/dynamic-dpo-v4/wandb/wandb/run-20260410_173535-wep2te2x wandb: Run `wandb offline` to turn off syncing. 
wandb: Syncing run llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009
wandb: ⭐️ View project at https://wandb.ai/can-not-fand-northeastern-university/huggingface
wandb: 🚀 View run at https://wandb.ai/can-not-fand-northeastern-university/huggingface/runs/wep2te2x
0%| | 0/340 [00:00<?, ?it/s]
[WARNING|modeling_utils.py:1713] 2026-04-10 17:35:41,810 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
0%| | 1/340 [00:03<18:07, 3.21s/it]
{'loss': 0.6938, 'grad_norm': 23.717201232910156, 'learning_rate': 0.0, 'margin_dpo/margin_mean': -0.0843656063079834, 'margin_dpo/margin_std': 0.20181308686733246, 'logps/chosen': -72.44038391113281, 'logps/rejected': -70.95858764648438, 'logps/ref_chosen': -72.42105865478516, 'logps/ref_rejected': -71.02362823486328, 'logits/chosen': -0.4739703834056854, 'logits/rejected': -0.44689586758613586, 'epoch': 0.0}
0%| | 1/340 [00:03<18:07,
3.21s/it] 1%| | 2/340 [00:06<16:53, 3.00s/it] 1%| | 3/340 [00:08<16:07, 2.87s/it] 1%| | 4/340 [00:11<15:08, 2.70s/it] 1%|▏ | 5/340 [00:13<14:57, 2.68s/it] {'loss': 0.6943, 'grad_norm': 24.15522003173828, 'learning_rate': 5.88235294117647e-08, 'margin_dpo/margin_mean': -0.0912436842918396, 'margin_dpo/margin_std': 0.36911237239837646, 'logps/chosen': -76.55665588378906, 'logps/rejected': -71.69610595703125, 'logps/ref_chosen': -76.4837875366211, 'logps/ref_rejected': -71.7144775390625, 'logits/chosen': -0.5054930448532104, 'logits/rejected': -0.4999650716781616, 'epoch': 0.01} 1%|▏ | 5/340 [00:13<14:57, 2.68s/it] 2%|▏ | 6/340 [00:16<14:51, 2.67s/it] 2%|▏ | 7/340 [00:19<14:46, 2.66s/it] 2%|▏ | 8/340 [00:21<14:29, 2.62s/it] 3%|▎ | 9/340 [00:24<14:27, 2.62s/it] 3%|▎ | 10/340 [00:26<14:24, 2.62s/it] {'loss': 0.6933, 'grad_norm': 23.068735122680664, 'learning_rate': 1.3235294117647057e-07, 'margin_dpo/margin_mean': 0.0031534195877611637, 'margin_dpo/margin_std': 0.3234597444534302, 'logps/chosen': -76.17481994628906, 'logps/rejected': -73.90404510498047, 'logps/ref_chosen': -76.15269470214844, 'logps/ref_rejected': -73.87877655029297, 'logits/chosen': -0.5124594569206238, 'logits/rejected': -0.49317699670791626, 'epoch': 0.03} 3%|▎ | 10/340 [00:26<14:24, 2.62s/it] 3%|▎ | 11/340 [00:29<14:27, 2.64s/it] 4%|▎ | 12/340 [00:32<14:26, 2.64s/it] 4%|▍ | 13/340 [00:34<14:25, 2.65s/it] 4%|▍ | 14/340 [00:37<14:16, 2.63s/it] 4%|▍ | 15/340 [00:40<14:10, 2.62s/it] {'loss': 0.6898, 'grad_norm': 28.796030044555664, 'learning_rate': 2.0588235294117645e-07, 'margin_dpo/margin_mean': 0.09566803276538849, 'margin_dpo/margin_std': 0.3500857353210449, 'logps/chosen': -67.05145263671875, 'logps/rejected': -73.06277465820312, 'logps/ref_chosen': -67.0902099609375, 'logps/ref_rejected': -73.005859375, 'logits/chosen': -0.5413268208503723, 'logits/rejected': -0.5226410031318665, 'epoch': 0.04} 4%|▍ | 15/340 [00:40<14:10, 2.62s/it] 5%|▍ | 16/340 [00:42<14:14, 2.64s/it] 5%|▌ | 17/340 [00:45<14:03, 
2.61s/it] 5%|▌ | 18/340 [00:47<13:58, 2.60s/it] 6%|▌ | 19/340 [00:50<13:48, 2.58s/it] 6%|▌ | 20/340 [00:53<13:52, 2.60s/it] {'loss': 0.6824, 'grad_norm': 20.94307518005371, 'learning_rate': 2.7941176470588235e-07, 'margin_dpo/margin_mean': 0.19857604801654816, 'margin_dpo/margin_std': 0.378338098526001, 'logps/chosen': -73.87080383300781, 'logps/rejected': -80.62101745605469, 'logps/ref_chosen': -73.9133071899414, 'logps/ref_rejected': -80.46495056152344, 'logits/chosen': -0.5276651382446289, 'logits/rejected': -0.5001177787780762, 'epoch': 0.06} 6%|▌ | 20/340 [00:53<13:52, 2.60s/it] 6%|▌ | 21/340 [00:55<14:02, 2.64s/it] 6%|▋ | 22/340 [00:58<14:01, 2.65s/it] 7%|▋ | 23/340 [01:01<13:57, 2.64s/it] 7%|▋ | 24/340 [01:03<14:09, 2.69s/it] 7%|▋ | 25/340 [01:06<13:57, 2.66s/it] {'loss': 0.6642, 'grad_norm': 24.610126495361328, 'learning_rate': 3.529411764705882e-07, 'margin_dpo/margin_mean': 0.44518008828163147, 'margin_dpo/margin_std': 0.6063351631164551, 'logps/chosen': -60.977256774902344, 'logps/rejected': -74.73905181884766, 'logps/ref_chosen': -61.014869689941406, 'logps/ref_rejected': -74.33148193359375, 'logits/chosen': -0.5061219930648804, 'logits/rejected': -0.5009726285934448, 'epoch': 0.07} 7%|▋ | 25/340 [01:06<13:57, 2.66s/it] 8%|▊ | 26/340 [01:09<13:46, 2.63s/it] 8%|▊ | 27/340 [01:11<13:27, 2.58s/it] 8%|▊ | 28/340 [01:14<13:25, 2.58s/it] 9%|▊ | 29/340 [01:16<13:26, 2.59s/it] 9%|▉ | 30/340 [01:19<13:29, 2.61s/it] {'loss': 0.6294, 'grad_norm': 21.515533447265625, 'learning_rate': 4.264705882352941e-07, 'margin_dpo/margin_mean': 1.5730347633361816, 'margin_dpo/margin_std': 1.7553781270980835, 'logps/chosen': -78.83164978027344, 'logps/rejected': -83.10078430175781, 'logps/ref_chosen': -78.80770111083984, 'logps/ref_rejected': -81.50379943847656, 'logits/chosen': -0.5904145240783691, 'logits/rejected': -0.5685775279998779, 'epoch': 0.09} 9%|▉ | 30/340 [01:19<13:29, 2.61s/it] 9%|▉ | 31/340 [01:22<13:27, 2.61s/it] 9%|▉ | 32/340 [01:24<13:32, 2.64s/it] 10%|▉ | 
33/340 [01:27<13:27, 2.63s/it] 10%|█ | 34/340 [01:29<13:11, 2.59s/it] 10%|█ | 35/340 [01:32<13:14, 2.60s/it] {'loss': 0.6028, 'grad_norm': 19.351747512817383, 'learning_rate': 5e-07, 'margin_dpo/margin_mean': 2.158336877822876, 'margin_dpo/margin_std': 2.8764147758483887, 'logps/chosen': -86.93069458007812, 'logps/rejected': -88.55570220947266, 'logps/ref_chosen': -86.67269134521484, 'logps/ref_rejected': -86.13935852050781, 'logits/chosen': -0.5566071271896362, 'logits/rejected': -0.5428273677825928, 'epoch': 0.1} 10%|█ | 35/340 [01:32<13:14, 2.60s/it] 11%|█ | 36/340 [01:35<13:13, 2.61s/it] 11%|█ | 37/340 [01:37<13:09, 2.61s/it] 11%|█ | 38/340 [01:40<13:05, 2.60s/it] 11%|█▏ | 39/340 [01:42<12:59, 2.59s/it] 12%|█▏ | 40/340 [01:45<13:05, 2.62s/it] {'loss': 0.5446, 'grad_norm': 18.829681396484375, 'learning_rate': 4.996706849759452e-07, 'margin_dpo/margin_mean': 4.941764831542969, 'margin_dpo/margin_std': 8.191742897033691, 'logps/chosen': -71.7585220336914, 'logps/rejected': -91.31529235839844, 'logps/ref_chosen': -69.31690216064453, 'logps/ref_rejected': -83.9319076538086, 'logits/chosen': -0.6493271589279175, 'logits/rejected': -0.6133594512939453, 'epoch': 0.12} 12%|█▏ | 40/340 [01:45<13:05, 2.62s/it] 12%|█▏ | 41/340 [01:48<13:08, 2.64s/it] 12%|█▏ | 42/340 [01:50<13:01, 2.62s/it] 13%|█▎ | 43/340 [01:53<12:56, 2.62s/it] 13%|█▎ | 44/340 [01:55<12:47, 2.59s/it] 13%|█▎ | 45/340 [01:58<12:39, 2.58s/it] {'loss': 0.553, 'grad_norm': 23.498613357543945, 'learning_rate': 4.986836074908615e-07, 'margin_dpo/margin_mean': 5.294968128204346, 'margin_dpo/margin_std': 6.769883632659912, 'logps/chosen': -73.5013427734375, 'logps/rejected': -108.92988586425781, 'logps/ref_chosen': -69.97550964355469, 'logps/ref_rejected': -100.10908508300781, 'logits/chosen': -0.6821354627609253, 'logits/rejected': -0.6494560837745667, 'epoch': 0.13} 13%|█▎ | 45/340 [01:58<12:39, 2.58s/it] 14%|█▎ | 46/340 [02:01<12:46, 2.61s/it] 14%|█▍ | 47/340 [02:03<13:03, 2.67s/it] 14%|█▍ | 48/340 
[02:06<12:55, 2.66s/it] 14%|█▍ | 49/340 [02:09<12:45, 2.63s/it] 15%|█▍ | 50/340 [02:11<12:30, 2.59s/it] {'loss': 0.5518, 'grad_norm': 30.29952621459961, 'learning_rate': 4.970413680203148e-07, 'margin_dpo/margin_mean': 4.282275199890137, 'margin_dpo/margin_std': 7.439302921295166, 'logps/chosen': -78.32559967041016, 'logps/rejected': -95.23252868652344, 'logps/ref_chosen': -72.90187072753906, 'logps/ref_rejected': -85.52653503417969, 'logits/chosen': -0.6595835089683533, 'logits/rejected': -0.6233135461807251, 'epoch': 0.15} 15%|█▍ | 50/340 [02:11<12:30, 2.59s/it] 15%|█▌ | 51/340 [02:14<12:30, 2.60s/it] 15%|█▌ | 52/340 [02:16<12:15, 2.55s/it] 16%|█▌ | 53/340 [02:19<12:12, 2.55s/it] 16%|█▌ | 54/340 [02:21<12:21, 2.59s/it] 16%|█▌ | 55/340 [02:24<12:17, 2.59s/it] {'loss': 0.5112, 'grad_norm': 23.780656814575195, 'learning_rate': 4.947482930773511e-07, 'margin_dpo/margin_mean': 7.125207424163818, 'margin_dpo/margin_std': 9.734245300292969, 'logps/chosen': -91.6336898803711, 'logps/rejected': -109.0378646850586, 'logps/ref_chosen': -87.45826721191406, 'logps/ref_rejected': -97.73722076416016, 'logits/chosen': -0.7151781916618347, 'logits/rejected': -0.6897321939468384, 'epoch': 0.16} 16%|█▌ | 55/340 [02:24<12:17, 2.59s/it] 16%|█▋ | 56/340 [02:27<12:20, 2.61s/it] 17%|█▋ | 57/340 [02:29<12:23, 2.63s/it] 17%|█▋ | 58/340 [02:32<12:20, 2.63s/it] 17%|█▋ | 59/340 [02:35<12:15, 2.62s/it] 18%|█▊ | 60/340 [02:37<12:20, 2.64s/it] {'loss': 0.5286, 'grad_norm': 20.72915267944336, 'learning_rate': 4.918104238142103e-07, 'margin_dpo/margin_mean': 6.065438747406006, 'margin_dpo/margin_std': 10.341069221496582, 'logps/chosen': -110.2301254272461, 'logps/rejected': -99.53703308105469, 'logps/ref_chosen': -106.60343933105469, 'logps/ref_rejected': -89.84490203857422, 'logits/chosen': -0.6631725430488586, 'logits/rejected': -0.6214786767959595, 'epoch': 0.18} 18%|█▊ | 60/340 [02:37<12:20, 2.64s/it] 18%|█▊ | 61/340 [02:40<12:03, 2.59s/it] 18%|█▊ | 62/340 [02:42<12:06, 2.61s/it] 19%|█▊ | 
63/340 [02:45<12:05, 2.62s/it] 19%|█▉ | 64/340 [02:48<12:03, 2.62s/it] 19%|█▉ | 65/340 [02:50<11:53, 2.59s/it] {'loss': 0.4746, 'grad_norm': 16.05661392211914, 'learning_rate': 4.882355001067891e-07, 'margin_dpo/margin_mean': 5.947785377502441, 'margin_dpo/margin_std': 7.2523908615112305, 'logps/chosen': -79.79920959472656, 'logps/rejected': -93.5802001953125, 'logps/ref_chosen': -76.7091064453125, 'logps/ref_rejected': -84.54231262207031, 'logits/chosen': -0.6507592797279358, 'logits/rejected': -0.6253207921981812, 'epoch': 0.19} 19%|█▉ | 65/340 [02:50<11:53, 2.59s/it] 19%|█▉ | 66/340 [02:53<11:43, 2.57s/it] 20%|█▉ | 67/340 [02:55<11:41, 2.57s/it] 20%|██ | 68/340 [02:58<11:40, 2.57s/it] 20%|██ | 69/340 [03:00<11:27, 2.54s/it] 21%|██ | 70/340 [03:03<11:36, 2.58s/it] {'loss': 0.4662, 'grad_norm': 16.453359603881836, 'learning_rate': 4.840329401637809e-07, 'margin_dpo/margin_mean': 8.28502082824707, 'margin_dpo/margin_std': 8.248537063598633, 'logps/chosen': -74.00252532958984, 'logps/rejected': -103.95845031738281, 'logps/ref_chosen': -70.0877914428711, 'logps/ref_rejected': -91.75868225097656, 'logits/chosen': -0.698811411857605, 'logits/rejected': -0.6621960401535034, 'epoch': 0.21} 21%|██ | 70/340 [03:03<11:36, 2.58s/it] 21%|██ | 71/340 [03:06<11:36, 2.59s/it] 21%|██ | 72/340 [03:08<11:49, 2.65s/it] 21%|██▏ | 73/340 [03:11<11:42, 2.63s/it] 22%|██▏ | 74/340 [03:14<11:34, 2.61s/it] 22%|██▏ | 75/340 [03:16<11:29, 2.60s/it] {'loss': 0.4863, 'grad_norm': 17.00535011291504, 'learning_rate': 4.792138157142157e-07, 'margin_dpo/margin_mean': 8.173115730285645, 'margin_dpo/margin_std': 8.817681312561035, 'logps/chosen': -78.68012237548828, 'logps/rejected': -97.5809555053711, 'logps/ref_chosen': -74.91792297363281, 'logps/ref_rejected': -85.64566802978516, 'logits/chosen': -0.6827956438064575, 'logits/rejected': -0.6566829681396484, 'epoch': 0.22} 22%|██▏ | 75/340 [03:16<11:29, 2.60s/it] 22%|██▏ | 76/340 [03:19<11:28, 2.61s/it] 23%|██▎ | 77/340 [03:21<11:27, 2.62s/it] 
23%|██▎ | 78/340 [03:24<11:23, 2.61s/it] 23%|██▎ | 79/340 [03:27<11:19, 2.60s/it] 24%|██▎ | 80/340 [03:29<11:12, 2.59s/it] {'loss': 0.451, 'grad_norm': 21.13958168029785, 'learning_rate': 4.737908228387656e-07, 'margin_dpo/margin_mean': 7.951646327972412, 'margin_dpo/margin_std': 8.248537063598633, 'logps/chosen': -102.5855941772461, 'logps/rejected': -105.6670150756836, 'logps/ref_chosen': -97.75636291503906, 'logps/ref_rejected': -92.88613891601562, 'logits/chosen': -0.7372442483901978, 'logits/rejected': -0.689995288848877, 'epoch': 0.24} 24%|██▎ | 80/340 [03:29<11:12, 2.59s/it] 24%|██▍ | 81/340 [03:32<11:13, 2.60s/it] 24%|██▍ | 82/340 [03:34<11:00, 2.56s/it] 24%|██▍ | 83/340 [03:37<10:48, 2.53s/it] 25%|██▍ | 84/340 [03:39<10:53, 2.55s/it] 25%|██▌ | 85/340 [03:42<10:59, 2.59s/it] {'loss': 0.4569, 'grad_norm': 18.165218353271484, 'learning_rate': 4.6777824852166437e-07, 'margin_dpo/margin_mean': 7.221736907958984, 'margin_dpo/margin_std': 8.439001083374023, 'logps/chosen': -85.70280456542969, 'logps/rejected': -101.9955825805664, 'logps/ref_chosen': -78.9326171875, 'logps/ref_rejected': -88.00363159179688, 'logits/chosen': -0.6671745777130127, 'logits/rejected': -0.6385531425476074, 'epoch': 0.25} 25%|██▌ | 85/340 [03:42<10:59, 2.59s/it] 25%|██▌ | 86/340 [03:45<10:57, 2.59s/it] 26%|██▌ | 87/340 [03:47<10:51, 2.57s/it] 26%|██▌ | 88/340 [03:50<10:45, 2.56s/it] 26%|██▌ | 89/340 [03:52<10:47, 2.58s/it] 26%|██▋ | 90/340 [03:55<10:44, 2.58s/it] {'loss': 0.4419, 'grad_norm': 20.739215850830078, 'learning_rate': 4.611919330113591e-07, 'margin_dpo/margin_mean': 9.419827461242676, 'margin_dpo/margin_std': 9.238184928894043, 'logps/chosen': -84.86643981933594, 'logps/rejected': -105.78071594238281, 'logps/ref_chosen': -78.78388214111328, 'logps/ref_rejected': -90.2783203125, 'logits/chosen': -0.6510001420974731, 'logits/rejected': -0.629525899887085, 'epoch': 0.26} 26%|██▋ | 90/340 [03:55<10:44, 2.58s/it] 27%|██▋ | 91/340 [03:57<10:44, 2.59s/it] 27%|██▋ | 92/340 
[04:00<10:53, 2.63s/it] 27%|██▋ | 93/340 [04:03<10:48, 2.62s/it] 28%|██▊ | 94/340 [04:05<10:40, 2.60s/it] 28%|██▊ | 95/340 [04:08<10:39, 2.61s/it] {'loss': 0.4514, 'grad_norm': 17.511486053466797, 'learning_rate': 4.5404922808905543e-07, 'margin_dpo/margin_mean': 7.360299587249756, 'margin_dpo/margin_std': 11.319549560546875, 'logps/chosen': -74.32402038574219, 'logps/rejected': -78.22425842285156, 'logps/ref_chosen': -65.91403198242188, 'logps/ref_rejected': -62.45396041870117, 'logits/chosen': -0.6517031788825989, 'logits/rejected': -0.6104840040206909, 'epoch': 0.28} 28%|██▊ | 95/340 [04:08<10:39, 2.61s/it] 28%|██▊ | 96/340 [04:11<10:49, 2.66s/it] 29%|██▊ | 97/340 [04:13<10:33, 2.61s/it] 29%|██▉ | 98/340 [04:16<10:33, 2.62s/it] 29%|██▉ | 99/340 [04:18<10:26, 2.60s/it] 29%|██▉ | 100/340 [04:21<10:27, 2.61s/it] {'loss': 0.4265, 'grad_norm': 18.769145965576172, 'learning_rate': 4.4636895135509966e-07, 'margin_dpo/margin_mean': 9.642545700073242, 'margin_dpo/margin_std': 11.237717628479004, 'logps/chosen': -84.81422424316406, 'logps/rejected': -110.46153259277344, 'logps/ref_chosen': -77.24075317382812, 'logps/ref_rejected': -93.24552917480469, 'logits/chosen': -0.6338332295417786, 'logits/rejected': -0.6123248338699341, 'epoch': 0.29} 29%|██▉ | 100/340 [04:21<10:27, 2.61s/it][INFO|trainer.py:4307] 2026-04-10 17:40:00,228 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-10 17:40:00,228 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-10 17:40:00,228 >> Batch size = 16 0%| | 0/18 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-10 17:44:36,788 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-10 17:44:36,788 >> Batch size = 16 0%| | 0/18 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200 [INFO|configuration_utils.py:419] 2026-04-10 17:45:12,605 >> Configuration saved in 
/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/config.json [INFO|configuration_utils.py:911] 2026-04-10 17:45:12,610 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-10 17:45:56,024 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-10 17:45:56,031 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-10 17:45:56,034 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-200/special_tokens_map.json 59%|█████▉ | 201/340 [13:21<3:07:16, 80.84s/it] 59%|█████▉ | 202/340 [13:24<2:11:48, 57.31s/it] 60%|█████▉ | 203/340 [13:26<1:33:12, 40.82s/it] 60%|██████ | 204/340 [13:29<1:06:30, 29.34s/it] 60%|██████ | 205/340 [13:31<47:58, 21.32s/it] {'loss': 0.3792, 'grad_norm': 19.535417556762695, 'learning_rate': 2.065879555832674e-07, 'margin_dpo/margin_mean': 13.387273788452148, 'margin_dpo/margin_std': 14.807754516601562, 'logps/chosen': -104.2248764038086, 'logps/rejected': -104.95343017578125, 'logps/ref_chosen': -84.44103240966797, 'logps/ref_rejected': -71.78230285644531, 'logits/chosen': -0.5760528445243835, 'logits/rejected': -0.5279114842414856, 'epoch': 0.6} 60%|██████ | 205/340 [13:31<47:58, 21.32s/it] 61%|██████ | 206/340 [13:33<34:50, 15.60s/it] 61%|██████ | 
207/340 [13:36<25:57, 11.71s/it] 61%|██████ | 208/340 [13:39<19:46, 8.99s/it] 61%|██████▏ | 209/340 [13:41<15:25, 7.06s/it] 62%|██████▏ | 210/340 [13:44<12:19, 5.69s/it] {'loss': 0.3251, 'grad_norm': 17.17575454711914, 'learning_rate': 1.9401235374032425e-07, 'margin_dpo/margin_mean': 14.702362060546875, 'margin_dpo/margin_std': 16.377933502197266, 'logps/chosen': -101.36656188964844, 'logps/rejected': -108.5728988647461, 'logps/ref_chosen': -83.94493103027344, 'logps/ref_rejected': -76.44892120361328, 'logits/chosen': -0.6245664358139038, 'logits/rejected': -0.5699684619903564, 'epoch': 0.62} 62%|██████▏ | 210/340 [13:44<12:19, 5.69s/it] 62%|██████▏ | 211/340 [13:46<10:16, 4.78s/it] 62%|██████▏ | 212/340 [13:49<08:48, 4.13s/it] 63%|██████▎ | 213/340 [13:52<07:47, 3.68s/it] 63%|██████▎ | 214/340 [13:54<07:03, 3.36s/it] 63%|██████▎ | 215/340 [13:57<06:33, 3.15s/it] {'loss': 0.3633, 'grad_norm': 20.044084548950195, 'learning_rate': 1.8158425248197928e-07, 'margin_dpo/margin_mean': 16.278963088989258, 'margin_dpo/margin_std': 19.206457138061523, 'logps/chosen': -102.8707275390625, 'logps/rejected': -122.053955078125, 'logps/ref_chosen': -82.23881530761719, 'logps/ref_rejected': -85.1430892944336, 'logits/chosen': -0.5605936050415039, 'logits/rejected': -0.5190353393554688, 'epoch': 0.63} 63%|██████▎ | 215/340 [13:57<06:33, 3.15s/it] 64%|██████▎ | 216/340 [14:00<06:11, 2.99s/it] 64%|██████▍ | 217/340 [14:02<05:50, 2.85s/it] 64%|██████▍ | 218/340 [14:05<05:45, 2.83s/it] 64%|██████▍ | 219/340 [14:07<05:34, 2.77s/it] 65%|██████▍ | 220/340 [14:10<05:24, 2.71s/it] {'loss': 0.3587, 'grad_norm': 21.036956787109375, 'learning_rate': 1.6933639389195134e-07, 'margin_dpo/margin_mean': 11.612079620361328, 'margin_dpo/margin_std': 14.565820693969727, 'logps/chosen': -97.38944244384766, 'logps/rejected': -117.23432922363281, 'logps/ref_chosen': -76.5594482421875, 'logps/ref_rejected': -84.79225158691406, 'logits/chosen': -0.621160626411438, 'logits/rejected': -0.585429310798645, 
'epoch': 0.65} 65%|██████▍ | 220/340 [14:10<05:24, 2.71s/it] 65%|██████▌ | 221/340 [14:13<05:21, 2.70s/it] 65%|██████▌ | 222/340 [14:15<05:17, 2.69s/it] 66%|██████▌ | 223/340 [14:18<05:11, 2.66s/it] 66%|██████▌ | 224/340 [14:20<05:01, 2.60s/it] 66%|██████▌ | 225/340 [14:23<04:55, 2.57s/it] {'loss': 0.3385, 'grad_norm': 21.023571014404297, 'learning_rate': 1.573010452010098e-07, 'margin_dpo/margin_mean': 18.626880645751953, 'margin_dpo/margin_std': 18.950374603271484, 'logps/chosen': -87.20682525634766, 'logps/rejected': -132.78231811523438, 'logps/ref_chosen': -68.70957946777344, 'logps/ref_rejected': -95.65819549560547, 'logits/chosen': -0.6097210049629211, 'logits/rejected': -0.6041680574417114, 'epoch': 0.66} 66%|██████▌ | 225/340 [14:23<04:55, 2.57s/it] 66%|██████▋ | 226/340 [14:26<04:54, 2.58s/it] 67%|██████▋ | 227/340 [14:28<04:50, 2.57s/it] 67%|██████▋ | 228/340 [14:31<04:49, 2.58s/it] 67%|██████▋ | 229/340 [14:33<04:48, 2.60s/it] 68%|██████▊ | 230/340 [14:36<04:46, 2.61s/it] {'loss': 0.3269, 'grad_norm': 19.34729766845703, 'learning_rate': 1.4550991377830423e-07, 'margin_dpo/margin_mean': 14.579324722290039, 'margin_dpo/margin_std': 14.860456466674805, 'logps/chosen': -92.71955871582031, 'logps/rejected': -129.41712951660156, 'logps/ref_chosen': -76.04148864746094, 'logps/ref_rejected': -98.15973663330078, 'logits/chosen': -0.6367233395576477, 'logits/rejected': -0.5984948873519897, 'epoch': 0.68} 68%|██████▊ | 230/340 [14:36<04:46, 2.61s/it] 68%|██████▊ | 231/340 [14:39<04:43, 2.60s/it] 68%|██████▊ | 232/340 [14:41<04:38, 2.58s/it] 69%|██████▊ | 233/340 [14:44<04:36, 2.58s/it] 69%|██████▉ | 234/340 [14:46<04:35, 2.59s/it] 69%|██████▉ | 235/340 [14:49<04:34, 2.61s/it] {'loss': 0.3347, 'grad_norm': 18.263099670410156, 'learning_rate': 1.339940635976592e-07, 'margin_dpo/margin_mean': 19.314985275268555, 'margin_dpo/margin_std': 15.413273811340332, 'logps/chosen': -88.53390502929688, 'logps/rejected': -127.80912780761719, 'logps/ref_chosen': 
-70.64253997802734, 'logps/ref_rejected': -90.60277557373047, 'logits/chosen': -0.6155376434326172, 'logits/rejected': -0.5955866575241089, 'epoch': 0.69} 69%|██████▉ | 235/340 [14:49<04:34, 2.61s/it] 69%|██████▉ | 236/340 [14:52<04:33, 2.63s/it] 70%|██████▉ | 237/340 [14:54<04:31, 2.63s/it] 70%|███████ | 238/340 [14:57<04:27, 2.62s/it] 70%|███████ | 239/340 [14:59<04:24, 2.62s/it] 71%|███████ | 240/340 [15:02<04:17, 2.57s/it] {'loss': 0.3433, 'grad_norm': 21.18890380859375, 'learning_rate': 1.227838333989088e-07, 'margin_dpo/margin_mean': 17.56354331970215, 'margin_dpo/margin_std': 16.671550750732422, 'logps/chosen': -94.69210052490234, 'logps/rejected': -106.57359313964844, 'logps/ref_chosen': -75.90282440185547, 'logps/ref_rejected': -70.22077178955078, 'logits/chosen': -0.5532498955726624, 'logits/rejected': -0.5167180299758911, 'epoch': 0.71} 71%|███████ | 240/340 [15:02<04:17, 2.57s/it] 71%|███████ | 241/340 [15:05<04:15, 2.58s/it] 71%|███████ | 242/340 [15:07<04:07, 2.53s/it] 71%|███████▏ | 243/340 [15:10<04:08, 2.57s/it] 72%|███████▏ | 244/340 [15:12<04:06, 2.56s/it] 72%|███████▏ | 245/340 [15:15<04:05, 2.59s/it] {'loss': 0.3073, 'grad_norm': 19.42283058166504, 'learning_rate': 1.1190875675987355e-07, 'margin_dpo/margin_mean': 21.223926544189453, 'margin_dpo/margin_std': 16.53793716430664, 'logps/chosen': -87.87870788574219, 'logps/rejected': -142.7686767578125, 'logps/ref_chosen': -68.88108825683594, 'logps/ref_rejected': -102.547119140625, 'logits/chosen': -0.5711519122123718, 'logits/rejected': -0.5506427884101868, 'epoch': 0.72} 72%|███████▏ | 245/340 [15:15<04:05, 2.59s/it] 72%|███████▏ | 246/340 [15:17<04:03, 2.59s/it] 73%|███████▎ | 247/340 [15:20<04:03, 2.62s/it] 73%|███████▎ | 248/340 [15:23<03:59, 2.60s/it] 73%|███████▎ | 249/340 [15:25<03:56, 2.60s/it] 74%|███████▎ | 250/340 [15:28<03:52, 2.59s/it] {'loss': 0.4138, 'grad_norm': 21.975610733032227, 'learning_rate': 1.0139748428955333e-07, 'margin_dpo/margin_mean': 16.201473236083984, 
'margin_dpo/margin_std': 15.055798530578613, 'logps/chosen': -104.53717041015625, 'logps/rejected': -118.47982025146484, 'logps/ref_chosen': -88.11860656738281, 'logps/ref_rejected': -85.85978698730469, 'logits/chosen': -0.63815838098526, 'logits/rejected': -0.5797184705734253, 'epoch': 0.74} 74%|███████▎ | 250/340 [15:28<03:52, 2.59s/it] 74%|███████▍ | 251/340 [15:30<03:50, 2.59s/it] 74%|███████▍ | 252/340 [15:33<03:49, 2.61s/it] 74%|███████▍ | 253/340 [15:36<03:45, 2.59s/it] 75%|███████▍ | 254/340 [15:38<03:49, 2.67s/it] 75%|███████▌ | 255/340 [15:41<03:46, 2.66s/it] {'loss': 0.3314, 'grad_norm': 21.86973762512207, 'learning_rate': 9.127770814751932e-08, 'margin_dpo/margin_mean': 16.87302017211914, 'margin_dpo/margin_std': 16.191524505615234, 'logps/chosen': -113.81512451171875, 'logps/rejected': -123.86918640136719, 'logps/ref_chosen': -93.02457427978516, 'logps/ref_rejected': -86.20562744140625, 'logits/chosen': -0.5965814590454102, 'logits/rejected': -0.5407648682594299, 'epoch': 0.75} 75%|███████▌ | 255/340 [15:41<03:46, 2.66s/it] 75%|███████▌ | 256/340 [15:44<03:43, 2.66s/it] 76%|███████▌ | 257/340 [15:46<03:42, 2.69s/it] 76%|███████▌ | 258/340 [15:49<03:39, 2.68s/it] 76%|███████▌ | 259/340 [15:52<03:34, 2.65s/it] 76%|███████▋ | 260/340 [15:54<03:32, 2.65s/it] {'loss': 0.3414, 'grad_norm': 20.748577117919922, 'learning_rate': 8.15760890883607e-08, 'margin_dpo/margin_mean': 20.42922592163086, 'margin_dpo/margin_std': 16.98196029663086, 'logps/chosen': -98.30900573730469, 'logps/rejected': -133.5509796142578, 'logps/ref_chosen': -79.27108001708984, 'logps/ref_rejected': -94.08381652832031, 'logits/chosen': -0.5860427618026733, 'logits/rejected': -0.5433794856071472, 'epoch': 0.76} 76%|███████▋ | 260/340 [15:54<03:32, 2.65s/it] 77%|███████▋ | 261/340 [15:57<03:26, 2.62s/it] 77%|███████▋ | 262/340 [16:00<03:25, 2.63s/it] 77%|███████▋ | 263/340 [16:02<03:21, 2.62s/it] 78%|███████▊ | 264/340 [16:05<03:19, 2.63s/it] 78%|███████▊ | 265/340 [16:07<03:16, 2.62s/it] 
{'loss': 0.3493, 'grad_norm': 20.377286911010742, 'learning_rate': 7.231818622338822e-08, 'margin_dpo/margin_mean': 15.021594047546387, 'margin_dpo/margin_std': 12.837465286254883, 'logps/chosen': -99.11347198486328, 'logps/rejected': -126.92435455322266, 'logps/ref_chosen': -79.24869537353516, 'logps/ref_rejected': -92.03797912597656, 'logits/chosen': -0.5678300857543945, 'logits/rejected': -0.5425071120262146, 'epoch': 0.78} 78%|███████▊ | 265/340 [16:07<03:16, 2.62s/it] 78%|███████▊ | 266/340 [16:10<03:15, 2.64s/it] 79%|███████▊ | 267/340 [16:13<03:11, 2.62s/it] 79%|███████▉ | 268/340 [16:15<03:10, 2.65s/it] 79%|███████▉ | 269/340 [16:18<03:05, 2.61s/it] 79%|███████▉ | 270/340 [16:21<03:03, 2.63s/it] {'loss': 0.332, 'grad_norm': 17.822444915771484, 'learning_rate': 6.352838968463919e-08, 'margin_dpo/margin_mean': 16.91426658630371, 'margin_dpo/margin_std': 14.53496265411377, 'logps/chosen': -97.48078918457031, 'logps/rejected': -116.37190246582031, 'logps/ref_chosen': -80.15914154052734, 'logps/ref_rejected': -82.13599395751953, 'logits/chosen': -0.606745719909668, 'logits/rejected': -0.5473134517669678, 'epoch': 0.79} 79%|███████▉ | 270/340 [16:21<03:03, 2.63s/it] 80%|███████▉ | 271/340 [16:23<03:00, 2.62s/it] 80%|████████ | 272/340 [16:26<02:57, 2.60s/it] 80%|████████ | 273/340 [16:28<02:54, 2.60s/it] 81%|████████ | 274/340 [16:31<02:51, 2.61s/it] 81%|████████ | 275/340 [16:34<02:49, 2.61s/it] {'loss': 0.3348, 'grad_norm': 20.570648193359375, 'learning_rate': 5.5229856368582376e-08, 'margin_dpo/margin_mean': 16.90357780456543, 'margin_dpo/margin_std': 20.21615219116211, 'logps/chosen': -99.41848754882812, 'logps/rejected': -122.4229965209961, 'logps/ref_chosen': -78.87225341796875, 'logps/ref_rejected': -84.97318267822266, 'logits/chosen': -0.6010477542877197, 'logits/rejected': -0.5661951899528503, 'epoch': 0.81} 81%|████████ | 275/340 [16:34<02:49, 2.61s/it] 81%|████████ | 276/340 [16:36<02:43, 2.55s/it] 81%|████████▏ | 277/340 [16:39<02:41, 2.57s/it] 
82%|████████▏ | 278/340 [16:41<02:39, 2.57s/it] 82%|████████▏ | 279/340 [16:44<02:35, 2.55s/it] 82%|████████▏ | 280/340 [16:46<02:35, 2.59s/it] {'loss': 0.3329, 'grad_norm': 18.737754821777344, 'learning_rate': 4.7444448928806615e-08, 'margin_dpo/margin_mean': 20.195457458496094, 'margin_dpo/margin_std': 19.39859390258789, 'logps/chosen': -117.15876770019531, 'logps/rejected': -154.00479125976562, 'logps/ref_chosen': -96.47113800048828, 'logps/ref_rejected': -113.1217041015625, 'logits/chosen': -0.5662145018577576, 'logits/rejected': -0.525722324848175, 'epoch': 0.82} 82%|████████▏ | 280/340 [16:46<02:35, 2.59s/it] 83%|████████▎ | 281/340 [16:49<02:35, 2.64s/it] 83%|████████▎ | 282/340 [16:52<02:32, 2.63s/it] 83%|████████▎ | 283/340 [16:54<02:30, 2.64s/it] 84%|████████▎ | 284/340 [16:57<02:26, 2.61s/it] 84%|████████▍ | 285/340 [16:59<02:22, 2.58s/it] {'loss': 0.3382, 'grad_norm': 21.463726043701172, 'learning_rate': 4.019267817841834e-08, 'margin_dpo/margin_mean': 17.379127502441406, 'margin_dpo/margin_std': 17.829914093017578, 'logps/chosen': -111.90663146972656, 'logps/rejected': -114.01655578613281, 'logps/ref_chosen': -91.53522491455078, 'logps/ref_rejected': -76.2660140991211, 'logits/chosen': -0.630197286605835, 'logits/rejected': -0.5674210786819458, 'epoch': 0.84} 84%|████████▍ | 285/340 [16:59<02:22, 2.58s/it] 84%|████████▍ | 286/340 [17:02<02:19, 2.58s/it] 84%|████████▍ | 287/340 [17:05<02:16, 2.57s/it] 85%|████████▍ | 288/340 [17:07<02:16, 2.63s/it] 85%|████████▌ | 289/340 [17:10<02:13, 2.61s/it] 85%|████████▌ | 290/340 [17:13<02:11, 2.63s/it] {'loss': 0.3409, 'grad_norm': 18.62375831604004, 'learning_rate': 3.349364905389032e-08, 'margin_dpo/margin_mean': 18.841894149780273, 'margin_dpo/margin_std': 18.295745849609375, 'logps/chosen': -98.92496490478516, 'logps/rejected': -117.43675231933594, 'logps/ref_chosen': -78.96186828613281, 'logps/ref_rejected': -78.63177490234375, 'logits/chosen': -0.5863774418830872, 'logits/rejected': -0.5456980466842651, 
'epoch': 0.85}
85%|████████▌ | 290/340 [17:13<02:11, 2.63s/it]
86%|████████▌ | 291/340 [17:15<02:08, 2.62s/it]
86%|████████▌ | 292/340 [17:18<02:05, 2.61s/it]
86%|████████▌ | 293/340 [17:20<02:02, 2.61s/it]
86%|████████▋ | 294/340 [17:23<01:58, 2.57s/it]
87%|████████▋ | 295/340 [17:25<01:55, 2.57s/it]
{'loss': 0.3351, 'grad_norm': 16.586910247802734, 'learning_rate': 2.736501028272095e-08, 'margin_dpo/margin_mean': 15.721613883972168, 'margin_dpo/margin_std': 16.5610294342041, 'logps/chosen': -85.10719299316406, 'logps/rejected': -135.39389038085938, 'logps/ref_chosen': -64.14302825927734, 'logps/ref_rejected': -98.70811462402344, 'logits/chosen': -0.5259509086608887, 'logits/rejected': -0.5359938144683838, 'epoch': 0.87}
87%|████████▋ | 295/340 [17:25<01:55, 2.57s/it]
87%|████████▋ | 296/340 [17:28<01:53, 2.57s/it]
87%|████████▋ | 297/340 [17:31<01:50, 2.58s/it]
88%|████████▊ | 298/340 [17:33<01:48, 2.59s/it]
88%|████████▊ | 299/340 [17:36<01:45, 2.57s/it]
88%|████████▊ | 300/340 [17:38<01:43, 2.59s/it]
{'loss': 0.3552, 'grad_norm': 19.39561653137207, 'learning_rate': 2.1822907887504932e-08, 'margin_dpo/margin_mean': 18.2686824798584, 'margin_dpo/margin_std': 16.341278076171875, 'logps/chosen': -80.19596099853516, 'logps/rejected': -130.80763244628906, 'logps/ref_chosen': -59.2784423828125, 'logps/ref_rejected': -91.62141418457031, 'logits/chosen': -0.5196036696434021, 'logits/rejected': -0.5250274538993835, 'epoch': 0.88}
88%|████████▊ | 300/340 [17:38<01:43, 2.59s/it]
[INFO|trainer.py:4307] 2026-04-10 17:53:17,548 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-10 17:53:17,548 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-10 17:53:17,548 >> Batch size = 16
0%| | 0/18 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340
[INFO|configuration_utils.py:419] 2026-04-10 17:55:36,227 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/config.json
[INFO|configuration_utils.py:911] 2026-04-10 17:55:36,231 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-10 17:56:15,466 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-10 17:56:15,471 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-10 17:56:15,474 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/checkpoint-340/special_tokens_map.json
[INFO|trainer.py:2681] 2026-04-10 17:59:29,929 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1436.8705, 'train_samples_per_second': 30.342, 'train_steps_per_second': 0.237, 'train_loss': 0.4133688477908864, 'epoch': 1.0}
100%|██████████| 340/340 [23:51<00:00, 2.52s/it]
100%|██████████| 340/340 [23:51<00:00, 4.21s/it]
***** train metrics *****
  epoch                    = 1.0
  total_flos               = 0GF
  train_loss               = 0.4134
  train_runtime            = 0:23:56.87
  train_samples            = 43598
  train_samples_per_second = 30.342
  train_steps_per_second   = 0.237
2026-04-10 17:59:29 - INFO - __main__ - *** Training complete ***
2026-04-10 17:59:29 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-10 17:59:47,763 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/config.json
[INFO|configuration_utils.py:911] 2026-04-10 17:59:47,771 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-10 18:00:39,415 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-10 18:00:39,450 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-10 18:00:39,459 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/special_tokens_map.json
2026-04-10 18:00:39 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009
[INFO|modelcard.py:450] 2026-04-10 18:00:39,763 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}}
[INFO|configuration_utils.py:419] 2026-04-10 18:00:39,776 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-8xh200-20260410-172009/config.json
2026-04-10 18:00:39 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-10 18:00:39,777 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-10 18:00:39,777 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-10 18:00:39,777 >> Batch size = 16
0%| | 0/18 [00:00