2026-04-21 22:33:52 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8') 2026-04-21 22:33:52 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/root/dynamic-dpo-v4/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, preprocessing_log_samples=0, preprocessing_log_dir=None) 2026-04-21 22:33:52 - INFO - __main__ - Training/evaluation parameters NewDPOConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, beta=0.1, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_num_proc=12, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_dropout=True, disable_tqdm=False, do_eval=True, do_predict=False, 
do_train=False, eta=0.1, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=200, eval_strategy=IntervalStrategy.STEPS, eval_use_gather_object=False, f_alpha_divergence_coef=1.0, f_divergence_type=reverse_kl, force_use_ref_model=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generate_during_eval=False, gradient_accumulation_steps=2, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_margin_dataset_id=None, hub_model_id=jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun, hub_model_revision=main, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, is_encoder_decoder=None, jit_mode_eval=False, label_names=None, label_pad_token_id=-100, label_smoothing=0.0, label_smoothing_factor=0.0, learning_rate=5e-07, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=outputs/llama3-8b-base-new-method-s_star0.6/runs/Apr21_22-33-51_f6a54ae9d6f6, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=5, logging_strategy=IntervalStrategy.STEPS, loss_type=sigmoid, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, margin_dataset_private=None, margin_dataset_split=train, margin_log_path=/root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun/margin_logs, margin_log_steps=1, 
margin_save_full=True, max_grad_norm=1.0, max_length=512, max_prompt_length=256, max_steps=-1, max_target_length=None, metric_for_best_model=None, model_adapter_name=None, model_init_kwargs=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, non_finite_logits_handling=error, num_train_epochs=1, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, output_dir=/root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun, overwrite_output_dir=False, padding_value=None, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, post_tokenization_log_dir=None, post_tokenization_log_samples=0, precompute_ref_batch_size=None, precompute_ref_eval_batch_size=None, precompute_ref_log_probs=False, prediction_loss_only=False, push_margin_dataset=True, push_to_hub=True, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, q_target=0.45, ray_scope=last, ref_adapter_name=None, ref_model_init_kwargs=None, ref_model_mixup_alpha=0.9, ref_model_sync_steps=64, reference_free=False, remove_unused_columns=False, report_to=['wandb'], require_explicit_ref_model=True, restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, reuse_tokenized_dataset=True, rpo_alpha=None, run_name=llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun, s_star=0.6, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=50, save_strategy=SaveStrategy.NO, save_total_limit=2, seed=42, sft_weight=0.0, skip_memory_metrics=True, sync_ref_model=False, tf32=None, tokenization_batch_size=128, tokenization_mode=online, tokenized_dataset_cache_dir=/root/dynamic-dpo-v4/tokenized_preferences, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, trainer_type=new_dpo, truncation_mode=keep_end, 
use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, wandb_project=llama3-8b-base-new-method-hh-beta-0.1, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, ) 2026-04-21 22:33:52 - INFO - __main__ - Using W&B project from training args: llama3-8b-base-new-method-hh-beta-0.1 2026-04-21 22:33:52 - INFO - __main__ - New-DPO parameters: beta=0.1, q_target=0.45, s_star=0.6, eta=0.1 2026-04-21 22:33:52 - INFO - __main__ - Using persistent HF datasets cache at /root/dynamic-dpo-v4/hf/datasets 2026-04-21 22:33:56 - WARNING - __main__ - Dropped 237 non-canonical HH preference examples from split `train` before normalization (126 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 111 x HH chosen/rejected transcripts must each contain a divergent assistant response.). Normalizing raw HH preferences (train): 0%| | 0/43598 [00:00> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2026-04-21 22:34:01,704 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:2058] 2026-04-21 22:34:01,704 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2058] 2026-04-21 22:34:01,704 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2058] 2026-04-21 22:34:01,704 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2058] 2026-04-21 22:34:01,704 >> loading file chat_template.jinja Normalizing raw HH preferences (test): 0%| | 0/2339 [00:00> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
2026-04-21 22:34:02 - INFO - __main__ - Processed train sample 41905: Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|> What is the best way to clean my refrigerator?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|eot_id|><|start_header_id|>user<|end_header_id|> What are your ideas?<|eot_id|> Chosen: <|start_header_id|>assistant<|end_header_id|> Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|eot_id|> Rejected: <|start_header_id|>assistant<|end_header_id|> I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap. I was also thinking that you could try using different tools, like a toothbrush, to clean the inside. Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel. I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect.<|eot_id|> /root/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. 
warnings.warn( [INFO|configuration_utils.py:691] 2026-04-21 22:34:02,347 >> loading configuration file /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200/config.json [INFO|configuration_utils.py:765] 2026-04-21 22:34:02,348 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 } [INFO|modeling_utils.py:1121] 2026-04-21 22:34:02,356 >> loading weights file /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-21 22:34:02,356 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [WARNING|logging.py:328] 2026-04-21 22:34:02,358 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [INFO|configuration_utils.py:1142] 2026-04-21 22:34:02,360 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false } Loading checkpoint shards: 0%| | 0/7 [00:00> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [WARNING|logging.py:328] 2026-04-21 22:34:02,674 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. 
Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [WARNING|logging.py:328] 2026-04-21 22:34:02,682 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:821] 2026-04-21 22:34:02,761 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Loading checkpoint shards: 14%|█▍ | 1/7 [00:01<00:10, 1.70s/it] Loading checkpoint shards: 29%|██▊ | 2/7 [00:03<00:08, 1.73s/it] Loading checkpoint shards: 43%|████▎ | 3/7 [00:05<00:06, 1.73s/it] Loading checkpoint shards: 57%|█████▋ | 4/7 [00:06<00:05, 1.74s/it] Loading checkpoint shards: 71%|███████▏ | 5/7 [00:08<00:03, 1.70s/it] Loading checkpoint shards: 86%|████████▌ | 6/7 [00:10<00:01, 1.70s/it] Loading checkpoint shards: 100%|██████████| 7/7 [00:11<00:00, 1.43s/it] Loading checkpoint shards: 100%|██████████| 7/7 [00:11<00:00, 1.59s/it] [INFO|modeling_utils.py:4926] 2026-04-21 22:34:13,505 >> All model checkpoint weights were used when initializing LlamaForCausalLM. [INFO|modeling_utils.py:4934] 2026-04-21 22:34:13,505 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. 
[INFO|configuration_utils.py:1095] 2026-04-21 22:34:13,507 >> loading configuration file /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-21 22:34:13,507 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 } [INFO|configuration_utils.py:691] 2026-04-21 22:34:13,508 >> loading configuration file /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200/config.json [INFO|configuration_utils.py:765] 2026-04-21 22:34:13,509 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 } [INFO|modeling_utils.py:1121] 2026-04-21 22:34:13,509 >> loading weights file /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-21 22:34:13,510 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1142] 2026-04-21 22:34:13,512 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false } Loading checkpoint shards: 0%| | 0/7 [00:00> All model checkpoint weights were used when initializing LlamaForCausalLM. 
[INFO|modeling_utils.py:4934] 2026-04-21 22:34:24,489 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. [INFO|configuration_utils.py:1095] 2026-04-21 22:34:24,491 >> loading configuration file /root/dynamic-dpo-v4/sft-checkpoints/llama-3-8b-base-sft-hh-helpful-4xh200/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-21 22:34:24,491 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 } [WARNING|trainer.py:821] 2026-04-21 22:34:24,493 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:816] 2026-04-21 22:34:24,493 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-21 22:34:24,502 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-21 22:34:24,503 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-21 22:34:24,507 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. /root/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:518: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `NewDPOTrainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|trainer.py:816] 2026-04-21 22:34:25,971 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-21 22:34:25,971 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. 
/root/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:518: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `NewDPOTrainer.__init__`. 
Use `processing_class` instead. super().__init__( [INFO|trainer.py:748] 2026-04-21 22:34:26,406 >> Using auto half precision backend /root/dynamic-dpo-v4/.venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight. warnings.warn( /root/dynamic-dpo-v4/.venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaDecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight. warnings.warn( /root/dynamic-dpo-v4/.venv/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints. warnings.warn( [INFO|trainer.py:2414] 2026-04-21 22:34:36,013 >> ***** Running training ***** [INFO|trainer.py:2415] 2026-04-21 22:34:36,013 >> Num examples = 43,598 [INFO|trainer.py:2416] 2026-04-21 22:34:36,013 >> Num Epochs = 1 [INFO|trainer.py:2417] 2026-04-21 22:34:36,013 >> Instantaneous batch size per device = 8 [INFO|trainer.py:2420] 2026-04-21 22:34:36,013 >> Total train batch size (w. parallel, distributed & accumulation) = 64 [INFO|trainer.py:2421] 2026-04-21 22:34:36,013 >> Gradient Accumulation steps = 2 [INFO|trainer.py:2422] 2026-04-21 22:34:36,013 >> Total optimization steps = 681 [INFO|trainer.py:2423] 2026-04-21 22:34:36,014 >> Number of trainable parameters = 2,007,565,312 [INFO|integration_utils.py:831] 2026-04-21 22:34:36,015 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: feng-cheng (feng-cheng-northeastern-university). 
Use `wandb login --relogin` to force relogin wandb: - Waiting for wandb.init()... wandb: \ Waiting for wandb.init()... wandb: wandb version 0.26.0 is available! To upgrade, please run: wandb: $ pip install wandb --upgrade wandb: Tracking run with wandb version 0.17.5 wandb: Run data is saved locally in /root/dynamic-dpo-v4/wandb/wandb/run-20260421_223437-cejacemh wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun wandb: ⭐️ View project at https://wandb.ai/feng-cheng-northeastern-university/llama3-8b-base-new-method-hh-beta-0.1 wandb: 🚀 View run at https://wandb.ai/feng-cheng-northeastern-university/llama3-8b-base-new-method-hh-beta-0.1/runs/cejacemh 0%| | 0/681 [00:00> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-21 22:34:40,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-21 22:34:40,426 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-21 22:34:40,435 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%| | 1/681 [00:02<30:20, 2.68s/it] {'loss': 1.389, 'grad_norm': 83.50728607177734, 'learning_rate': 0.0, 'fcm_dpo/beta': 0.10000000149011612, 'fcm_dpo/q_t': 0.5005706548690796, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': -0.02287006378173828, 'margin_dpo/margin_mean': -0.02287048101425171, 'margin_dpo/margin_std': 0.41920793056488037, 'logps/chosen': -50.1435661315918, 'logps/rejected': -74.09991455078125, 'logps/ref_chosen': -50.14883804321289, 'logps/ref_rejected': -74.1280517578125, 'logits/chosen': -0.4974287748336792, 'logits/rejected': -0.43299180269241333, 'epoch': 0.0} 0%| | 1/681 [00:02<30:20, 2.68s/it] 0%| | 2/681 
[00:05<29:05, 2.57s/it] 0%| | 3/681 [00:07<28:49, 2.55s/it] 1%| | 4/681 [00:10<29:09, 2.58s/it] 1%| | 5/681 [00:12<29:01, 2.58s/it] {'loss': 1.3899, 'grad_norm': 90.14773559570312, 'learning_rate': 2.898550724637681e-08, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.5008102059364319, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': -0.03240281343460083, 'margin_dpo/margin_mean': -0.03240284323692322, 'margin_dpo/margin_std': 0.3555586636066437, 'logps/chosen': -56.07246017456055, 'logps/rejected': -78.67597198486328, 'logps/ref_chosen': -56.05734634399414, 'logps/ref_rejected': -78.69325256347656, 'logits/chosen': -0.4901035726070404, 'logits/rejected': -0.4534408450126648, 'epoch': 0.01} 1%| | 5/681 [00:12<29:01, 2.58s/it] 1%| | 6/681 [00:15<27:31, 2.45s/it] 1%| | 7/681 [00:17<26:54, 2.39s/it] 1%| | 8/681 [00:19<26:36, 2.37s/it] 1%|▏ | 9/681 [00:22<27:12, 2.43s/it] 1%|▏ | 10/681 [00:24<27:42, 2.48s/it] {'loss': 1.3839, 'grad_norm': 70.48045349121094, 'learning_rate': 6.521739130434782e-08, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.4993022382259369, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': 0.027925759553909302, 'margin_dpo/margin_mean': 0.027925794944167137, 'margin_dpo/margin_std': 0.37033817172050476, 'logps/chosen': -59.527122497558594, 'logps/rejected': -91.18089294433594, 'logps/ref_chosen': -59.54457473754883, 'logps/ref_rejected': -91.17041778564453, 'logits/chosen': -0.5015245079994202, 'logits/rejected': -0.4629823565483093, 'epoch': 0.01} 1%|▏ | 10/681 [00:24<27:42, 2.48s/it] 2%|▏ | 11/681 [00:27<28:25, 2.55s/it] 2%|▏ | 12/681 [00:30<28:33, 2.56s/it] 2%|▏ | 13/681 [00:32<28:55, 2.60s/it] 2%|▏ | 14/681 [00:35<28:32, 2.57s/it] 2%|▏ | 15/681 [00:37<28:23, 2.56s/it] {'loss': 1.3861, 'grad_norm': 64.33786010742188, 'learning_rate': 1.0144927536231885e-07, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.49986687302589417, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': 0.005324178840965033, 'margin_dpo/margin_mean': 0.005324071738868952, 
'margin_dpo/margin_std': 0.36571556329727173, 'logps/chosen': -58.83959197998047, 'logps/rejected': -92.95245361328125, 'logps/ref_chosen': -58.83195877075195, 'logps/ref_rejected': -92.93949890136719, 'logits/chosen': -0.4974799156188965, 'logits/rejected': -0.46847113966941833, 'epoch': 0.02} 2%|▏ | 15/681 [00:37<28:23, 2.56s/it] 2%|▏ | 16/681 [00:40<28:00, 2.53s/it] 2%|▏ | 17/681 [00:42<27:44, 2.51s/it] 3%|▎ | 18/681 [00:45<27:39, 2.50s/it] 3%|▎ | 19/681 [00:47<27:45, 2.52s/it] 3%|▎ | 20/681 [00:50<27:43, 2.52s/it] {'loss': 1.381, 'grad_norm': 73.8245620727539, 'learning_rate': 1.3768115942028986e-07, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.49860554933547974, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': 0.05579507350921631, 'margin_dpo/margin_mean': 0.055795006453990936, 'margin_dpo/margin_std': 0.33391329646110535, 'logps/chosen': -59.63999557495117, 'logps/rejected': -82.81639862060547, 'logps/ref_chosen': -59.6396598815918, 'logps/ref_rejected': -82.76026916503906, 'logits/chosen': -0.5040138959884644, 'logits/rejected': -0.45514219999313354, 'epoch': 0.03} 3%|▎ | 20/681 [00:50<27:43, 2.52s/it] 3%|▎ | 21/681 [00:52<27:31, 2.50s/it] 3%|▎ | 22/681 [00:55<27:55, 2.54s/it] 3%|▎ | 23/681 [00:58<28:51, 2.63s/it] 4%|▎ | 24/681 [01:00<28:53, 2.64s/it] 4%|▎ | 25/681 [01:03<28:50, 2.64s/it] {'loss': 1.366, 'grad_norm': 73.5445785522461, 'learning_rate': 1.7391304347826085e-07, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.49479326605796814, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': 0.20841345191001892, 'margin_dpo/margin_mean': 0.20841336250305176, 'margin_dpo/margin_std': 0.4185457229614258, 'logps/chosen': -53.173057556152344, 'logps/rejected': -89.17227172851562, 'logps/ref_chosen': -53.205284118652344, 'logps/ref_rejected': -88.99608612060547, 'logits/chosen': -0.5032899975776672, 'logits/rejected': -0.4763486981391907, 'epoch': 0.04} 4%|▎ | 25/681 [01:03<28:50, 2.64s/it] 4%|▍ | 26/681 [01:05<27:31, 2.52s/it] 4%|▍ | 27/681 [01:08<27:17, 2.50s/it] 
4%|▍ | 28/681 [01:10<27:23, 2.52s/it] 4%|▍ | 29/681 [01:13<26:20, 2.42s/it] 4%|▍ | 30/681 [01:15<26:56, 2.48s/it] {'loss': 1.3389, 'grad_norm': 87.73991394042969, 'learning_rate': 2.1014492753623187e-07, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.48778820037841797, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': 0.4891234338283539, 'margin_dpo/margin_mean': 0.4891238212585449, 'margin_dpo/margin_std': 0.5750466585159302, 'logps/chosen': -53.45922088623047, 'logps/rejected': -98.26947021484375, 'logps/ref_chosen': -53.5526008605957, 'logps/ref_rejected': -97.87371826171875, 'logits/chosen': -0.5239602327346802, 'logits/rejected': -0.48419055342674255, 'epoch': 0.04} 4%|▍ | 30/681 [01:15<26:56, 2.48s/it] 5%|▍ | 31/681 [01:18<27:28, 2.54s/it] 5%|▍ | 32/681 [01:21<28:03, 2.59s/it] 5%|▍ | 33/681 [01:23<27:44, 2.57s/it] 5%|▍ | 34/681 [01:26<27:49, 2.58s/it] 5%|▌ | 35/681 [01:28<28:04, 2.61s/it] {'loss': 1.3122, 'grad_norm': 82.94285583496094, 'learning_rate': 2.463768115942029e-07, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.48066458106040955, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': 0.7763983607292175, 'margin_dpo/margin_mean': 0.776398241519928, 'margin_dpo/margin_std': 0.8276771306991577, 'logps/chosen': -56.198211669921875, 'logps/rejected': -92.41334533691406, 'logps/ref_chosen': -56.3298454284668, 'logps/ref_rejected': -91.76858520507812, 'logits/chosen': -0.4989829957485199, 'logits/rejected': -0.4650956094264984, 'epoch': 0.05} 5%|▌ | 35/681 [01:28<28:04, 2.61s/it] 5%|▌ | 36/681 [01:31<28:09, 2.62s/it] 5%|▌ | 37/681 [01:34<27:59, 2.61s/it] 6%|▌ | 38/681 [01:36<26:41, 2.49s/it] 6%|▌ | 39/681 [01:38<26:43, 2.50s/it] 6%|▌ | 40/681 [01:41<27:17, 2.56s/it] {'loss': 1.2606, 'grad_norm': 60.41274642944336, 'learning_rate': 2.8260869565217386e-07, 'fcm_dpo/beta': 0.10000000894069672, 'fcm_dpo/q_t': 0.46632710099220276, 'fcm_dpo/delta': 0.0, 'fcm_dpo/margin': 1.359745979309082, 'margin_dpo/margin_mean': 1.359745979309082, 'margin_dpo/margin_std': 
1.4517606496810913, 'logps/chosen': -54.27339553833008, 'logps/rejected': -84.19175720214844, 'logps/ref_chosen': -54.38492965698242, 'logps/ref_rejected': -82.94353485107422, 'logits/chosen': -0.5347701907157898, 'logits/rejected': -0.4986083507537842, 'epoch': 0.06} 6%|▌ | 40/681 [01:41<27:17, 2.56s/it] 6%|▌ | 41/681 [01:44<27:13, 2.55s/it] 6%|▌ | 42/681 [01:46<27:07, 2.55s/it] 6%|▋ | 43/681 [01:49<27:07, 2.55s/it] 6%|▋ | 44/681 [01:51<27:22, 2.58s/it] 7%|▋ | 45/681 [01:54<27:26, 2.59s/it] {'loss': 1.1498, 'grad_norm': 74.45612335205078, 'learning_rate': 3.188405797101449e-07, 'fcm_dpo/beta': 0.1127050369977951, 'fcm_dpo/q_t': 0.4357197880744934, 'fcm_dpo/delta': 0.25362294912338257, 'fcm_dpo/margin': 2.3980860710144043, 'margin_dpo/margin_mean': 2.398085832595825, 'margin_dpo/margin_std': 2.2269370555877686, 'logps/chosen': -54.6392822265625, 'logps/rejected': -100.20148468017578, 'logps/ref_chosen': -54.862335205078125, 'logps/ref_rejected': -98.0264663696289, 'logits/chosen': -0.5095103979110718, 'logits/rejected': -0.48132508993148804, 'epoch': 0.07} 7%|▋ | 45/681 [01:54<27:26, 2.59s/it] 7%|▋ | 46/681 [01:57<27:32, 2.60s/it] 7%|▋ | 47/681 [01:59<27:35, 2.61s/it] 7%|▋ | 48/681 [02:02<27:57, 2.65s/it] 7%|▋ | 49/681 [02:04<27:33, 2.62s/it] 7%|▋ | 50/681 [02:07<27:37, 2.63s/it] {'loss': 1.0187, 'grad_norm': 79.67459869384766, 'learning_rate': 3.5507246376811595e-07, 'fcm_dpo/beta': 0.141450434923172, 'fcm_dpo/q_t': 0.3908053934574127, 'fcm_dpo/delta': 0.12390259653329849, 'fcm_dpo/margin': 3.386523485183716, 'margin_dpo/margin_mean': 3.386524200439453, 'margin_dpo/margin_std': 3.3679816722869873, 'logps/chosen': -58.14664840698242, 'logps/rejected': -94.92338562011719, 'logps/ref_chosen': -58.304595947265625, 'logps/ref_rejected': -91.69480895996094, 'logits/chosen': -0.5551148653030396, 'logits/rejected': -0.5035051107406616, 'epoch': 0.07} 7%|▋ | 50/681 [02:07<27:37, 2.63s/it] 7%|▋ | 51/681 [02:10<27:30, 2.62s/it] 8%|▊ | 52/681 [02:12<26:47, 2.56s/it] 8%|▊ | 
53/681 [02:15<26:45, 2.56s/it] 8%|▊ | 54/681 [02:17<26:04, 2.49s/it] 8%|▊ | 55/681 [02:19<25:11, 2.41s/it] {'loss': 0.8962, 'grad_norm': 62.811153411865234, 'learning_rate': 3.9130434782608694e-07, 'fcm_dpo/beta': 0.135177880525589, 'fcm_dpo/q_t': 0.3425524830818176, 'fcm_dpo/delta': -0.16671812534332275, 'fcm_dpo/margin': 5.600610256195068, 'margin_dpo/margin_mean': 5.600610256195068, 'margin_dpo/margin_std': 5.793082237243652, 'logps/chosen': -56.37145233154297, 'logps/rejected': -91.59982299804688, 'logps/ref_chosen': -56.06591796875, 'logps/ref_rejected': -85.69367980957031, 'logits/chosen': -0.6010715961456299, 'logits/rejected': -0.5568638443946838, 'epoch': 0.08} 8%|▊ | 55/681 [02:19<25:11, 2.41s/it] 8%|▊ | 56/681 [02:22<26:01, 2.50s/it] 8%|▊ | 57/681 [02:24<25:52, 2.49s/it] 9%|▊ | 58/681 [02:27<26:16, 2.53s/it] 9%|▊ | 59/681 [02:30<26:39, 2.57s/it] 9%|▉ | 60/681 [02:32<25:57, 2.51s/it] {'loss': 0.8969, 'grad_norm': 67.2679214477539, 'learning_rate': 4.2753623188405794e-07, 'fcm_dpo/beta': 0.11089271306991577, 'fcm_dpo/q_t': 0.33725228905677795, 'fcm_dpo/delta': -0.1925317347049713, 'fcm_dpo/margin': 7.025670528411865, 'margin_dpo/margin_mean': 7.025670528411865, 'margin_dpo/margin_std': 7.496710777282715, 'logps/chosen': -61.9241828918457, 'logps/rejected': -97.98988342285156, 'logps/ref_chosen': -60.6871337890625, 'logps/ref_rejected': -89.72715759277344, 'logits/chosen': -0.6061812043190002, 'logits/rejected': -0.5570945739746094, 'epoch': 0.09} 9%|▉ | 60/681 [02:32<25:57, 2.51s/it] 9%|▉ | 61/681 [02:35<26:16, 2.54s/it] 9%|▉ | 62/681 [02:37<26:45, 2.59s/it] 9%|▉ | 63/681 [02:40<26:29, 2.57s/it] 9%|▉ | 64/681 [02:42<26:04, 2.54s/it] 10%|▉ | 65/681 [02:45<26:12, 2.55s/it] {'loss': 0.923, 'grad_norm': 48.89730453491211, 'learning_rate': 4.63768115942029e-07, 'fcm_dpo/beta': 0.09299755096435547, 'fcm_dpo/q_t': 0.3424831032752991, 'fcm_dpo/delta': -0.18385855853557587, 'fcm_dpo/margin': 8.298527717590332, 'margin_dpo/margin_mean': 8.298527717590332, 
'margin_dpo/margin_std': 9.724918365478516, 'logps/chosen': -63.573402404785156, 'logps/rejected': -103.41975402832031, 'logps/ref_chosen': -61.75325393676758, 'logps/ref_rejected': -93.30108642578125, 'logits/chosen': -0.6179511547088623, 'logits/rejected': -0.5864478945732117, 'epoch': 0.1} 10%|▉ | 65/681 [02:45<26:12, 2.55s/it] 10%|▉ | 66/681 [02:48<26:21, 2.57s/it] 10%|▉ | 67/681 [02:50<25:24, 2.48s/it] 10%|▉ | 68/681 [02:52<25:07, 2.46s/it] 10%|█ | 69/681 [02:55<26:08, 2.56s/it] 10%|█ | 70/681 [02:57<25:45, 2.53s/it] {'loss': 0.9041, 'grad_norm': 47.65688705444336, 'learning_rate': 5e-07, 'fcm_dpo/beta': 0.07844052463769913, 'fcm_dpo/q_t': 0.3441976308822632, 'fcm_dpo/delta': -0.1582036018371582, 'fcm_dpo/margin': 9.539754867553711, 'margin_dpo/margin_mean': 9.539755821228027, 'margin_dpo/margin_std': 10.295551300048828, 'logps/chosen': -62.56956100463867, 'logps/rejected': -96.57740783691406, 'logps/ref_chosen': -59.548004150390625, 'logps/ref_rejected': -84.01609802246094, 'logits/chosen': -0.6304086446762085, 'logits/rejected': -0.5917232632637024, 'epoch': 0.1} 10%|█ | 70/681 [02:58<25:45, 2.53s/it] 10%|█ | 71/681 [03:00<26:01, 2.56s/it] 11%|█ | 72/681 [03:03<26:11, 2.58s/it] 11%|█ | 73/681 [03:05<26:14, 2.59s/it] 11%|█ | 74/681 [03:08<26:01, 2.57s/it] 11%|█ | 75/681 [03:11<26:11, 2.59s/it] {'loss': 0.873, 'grad_norm': 36.49312973022461, 'learning_rate': 4.999176576834721e-07, 'fcm_dpo/beta': 0.06165589019656181, 'fcm_dpo/q_t': 0.3237493336200714, 'fcm_dpo/delta': -0.32924890518188477, 'fcm_dpo/margin': 14.760737419128418, 'margin_dpo/margin_mean': 14.760736465454102, 'margin_dpo/margin_std': 17.107942581176758, 'logps/chosen': -65.28561401367188, 'logps/rejected': -118.2331771850586, 'logps/ref_chosen': -59.86931228637695, 'logps/ref_rejected': -98.05613708496094, 'logits/chosen': -0.6605738997459412, 'logits/rejected': -0.6328510642051697, 'epoch': 0.11} 11%|█ | 75/681 [03:11<26:11, 2.59s/it] 11%|█ | 76/681 [03:13<26:00, 2.58s/it] 11%|█▏ | 77/681 
[03:15<24:59, 2.48s/it] 11%|█▏ | 78/681 [03:18<25:26, 2.53s/it] 12%|█▏ | 79/681 [03:21<25:42, 2.56s/it] 12%|█▏ | 80/681 [03:23<25:35, 2.56s/it] {'loss': 0.9203, 'grad_norm': 35.74776077270508, 'learning_rate': 4.996706849759452e-07, 'fcm_dpo/beta': 0.04629804939031601, 'fcm_dpo/q_t': 0.341538667678833, 'fcm_dpo/delta': -0.19796454906463623, 'fcm_dpo/margin': 16.89699935913086, 'margin_dpo/margin_mean': 16.896997451782227, 'margin_dpo/margin_std': 19.718297958374023, 'logps/chosen': -63.93366622924805, 'logps/rejected': -111.06534576416016, 'logps/ref_chosen': -56.18925857543945, 'logps/ref_rejected': -86.42393493652344, 'logits/chosen': -0.6835442781448364, 'logits/rejected': -0.6468649506568909, 'epoch': 0.12} 12%|█▏ | 80/681 [03:23<25:35, 2.56s/it] 12%|█▏ | 81/681 [03:26<26:10, 2.62s/it] 12%|█▏ | 82/681 [03:28<25:55, 2.60s/it] 12%|█▏ | 83/681 [03:31<25:19, 2.54s/it] 12%|█▏ | 84/681 [03:34<25:50, 2.60s/it] 12%|█▏ | 85/681 [03:36<25:47, 2.60s/it] {'loss': 0.9443, 'grad_norm': 34.31068420410156, 'learning_rate': 4.992592445678582e-07, 'fcm_dpo/beta': 0.0381317213177681, 'fcm_dpo/q_t': 0.34707337617874146, 'fcm_dpo/delta': -0.16901178658008575, 'fcm_dpo/margin': 19.726295471191406, 'margin_dpo/margin_mean': 19.726295471191406, 'margin_dpo/margin_std': 24.040042877197266, 'logps/chosen': -70.46139526367188, 'logps/rejected': -128.18124389648438, 'logps/ref_chosen': -60.018287658691406, 'logps/ref_rejected': -98.01185607910156, 'logits/chosen': -0.6622103452682495, 'logits/rejected': -0.6311969757080078, 'epoch': 0.12} 12%|█▏ | 85/681 [03:36<25:47, 2.60s/it] 13%|█▎ | 86/681 [03:39<26:01, 2.62s/it] 13%|█▎ | 87/681 [03:41<25:52, 2.61s/it] 13%|█▎ | 88/681 [03:44<25:52, 2.62s/it] 13%|█▎ | 89/681 [03:47<25:23, 2.57s/it] 13%|█▎ | 90/681 [03:49<24:59, 2.54s/it] {'loss': 1.0061, 'grad_norm': 35.00300216674805, 'learning_rate': 4.986836074908615e-07, 'fcm_dpo/beta': 0.03405915945768356, 'fcm_dpo/q_t': 0.3624621331691742, 'fcm_dpo/delta': -0.11596596240997314, 'fcm_dpo/margin': 
20.768291473388672, 'margin_dpo/margin_mean': 20.768291473388672, 'margin_dpo/margin_std': 29.607013702392578, 'logps/chosen': -73.39559173583984, 'logps/rejected': -131.07809448242188, 'logps/ref_chosen': -59.8709831237793, 'logps/ref_rejected': -96.78519439697266, 'logits/chosen': -0.7018736600875854, 'logits/rejected': -0.6867517232894897, 'epoch': 0.13} 13%|█▎ | 90/681 [03:49<24:59, 2.54s/it] 13%|█▎ | 91/681 [03:52<25:07, 2.56s/it] 14%|█▎ | 92/681 [03:54<24:50, 2.53s/it] 14%|█▎ | 93/681 [03:56<24:01, 2.45s/it] 14%|█▍ | 94/681 [03:59<25:05, 2.56s/it] 14%|█▍ | 95/681 [04:02<24:41, 2.53s/it] {'loss': 0.9664, 'grad_norm': 27.68400764465332, 'learning_rate': 4.979441529392784e-07, 'fcm_dpo/beta': 0.030608216300606728, 'fcm_dpo/q_t': 0.36035576462745667, 'fcm_dpo/delta': -0.07700999826192856, 'fcm_dpo/margin': 21.932090759277344, 'margin_dpo/margin_mean': 21.932090759277344, 'margin_dpo/margin_std': 26.880752563476562, 'logps/chosen': -69.35963439941406, 'logps/rejected': -119.02693939208984, 'logps/ref_chosen': -55.94385528564453, 'logps/ref_rejected': -83.6790542602539, 'logits/chosen': -0.708720326423645, 'logits/rejected': -0.6767187714576721, 'epoch': 0.14} 14%|█▍ | 95/681 [04:02<24:41, 2.53s/it] 14%|█▍ | 96/681 [04:04<24:42, 2.53s/it] 14%|█▍ | 97/681 [04:07<24:44, 2.54s/it] 14%|█▍ | 98/681 [04:09<24:14, 2.50s/it] 15%|█▍ | 99/681 [04:11<23:22, 2.41s/it] 15%|█▍ | 100/681 [04:14<23:53, 2.47s/it] {'loss': 0.9722, 'grad_norm': 30.916765213012695, 'learning_rate': 4.970413680203148e-07, 'fcm_dpo/beta': 0.028173187747597694, 'fcm_dpo/q_t': 0.36101511120796204, 'fcm_dpo/delta': -0.068596251308918, 'fcm_dpo/margin': 23.49247169494629, 'margin_dpo/margin_mean': 23.49247169494629, 'margin_dpo/margin_std': 28.96224594116211, 'logps/chosen': -71.47965240478516, 'logps/rejected': -124.03050231933594, 'logps/ref_chosen': -57.05888748168945, 'logps/ref_rejected': -86.11727142333984, 'logits/chosen': -0.6772828698158264, 'logits/rejected': -0.648100733757019, 'epoch': 0.15} 
15%|█▍ | 100/681 [04:14<23:53, 2.47s/it] 15%|█▍ | 101/681 [04:16<23:39, 2.45s/it] 15%|█▍ | 102/681 [04:19<23:26, 2.43s/it] 15%|█▌ | 103/681 [04:21<23:09, 2.40s/it] 15%|█▌ | 104/681 [04:23<22:36, 2.35s/it] 15%|█▌ | 105/681 [04:26<23:34, 2.46s/it] {'loss': 0.9567, 'grad_norm': 26.486059188842773, 'learning_rate': 4.959758474331832e-07, 'fcm_dpo/beta': 0.027121257036924362, 'fcm_dpo/q_t': 0.35333341360092163, 'fcm_dpo/delta': -0.13553811609745026, 'fcm_dpo/margin': 26.961578369140625, 'margin_dpo/margin_mean': 26.961578369140625, 'margin_dpo/margin_std': 32.831111907958984, 'logps/chosen': -76.32167053222656, 'logps/rejected': -130.57305908203125, 'logps/ref_chosen': -59.20774459838867, 'logps/ref_rejected': -86.49754333496094, 'logits/chosen': -0.6960592269897461, 'logits/rejected': -0.6600139141082764, 'epoch': 0.15} 15%|█▌ | 105/681 [04:26<23:34, 2.46s/it] 16%|█▌ | 106/681 [04:28<23:30, 2.45s/it] 16%|█▌ | 107/681 [04:31<24:00, 2.51s/it] 16%|█▌ | 108/681 [04:33<23:35, 2.47s/it] 16%|█▌ | 109/681 [04:36<24:08, 2.53s/it] 16%|█▌ | 110/681 [04:39<24:25, 2.57s/it] {'loss': 0.9511, 'grad_norm': 24.114713668823242, 'learning_rate': 4.947482930773511e-07, 'fcm_dpo/beta': 0.02301758900284767, 'fcm_dpo/q_t': 0.3538368046283722, 'fcm_dpo/delta': -0.11018934100866318, 'fcm_dpo/margin': 30.556344985961914, 'margin_dpo/margin_mean': 30.556344985961914, 'margin_dpo/margin_std': 35.99966812133789, 'logps/chosen': -78.81887817382812, 'logps/rejected': -139.77645874023438, 'logps/ref_chosen': -60.437957763671875, 'logps/ref_rejected': -90.83917999267578, 'logits/chosen': -0.6646202206611633, 'logits/rejected': -0.6281755566596985, 'epoch': 0.16} 16%|█▌ | 110/681 [04:39<24:25, 2.57s/it] 16%|█▋ | 111/681 [04:41<24:18, 2.56s/it] 16%|█▋ | 112/681 [04:44<23:21, 2.46s/it] 17%|█▋ | 113/681 [04:46<23:45, 2.51s/it] 17%|█▋ | 114/681 [04:49<23:35, 2.50s/it] 17%|█▋ | 115/681 [04:51<24:01, 2.55s/it] {'loss': 0.9992, 'grad_norm': 40.84029769897461, 'learning_rate': 4.933595135901732e-07, 
'fcm_dpo/beta': 0.021153923124074936, 'fcm_dpo/q_t': 0.3688841462135315, 'fcm_dpo/delta': -0.041334737092256546, 'fcm_dpo/margin': 30.124019622802734, 'margin_dpo/margin_mean': 30.124013900756836, 'margin_dpo/margin_std': 39.94293212890625, 'logps/chosen': -84.20191955566406, 'logps/rejected': -137.90447998046875, 'logps/ref_chosen': -61.7908821105957, 'logps/ref_rejected': -85.36943054199219, 'logits/chosen': -0.6649340391159058, 'logits/rejected': -0.6294328570365906, 'epoch': 0.17} 17%|█▋ | 115/681 [04:51<24:01, 2.55s/it] 17%|█▋ | 116/681 [04:54<23:20, 2.48s/it] 17%|█▋ | 117/681 [04:56<23:09, 2.46s/it] 17%|█▋ | 118/681 [04:59<23:29, 2.50s/it] 17%|█▋ | 119/681 [05:01<24:13, 2.59s/it] 18%|█▊ | 120/681 [05:04<24:08, 2.58s/it] {'loss': 0.9818, 'grad_norm': 26.78792381286621, 'learning_rate': 4.918104238142103e-07, 'fcm_dpo/beta': 0.02078414149582386, 'fcm_dpo/q_t': 0.36713889241218567, 'fcm_dpo/delta': -0.037118665874004364, 'fcm_dpo/margin': 30.540584564208984, 'margin_dpo/margin_mean': 30.540584564208984, 'margin_dpo/margin_std': 38.079750061035156, 'logps/chosen': -91.19302368164062, 'logps/rejected': -143.1626434326172, 'logps/ref_chosen': -65.3261489868164, 'logps/ref_rejected': -86.75518798828125, 'logits/chosen': -0.6711692214012146, 'logits/rejected': -0.645135760307312, 'epoch': 0.18} 18%|█▊ | 120/681 [05:04<24:08, 2.58s/it] 18%|█▊ | 121/681 [05:07<23:50, 2.55s/it] 18%|█▊ | 122/681 [05:09<23:08, 2.48s/it] 18%|█▊ | 123/681 [05:12<23:42, 2.55s/it] 18%|█▊ | 124/681 [05:14<23:44, 2.56s/it] 18%|█▊ | 125/681 [05:17<23:24, 2.53s/it] {'loss': 0.9204, 'grad_norm': 23.552217483520508, 'learning_rate': 4.90102044194588e-07, 'fcm_dpo/beta': 0.017505459487438202, 'fcm_dpo/q_t': 0.3401046693325043, 'fcm_dpo/delta': -0.22550848126411438, 'fcm_dpo/margin': 46.04296112060547, 'margin_dpo/margin_mean': 46.0429573059082, 'margin_dpo/margin_std': 55.54075241088867, 'logps/chosen': -87.12136840820312, 'logps/rejected': -176.0518035888672, 'logps/ref_chosen': 
-58.323204040527344, 'logps/ref_rejected': -101.2106704711914, 'logits/chosen': -0.6151807904243469, 'logits/rejected': -0.6104758381843567, 'epoch': 0.18} 18%|█▊ | 125/681 [05:17<23:24, 2.53s/it] 19%|█▊ | 126/681 [05:19<23:43, 2.56s/it] 19%|█▊ | 127/681 [05:22<23:53, 2.59s/it] 19%|█▉ | 128/681 [05:25<24:06, 2.62s/it] 19%|█▉ | 129/681 [05:27<23:55, 2.60s/it] 19%|█▉ | 130/681 [05:30<23:41, 2.58s/it] {'loss': 1.0035, 'grad_norm': 22.16413116455078, 'learning_rate': 4.882355001067891e-07, 'fcm_dpo/beta': 0.01598326489329338, 'fcm_dpo/q_t': 0.3680208623409271, 'fcm_dpo/delta': -0.04410712048411369, 'fcm_dpo/margin': 40.082298278808594, 'margin_dpo/margin_mean': 40.082298278808594, 'margin_dpo/margin_std': 53.219139099121094, 'logps/chosen': -86.93000793457031, 'logps/rejected': -156.78482055664062, 'logps/ref_chosen': -56.38518524169922, 'logps/ref_rejected': -86.15767669677734, 'logits/chosen': -0.5932961106300354, 'logits/rejected': -0.5749183893203735, 'epoch': 0.19} 19%|█▉ | 130/681 [05:30<23:41, 2.58s/it] 19%|█▉ | 131/681 [05:32<23:45, 2.59s/it] 19%|█▉ | 132/681 [05:35<23:22, 2.56s/it] 20%|█▉ | 133/681 [05:37<23:31, 2.58s/it] 20%|█▉ | 134/681 [05:40<22:49, 2.50s/it] 20%|█▉ | 135/681 [05:42<22:39, 2.49s/it] {'loss': 0.9545, 'grad_norm': 25.677669525146484, 'learning_rate': 4.862120211153265e-07, 'fcm_dpo/beta': 0.014573054388165474, 'fcm_dpo/q_t': 0.3577379286289215, 'fcm_dpo/delta': -0.09526528418064117, 'fcm_dpo/margin': 47.24794387817383, 'margin_dpo/margin_mean': 47.24794387817383, 'margin_dpo/margin_std': 57.28125762939453, 'logps/chosen': -86.5953140258789, 'logps/rejected': -174.51339721679688, 'logps/ref_chosen': -54.59065628051758, 'logps/ref_rejected': -95.26080322265625, 'logits/chosen': -0.5761778950691223, 'logits/rejected': -0.5731192827224731, 'epoch': 0.2} 20%|█▉ | 135/681 [05:42<22:39, 2.49s/it] 20%|█▉ | 136/681 [05:45<23:11, 2.55s/it] 20%|██ | 137/681 [05:47<23:08, 2.55s/it] 20%|██ | 138/681 [05:50<22:24, 2.48s/it] 20%|██ | 139/681 [05:52<22:08, 
2.45s/it] 21%|██ | 140/681 [05:54<21:47, 2.42s/it] {'loss': 0.9755, 'grad_norm': 25.558738708496094, 'learning_rate': 4.840329401637809e-07, 'fcm_dpo/beta': 0.013362633995711803, 'fcm_dpo/q_t': 0.3625403940677643, 'fcm_dpo/delta': -0.08761467784643173, 'fcm_dpo/margin': 51.02484893798828, 'margin_dpo/margin_mean': 51.02485275268555, 'margin_dpo/margin_std': 65.68046569824219, 'logps/chosen': -96.27259826660156, 'logps/rejected': -184.53277587890625, 'logps/ref_chosen': -56.04347610473633, 'logps/ref_rejected': -93.27880859375, 'logits/chosen': -0.5525860786437988, 'logits/rejected': -0.545661449432373, 'epoch': 0.21} 21%|██ | 140/681 [05:54<21:47, 2.42s/it] 21%|██ | 141/681 [05:57<22:31, 2.50s/it] 21%|██ | 142/681 [06:00<23:08, 2.58s/it] 21%|██ | 143/681 [06:03<23:54, 2.67s/it] 21%|██ | 144/681 [06:05<23:57, 2.68s/it] 21%|██▏ | 145/681 [06:08<23:10, 2.59s/it] {'loss': 1.0202, 'grad_norm': 29.300233840942383, 'learning_rate': 4.816996926967401e-07, 'fcm_dpo/beta': 0.012635116465389729, 'fcm_dpo/q_t': 0.3737943470478058, 'fcm_dpo/delta': -0.008492978289723396, 'fcm_dpo/margin': 48.067604064941406, 'margin_dpo/margin_mean': 48.067596435546875, 'margin_dpo/margin_std': 66.08811950683594, 'logps/chosen': -107.9009017944336, 'logps/rejected': -180.85520935058594, 'logps/ref_chosen': -61.4414176940918, 'logps/ref_rejected': -86.32813262939453, 'logits/chosen': -0.5054234862327576, 'logits/rejected': -0.4867471754550934, 'epoch': 0.21} 21%|██▏ | 145/681 [06:08<23:10, 2.59s/it] 21%|██▏ | 146/681 [06:10<23:00, 2.58s/it] 22%|██▏ | 147/681 [06:13<23:03, 2.59s/it] 22%|██▏ | 148/681 [06:16<22:55, 2.58s/it] 22%|██▏ | 149/681 [06:18<23:04, 2.60s/it] 22%|██▏ | 150/681 [06:21<22:55, 2.59s/it] {'loss': 1.0113, 'grad_norm': 25.043779373168945, 'learning_rate': 4.792138157142157e-07, 'fcm_dpo/beta': 0.012664164416491985, 'fcm_dpo/q_t': 0.3732047379016876, 'fcm_dpo/delta': -0.01627928391098976, 'fcm_dpo/margin': 48.467201232910156, 'margin_dpo/margin_mean': 48.467201232910156, 
'margin_dpo/margin_std': 64.91874694824219, 'logps/chosen': -104.0806884765625, 'logps/rejected': -182.61329650878906, 'logps/ref_chosen': -57.70451736450195, 'logps/ref_rejected': -87.76991271972656, 'logits/chosen': -0.5404887199401855, 'logits/rejected': -0.5210872888565063, 'epoch': 0.22} 22%|██▏ | 150/681 [06:21<22:55, 2.59s/it] 22%|██▏ | 151/681 [06:23<22:27, 2.54s/it] 22%|██▏ | 152/681 [06:26<22:58, 2.61s/it] 22%|██▏ | 153/681 [06:29<22:46, 2.59s/it] 23%|██▎ | 154/681 [06:31<22:59, 2.62s/it] 23%|██▎ | 155/681 [06:34<23:13, 2.65s/it] {'loss': 0.9764, 'grad_norm': 23.727567672729492, 'learning_rate': 4.7657694675916247e-07, 'fcm_dpo/beta': 0.011945498175919056, 'fcm_dpo/q_t': 0.3624417185783386, 'fcm_dpo/delta': -0.06720416247844696, 'fcm_dpo/margin': 55.43426513671875, 'margin_dpo/margin_mean': 55.43426513671875, 'margin_dpo/margin_std': 69.9148178100586, 'logps/chosen': -105.16175842285156, 'logps/rejected': -193.30606079101562, 'logps/ref_chosen': -62.08925247192383, 'logps/ref_rejected': -94.79930114746094, 'logits/chosen': -0.581199586391449, 'logits/rejected': -0.5655697584152222, 'epoch': 0.23} 23%|██▎ | 155/681 [06:34<23:13, 2.65s/it] 23%|██▎ | 156/681 [06:37<23:09, 2.65s/it] 23%|██▎ | 157/681 [06:39<22:11, 2.54s/it] 23%|██▎ | 158/681 [06:41<22:21, 2.57s/it] 23%|██▎ | 159/681 [06:44<22:30, 2.59s/it] 23%|██▎ | 160/681 [06:47<22:13, 2.56s/it] {'loss': 1.0372, 'grad_norm': 25.801401138305664, 'learning_rate': 4.737908228387656e-07, 'fcm_dpo/beta': 0.011539025232195854, 'fcm_dpo/q_t': 0.3720964789390564, 'fcm_dpo/delta': -0.045818835496902466, 'fcm_dpo/margin': 55.6801643371582, 'margin_dpo/margin_mean': 55.6801643371582, 'margin_dpo/margin_std': 83.104736328125, 'logps/chosen': -124.75065612792969, 'logps/rejected': -210.2032928466797, 'logps/ref_chosen': -67.15288543701172, 'logps/ref_rejected': -96.92537689208984, 'logits/chosen': -0.5244706869125366, 'logits/rejected': -0.5115067362785339, 'epoch': 0.23} 23%|██▎ | 160/681 [06:47<22:13, 2.56s/it] 
24%|██▎ | 161/681 [06:49<21:18, 2.46s/it] 24%|██▍ | 162/681 [06:52<22:07, 2.56s/it] 24%|██▍ | 163/681 [06:54<22:02, 2.55s/it] 24%|██▍ | 164/681 [06:57<22:05, 2.56s/it] 24%|██▍ | 165/681 [06:59<21:52, 2.54s/it] {'loss': 1.0098, 'grad_norm': 37.94820022583008, 'learning_rate': 4.708572792802069e-07, 'fcm_dpo/beta': 0.010906776413321495, 'fcm_dpo/q_t': 0.3736818730831146, 'fcm_dpo/delta': -0.010979633778333664, 'fcm_dpo/margin': 55.84454345703125, 'margin_dpo/margin_mean': 55.84454345703125, 'margin_dpo/margin_std': 74.67647552490234, 'logps/chosen': -110.22456359863281, 'logps/rejected': -188.9801025390625, 'logps/ref_chosen': -57.40401077270508, 'logps/ref_rejected': -80.31498718261719, 'logits/chosen': -0.5201188325881958, 'logits/rejected': -0.49569135904312134, 'epoch': 0.24} 24%|██▍ | 165/681 [06:59<21:52, 2.54s/it] 24%|██▍ | 166/681 [07:01<21:00, 2.45s/it] 25%|██▍ | 167/681 [07:04<21:40, 2.53s/it] 25%|██▍ | 168/681 [07:07<21:45, 2.55s/it] 25%|██▍ | 169/681 [07:10<22:14, 2.61s/it] 25%|██▍ | 170/681 [07:12<22:16, 2.61s/it] {'loss': 0.9592, 'grad_norm': 23.627363204956055, 'learning_rate': 4.6777824852166437e-07, 'fcm_dpo/beta': 0.010051427409052849, 'fcm_dpo/q_t': 0.3577578365802765, 'fcm_dpo/delta': -0.10779444873332977, 'fcm_dpo/margin': 69.25593566894531, 'margin_dpo/margin_mean': 69.25593566894531, 'margin_dpo/margin_std': 85.97371673583984, 'logps/chosen': -106.43888854980469, 'logps/rejected': -209.40512084960938, 'logps/ref_chosen': -52.029144287109375, 'logps/ref_rejected': -85.73944091796875, 'logits/chosen': -0.45740675926208496, 'logits/rejected': -0.4491025507450104, 'epoch': 0.25} 25%|██▍ | 170/681 [07:12<22:16, 2.61s/it] 25%|██▌ | 171/681 [07:14<21:25, 2.52s/it] 25%|██▌ | 172/681 [07:17<21:14, 2.50s/it] 25%|██▌ | 173/681 [07:19<21:06, 2.49s/it] 26%|██▌ | 174/681 [07:22<21:08, 2.50s/it] 26%|██▌ | 175/681 [07:25<21:23, 2.54s/it] {'loss': 0.9915, 'grad_norm': 29.522018432617188, 'learning_rate': 4.645557588393406e-07, 'fcm_dpo/beta': 
0.009930510073900223, 'fcm_dpo/q_t': 0.3673258423805237, 'fcm_dpo/delta': -0.047995198518037796, 'fcm_dpo/margin': 65.00736236572266, 'margin_dpo/margin_mean': 65.00736236572266, 'margin_dpo/margin_std': 84.73751831054688, 'logps/chosen': -128.42086791992188, 'logps/rejected': -223.41519165039062, 'logps/ref_chosen': -62.996971130371094, 'logps/ref_rejected': -92.98394012451172, 'logits/chosen': -0.45035696029663086, 'logits/rejected': -0.4322957396507263, 'epoch': 0.26} 26%|██▌ | 175/681 [07:25<21:23, 2.54s/it] 26%|██▌ | 176/681 [07:27<20:55, 2.49s/it] 26%|██▌ | 177/681 [07:30<21:13, 2.53s/it] 26%|██▌ | 178/681 [07:32<21:19, 2.54s/it] 26%|██▋ | 179/681 [07:35<21:38, 2.59s/it] 26%|██▋ | 180/681 [07:37<21:21, 2.56s/it] {'loss': 0.9528, 'grad_norm': 23.635892868041992, 'learning_rate': 4.611919330113591e-07, 'fcm_dpo/beta': 0.008855604566633701, 'fcm_dpo/q_t': 0.35542401671409607, 'fcm_dpo/delta': -0.11197604238986969, 'fcm_dpo/margin': 79.53601837158203, 'margin_dpo/margin_mean': 79.53601837158203, 'margin_dpo/margin_std': 97.05994415283203, 'logps/chosen': -127.61091613769531, 'logps/rejected': -247.19143676757812, 'logps/ref_chosen': -57.0670280456543, 'logps/ref_rejected': -97.1115493774414, 'logits/chosen': -0.38669413328170776, 'logits/rejected': -0.3846648335456848, 'epoch': 0.26} 26%|██▋ | 180/681 [07:37<21:21, 2.56s/it] 27%|██▋ | 181/681 [07:40<21:26, 2.57s/it] 27%|██▋ | 182/681 [07:42<21:05, 2.54s/it] 27%|██▋ | 183/681 [07:45<20:33, 2.48s/it] 27%|██▋ | 184/681 [07:47<21:04, 2.55s/it] 27%|██▋ | 185/681 [07:50<20:44, 2.51s/it] {'loss': 1.0713, 'grad_norm': 26.241926193237305, 'learning_rate': 4.5768898691940836e-07, 'fcm_dpo/beta': 0.008529609069228172, 'fcm_dpo/q_t': 0.39326274394989014, 'fcm_dpo/delta': 0.05834978073835373, 'fcm_dpo/margin': 58.99933624267578, 'margin_dpo/margin_mean': 58.99933624267578, 'margin_dpo/margin_std': 85.73370361328125, 'logps/chosen': -120.03946685791016, 'logps/rejected': -199.70809936523438, 'logps/ref_chosen': 
-54.840736389160156, 'logps/ref_rejected': -75.51002502441406, 'logits/chosen': -0.421181857585907, 'logits/rejected': -0.40174850821495056, 'epoch': 0.27} 27%|██▋ | 185/681 [07:50<20:44, 2.51s/it] 27%|██▋ | 186/681 [07:52<20:38, 2.50s/it] 27%|██▋ | 187/681 [07:55<20:06, 2.44s/it] 28%|██▊ | 188/681 [07:57<20:20, 2.48s/it] 28%|██▊ | 189/681 [08:00<20:26, 2.49s/it] 28%|██▊ | 190/681 [08:02<19:48, 2.42s/it] {'loss': 0.9793, 'grad_norm': 28.541696548461914, 'learning_rate': 4.5404922808905543e-07, 'fcm_dpo/beta': 0.008668321184813976, 'fcm_dpo/q_t': 0.3645266592502594, 'fcm_dpo/delta': -0.054877202957868576, 'fcm_dpo/margin': 75.03819274902344, 'margin_dpo/margin_mean': 75.03819274902344, 'margin_dpo/margin_std': 94.09630584716797, 'logps/chosen': -127.11979675292969, 'logps/rejected': -231.29647827148438, 'logps/ref_chosen': -57.72148895263672, 'logps/ref_rejected': -86.85997009277344, 'logits/chosen': -0.41162386536598206, 'logits/rejected': -0.39615827798843384, 'epoch': 0.28} 28%|██▊ | 190/681 [08:02<19:48, 2.42s/it] 28%|██▊ | 191/681 [08:05<20:29, 2.51s/it] 28%|██▊ | 192/681 [08:07<20:33, 2.52s/it] 28%|██▊ | 193/681 [08:10<20:13, 2.49s/it] 28%|██▊ | 194/681 [08:12<19:36, 2.42s/it] 29%|██▊ | 195/681 [08:14<20:03, 2.48s/it] {'loss': 0.9959, 'grad_norm': 28.007156372070312, 'learning_rate': 4.5027505416968985e-07, 'fcm_dpo/beta': 0.008089645765721798, 'fcm_dpo/q_t': 0.3677811920642853, 'fcm_dpo/delta': -0.03434378653764725, 'fcm_dpo/margin': 77.82075500488281, 'margin_dpo/margin_mean': 77.82075500488281, 'margin_dpo/margin_std': 99.57084655761719, 'logps/chosen': -140.86399841308594, 'logps/rejected': -249.8879852294922, 'logps/ref_chosen': -58.26164627075195, 'logps/ref_rejected': -89.46485900878906, 'logits/chosen': -0.3651648759841919, 'logits/rejected': -0.35718274116516113, 'epoch': 0.29} 29%|██▊ | 195/681 [08:15<20:03, 2.48s/it] 29%|██▉ | 196/681 [08:17<20:10, 2.50s/it] 29%|██▉ | 197/681 [08:20<20:18, 2.52s/it] 29%|██▉ | 198/681 [08:22<20:24, 2.54s/it] 29%|██▉ 
| 199/681 [08:25<20:43, 2.58s/it] 29%|██▉ | 200/681 [08:27<20:43, 2.59s/it] {'loss': 0.9767, 'grad_norm': 28.69991111755371, 'learning_rate': 4.4636895135509966e-07, 'fcm_dpo/beta': 0.007911969907581806, 'fcm_dpo/q_t': 0.365100622177124, 'fcm_dpo/delta': -0.04977406933903694, 'fcm_dpo/margin': 81.63540649414062, 'margin_dpo/margin_mean': 81.63540649414062, 'margin_dpo/margin_std': 101.0685806274414, 'logps/chosen': -130.88851928710938, 'logps/rejected': -239.95675659179688, 'logps/ref_chosen': -55.71953201293945, 'logps/ref_rejected': -83.15235137939453, 'logits/chosen': -0.3688076138496399, 'logits/rejected': -0.3557121157646179, 'epoch': 0.29} 29%|██▉ | 200/681 [08:27<20:43, 2.59s/it][INFO|trainer.py:4307] 2026-04-21 22:43:06,885 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-21 22:43:06,885 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-21 22:43:06,885 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-21 22:52:14,935 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-21 22:52:14,935 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-21 23:01:26,880 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-21 23:01:26,880 >> Batch size = 8 0%| | 0/73 [00:00> Training completed. 
Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1856.6581, 'train_samples_per_second': 23.482, 'train_steps_per_second': 0.367, 'train_loss': 0.9952153347312266, 'epoch': 1.0}
100%|██████████| 681/681 [30:53<00:00, 2.55s/it]
100%|██████████| 681/681 [30:53<00:00, 2.72s/it]
***** train metrics *****
  epoch                    = 1.0
  total_flos               = 0GF
  train_loss               = 0.9952
  train_runtime            = 0:30:56.65
  train_samples            = 43598
  train_samples_per_second = 23.482
  train_steps_per_second   = 0.367
2026-04-21 23:05:32 - INFO - __main__ - *** Training complete ***
2026-04-21 23:05:32 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-21 23:06:08,571 >> Configuration saved in /root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun/config.json
[INFO|configuration_utils.py:911] 2026-04-21 23:06:08,572 >> Configuration saved in /root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-21 23:06:36,066 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-21 23:06:36,069 >> tokenizer config file saved in /root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-21 23:06:36,070 >> Special tokens file saved in /root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun/special_tokens_map.json
2026-04-21 23:06:36 - INFO - __main__ - Saved HF-compatible model artifacts to /root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun
[INFO|modelcard.py:450] 2026-04-21 23:06:37,820 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}}
[INFO|configuration_utils.py:419] 2026-04-21 23:06:37,823 >> Configuration saved in /root/dynamic-dpo-v4/outputs/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun/config.json
2026-04-21 23:06:37 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-21 23:06:37,824 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-21 23:06:37,824 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-21 23:06:37,824 >> Batch size = 8
0%| | 0/73 [00:00