2026-04-18 09:13:06 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8') 2026-04-18 09:13:06 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/scratch/feng.yulu/dynamic-dpo-v4/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, disable_thinking=False, preprocessing_log_samples=0, preprocessing_log_dir=None) 2026-04-18 09:13:06 - INFO - __main__ - Training/evaluation parameters EpsilonDPOConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, beta=0.1, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_num_proc=12, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, 
disable_dropout=True, disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, epsilon=0.01, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=100, eval_strategy=IntervalStrategy.STEPS, eval_use_gather_object=False, f_alpha_divergence_coef=1.0, f_divergence_type=FDivergenceType.REVERSE_KL, force_use_ref_model=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generate_during_eval=False, gradient_accumulation_steps=2, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=W-61/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64, hub_model_revision=main, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, is_encoder_decoder=None, jit_mode_eval=False, label_names=None, label_pad_token_id=-100, label_smoothing=0.0, label_smoothing_factor=0.0, learning_rate=5e-07, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64/runs/Apr18_09-13-06_d4053, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=1, logging_strategy=IntervalStrategy.STEPS, loss_type=sigmoid, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, max_grad_norm=1.0, max_length=512, max_prompt_length=256, max_steps=-1, max_target_length=None, metric_for_best_model=None, model_adapter_name=None, model_init_kwargs=None, 
mp_parameters=, neftune_noise_alpha=None, no_cuda=False, non_finite_logits_handling=error, num_train_epochs=1, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, output_dir=/scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332, overwrite_output_dir=False, padding_value=None, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, post_tokenization_log_dir=None, post_tokenization_log_samples=0, precompute_ref_batch_size=None, precompute_ref_eval_batch_size=None, precompute_ref_log_probs=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, ref_adapter_name=None, ref_model_init_kwargs=None, ref_model_mixup_alpha=0.9, ref_model_sync_steps=64, reference_free=False, remove_unused_columns=False, report_to=['wandb'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, reuse_tokenized_dataset=True, rpo_alpha=None, run_name=mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=SaveStrategy.STEPS, save_total_limit=2, seed=42, sft_weight=0.0, skip_memory_metrics=True, sync_ref_model=False, tf32=None, tokenization_batch_size=128, tokenization_mode=online, tokenized_dataset_cache_dir=/scratch/feng.yulu/dynamic-dpo-v4/tokenized_preferences, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, trainer_type=epsilon_dpo, truncation_mode=keep_end, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, wandb_project=ood-run-4xh200, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, ) 2026-04-18 09:13:06 - INFO - __main__ - W&B project: ood-run-4xh200 2026-04-18 
09:13:06 - INFO - __main__ - Epsilon-DPO parameters: beta=0.1, epsilon=0.01, gradient_accumulation_steps=2 2026-04-18 09:13:06 - INFO - __main__ - Using persistent HF datasets cache at /scratch/feng.yulu/dynamic-dpo-v4/hf/datasets Normalizing raw HH preferences (train): 0%| | 0/43598 [00:00> loading file tokenizer.model [INFO|tokenization_utils_base.py:2058] 2026-04-18 09:13:14,127 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2026-04-18 09:13:14,127 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2058] 2026-04-18 09:13:14,127 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2058] 2026-04-18 09:13:14,127 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2058] 2026-04-18 09:13:14,127 >> loading file chat_template.jinja Normalizing raw HH preferences (test): 98%|█████████████████████████████████████████████████████████████▋ | 2289/2339 [00:00<00:00, 11556.98 examples/s] Normalizing raw HH preferences (test): 43%|███████████████████████████▎ | 1000/2339 [00:00<00:00, 9566.10 examples/s] Normalizing raw HH preferences (test): 100%|████████████████████████████████████████████████████████████████| 2339/2339 [00:00<00:00, 9002.76 examples/s] Normalizing raw HH preferences (test): 95%|███████████████████████████████████████████████████████████▊ | 2220/2339 [00:00<00:00, 11086.23 examples/s] Normalizing raw HH preferences (test): 100%|████████████████████████████████████████████████████████████████| 2339/2339 [00:00<00:00, 8890.42 examples/s] /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( 2026-04-18 09:13:14 - INFO - __main__ - Processed train sample 41905: Prompt: <|system|> <|user|> What is the best way to clean my refrigerator? 
<|assistant|> Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best. <|user|> What are your ideas? Chosen: <|assistant|> Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior. Rejected: <|assistant|> I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap. I was also thinking that you could try using different tools, like a toothbrush, to clean the inside. Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel. I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect. /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. 
warnings.warn( [INFO|configuration_utils.py:691] 2026-04-18 09:13:14,482 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332/config.json [INFO|configuration_utils.py:765] 2026-04-18 09:13:14,487 >> Model config MistralConfig { "architectures": [ "MistralForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 32768, "model_type": "mistral", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "rms_norm_eps": 1e-05, "rope_theta": 1000000.0, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 32768 } /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( [INFO|modeling_utils.py:1121] 2026-04-18 09:13:14,750 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-18 09:13:14,753 >> Instantiating MistralForCausalLM model under default dtype torch.bfloat16. [WARNING|logging.py:328] 2026-04-18 09:13:14,755 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. 
[INFO|configuration_utils.py:1142] 2026-04-18 09:13:14,756 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "use_cache": false } Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 152.88it/s] [WARNING|trainer.py:821] 2026-04-18 09:13:14,955 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:20<00:00, 3.41s/it] [INFO|modeling_utils.py:4926] 2026-04-18 09:13:35,258 >> All model checkpoint weights were used when initializing MistralForCausalLM. [INFO|modeling_utils.py:4934] 2026-04-18 09:13:35,258 >> All the weights of MistralForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332. If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2026-04-18 09:13:35,260 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-18 09:13:35,260 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2 } [INFO|configuration_utils.py:691] 2026-04-18 09:13:35,262 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332/config.json [INFO|configuration_utils.py:765] 2026-04-18 09:13:35,262 >> Model config MistralConfig { "architectures": [ "MistralForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 32768, "model_type": "mistral", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "rms_norm_eps": 1e-05, "rope_theta": 1000000.0, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 32768 } [INFO|modeling_utils.py:1121] 2026-04-18 09:13:35,264 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-18 09:13:35,264 >> Instantiating MistralForCausalLM model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1142] 2026-04-18 09:13:35,267 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "use_cache": false } Loading checkpoint shards: 0%| | 0/6 [00:00> All model checkpoint weights were used when initializing MistralForCausalLM. 
[INFO|modeling_utils.py:4934] 2026-04-18 09:13:47,615 >> All the weights of MistralForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332. If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training. [INFO|configuration_utils.py:1095] 2026-04-18 09:13:47,617 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-18 09:13:47,617 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2 } [WARNING|trainer.py:821] 2026-04-18 09:13:47,618 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:816] 2026-04-18 09:13:47,623 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Tokenizing train (num_proc=12): 0%| | 0/43598 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Saving the dataset (0/2 shards): 0%| | 0/43598 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Tokenizing test (num_proc=12): 0%| | 0/2339 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Saving the dataset (0/1 shards): 0%| | 0/2339 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-18 09:31:02,080 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-18 09:31:02,081 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-18 09:31:02,239 >> Trainer.tokenizer is now deprecated. 
You should use Trainer.processing_class instead. /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:521: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `EpsilonDPOTrainer.__init__`. Use `processing_class` instead.
super().__init__( [INFO|trainer.py:748] 2026-04-18 09:31:02,395 >> Using auto half precision backend /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in MistralForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight. warnings.warn( /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in MistralDecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight. warnings.warn( /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints. warnings.warn( [INFO|trainer.py:2414] 2026-04-18 09:31:13,913 >> ***** Running training ***** [INFO|trainer.py:2415] 2026-04-18 09:31:13,913 >> Num examples = 43,598 [INFO|trainer.py:2416] 2026-04-18 09:31:13,913 >> Num Epochs = 1 [INFO|trainer.py:2417] 2026-04-18 09:31:13,913 >> Instantaneous batch size per device = 8 [INFO|trainer.py:2420] 2026-04-18 09:31:13,913 >> Total train batch size (w. parallel, distributed & accumulation) = 64 [INFO|trainer.py:2421] 2026-04-18 09:31:13,913 >> Gradient Accumulation steps = 2 [INFO|trainer.py:2422] 2026-04-18 09:31:13,913 >> Total optimization steps = 681 [INFO|trainer.py:2423] 2026-04-18 09:31:13,914 >> Number of trainable parameters = 1,812,005,888 [INFO|integration_utils.py:831] 2026-04-18 09:31:13,914 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: can-not-fand (can-not-fand-northeastern-university). 
Use `wandb login --relogin` to force relogin wandb: wandb version 0.26.0 is available! To upgrade, please run: wandb: $ pip install wandb --upgrade wandb: Tracking run with wandb version 0.17.5 wandb: Run data is saved locally in /scratch/feng.yulu/dynamic-dpo-v4/wandb/wandb/run-20260418_093116-7dwx2wac wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332 wandb: ⭐️ View project at https://wandb.ai/can-not-fand-northeastern-university/ood-run-4xh200 wandb: 🚀 View run at https://wandb.ai/can-not-fand-northeastern-university/ood-run-4xh200/runs/7dwx2wac 0%| | 0/681 [00:00> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-18 09:31:23,467 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-18 09:31:23,471 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-18 09:31:23,484 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▏ | 1/681 [00:02<30:36, 2.70s/it] {'loss': 1.3843, 'grad_norm': 153.6117401123047, 'learning_rate': 0.0, 'rewards/chosen': 0.0014821073273196816, 'rewards/rejected': -0.0007488295086659491, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.002230936661362648, 'logps/chosen': -31.484785079956055, 'logps/rejected': -79.35993194580078, 'logps/ref_chosen': -31.500404357910156, 'logps/ref_rejected': -79.35133361816406, 'logits/chosen': -3.435748338699341, 'logits/rejected': -3.461001396179199, 'kl/p_epsilon_steps': 0.609375, 'kl/n_epsilon_steps': 0.390625, 'epsilon_dpo/beta': 0.09979122877120972, 'epsilon_dpo/loss_margin_mean': 0.024216145277023315, 'epsilon_dpo/beta_margin_mean': 0.0022308863699436188, 'epsilon_dpo/beta_margin_std': 
0.028449052944779396, 'epsilon_dpo/beta_margin_grad_mean': -0.4994426667690277, 'epsilon_dpo/beta_margin_grad_std': 0.007109890226274729, 'kl/beta': 0.10000000149011612, 'kl/avg_steps': 0.21875, 'epoch': 0.0} 0%|▏ | 1/681 [00:02<30:36, 2.70s/it] 0%|▎ | 2/681 [00:05<29:49, 2.64s/it] {'loss': 1.3875, 'grad_norm': 155.6954803466797, 'learning_rate': 7.246376811594203e-09, 'rewards/chosen': -8.885323768481612e-05, 'rewards/rejected': 0.0009472252568230033, 'rewards/accuracies': 0.546875, 'rewards/margins': -0.0010360784363001585, 'logps/chosen': -36.63703918457031, 'logps/rejected': -80.44050598144531, 'logps/ref_chosen': -36.63695526123047, 'logps/ref_rejected': -80.44895935058594, 'logits/chosen': -3.517317533493042, 'logits/rejected': -3.4876961708068848, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'epsilon_dpo/beta': 0.09972933679819107, 'epsilon_dpo/loss_margin_mean': -0.008541211485862732, 'epsilon_dpo/beta_margin_mean': -0.0010360678425058722, 'epsilon_dpo/beta_margin_std': 0.024278299883008003, 'epsilon_dpo/beta_margin_grad_mean': -0.5002588629722595, 'epsilon_dpo/beta_margin_grad_std': 0.006068441551178694, 'kl/beta': 0.09978172928094864, 'kl/avg_steps': 0.0625, 'epoch': 0.0} 0%|▎ | 2/681 [00:05<29:49, 2.64s/it] 0%|▌ | 3/681 [00:08<30:30, 2.70s/it] {'loss': 1.3817, 'grad_norm': 160.3519744873047, 'learning_rate': 1.4492753623188406e-08, 'rewards/chosen': 0.004386060871183872, 'rewards/rejected': -0.00041415661689825356, 'rewards/accuracies': 0.515625, 'rewards/margins': 0.004800218157470226, 'logps/chosen': -37.792213439941406, 'logps/rejected': -73.14955139160156, 'logps/ref_chosen': -37.83708190917969, 'logps/ref_rejected': -73.14408874511719, 'logits/chosen': -3.5040597915649414, 'logits/rejected': -3.4887866973876953, 'kl/p_epsilon_steps': 0.5, 'kl/n_epsilon_steps': 0.484375, 'epsilon_dpo/beta': 0.09971363842487335, 'epsilon_dpo/loss_margin_mean': 0.050338417291641235, 'epsilon_dpo/beta_margin_mean': 0.004800259135663509, 
'epsilon_dpo/beta_margin_std': 0.028725432232022285, 'epsilon_dpo/beta_margin_grad_mean': -0.4988003969192505, 'epsilon_dpo/beta_margin_grad_std': 0.007179732900112867, 'kl/beta': 0.09971940517425537, 'kl/avg_steps': 0.015625, 'epoch': 0.0} 0%|▌ | 3/681 [00:08<30:30, 2.70s/it] 1%|▋ | 4/681 [00:10<31:09, 2.76s/it] {'loss': 1.3843, 'grad_norm': 155.65963745117188, 'learning_rate': 2.1739130434782606e-08, 'rewards/chosen': -0.000430062209488824, 'rewards/rejected': -0.0025970228016376495, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.002166960621252656, 'logps/chosen': -43.339508056640625, 'logps/rejected': -93.78823852539062, 'logps/ref_chosen': -43.336036682128906, 'logps/ref_rejected': -93.7607650756836, 'logits/chosen': -3.468817710876465, 'logits/rejected': -3.508214235305786, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.4375, 'epsilon_dpo/beta': 0.09960457682609558, 'epsilon_dpo/loss_margin_mean': 0.02400125563144684, 'epsilon_dpo/beta_margin_mean': 0.002166971331462264, 'epsilon_dpo/beta_margin_std': 0.029604651033878326, 'epsilon_dpo/beta_margin_grad_mean': -0.4994582235813141, 'epsilon_dpo/beta_margin_grad_std': 0.007399262860417366, 'kl/beta': 0.0997038260102272, 'kl/avg_steps': 0.109375, 'epoch': 0.01} 1%|▋ | 4/681 [00:10<31:09, 2.76s/it] 1%|▊ | 5/681 [00:13<31:00, 2.75s/it] {'loss': 1.3905, 'grad_norm': 179.6461639404297, 'learning_rate': 2.898550724637681e-08, 'rewards/chosen': -0.0026276579592376947, 'rewards/rejected': 0.0012790121836587787, 'rewards/accuracies': 0.484375, 'rewards/margins': -0.003906670026481152, 'logps/chosen': -32.93244934082031, 'logps/rejected': -90.29646301269531, 'logps/ref_chosen': -32.90675354003906, 'logps/ref_rejected': -90.30764770507812, 'logits/chosen': -3.436485767364502, 'logits/rejected': -3.4411563873291016, 'kl/p_epsilon_steps': 0.484375, 'kl/n_epsilon_steps': 0.515625, 'epsilon_dpo/beta': 0.09963598102331161, 'epsilon_dpo/loss_margin_mean': -0.036883413791656494, 'epsilon_dpo/beta_margin_mean': 
-0.003906682599335909, 'epsilon_dpo/beta_margin_std': 0.033521223813295364, 'epsilon_dpo/beta_margin_grad_mean': -0.5009759068489075, 'epsilon_dpo/beta_margin_grad_std': 0.008376221172511578, 'kl/beta': 0.09959489107131958, 'kl/avg_steps': -0.03125, 'epoch': 0.01} 1%|▊ | 5/681 [00:13<31:00, 2.75s/it] 1%|█ | 6/681 [00:16<29:25, 2.62s/it] {'loss': 1.3837, 'grad_norm': 168.01638793945312, 'learning_rate': 3.6231884057971014e-08, 'rewards/chosen': -0.0019228225573897362, 'rewards/rejected': -0.004670892842113972, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.0027480702847242355, 'logps/chosen': -40.59587478637695, 'logps/rejected': -98.52117919921875, 'logps/ref_chosen': -40.57701110839844, 'logps/ref_rejected': -98.47296142578125, 'logits/chosen': -3.476027011871338, 'logits/rejected': -3.391662836074829, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'epsilon_dpo/beta': 0.09954258054494858, 'epsilon_dpo/loss_margin_mean': 0.029358193278312683, 'epsilon_dpo/beta_margin_mean': 0.002748197177425027, 'epsilon_dpo/beta_margin_std': 0.025408554822206497, 'epsilon_dpo/beta_margin_grad_mean': -0.4993135929107666, 'epsilon_dpo/beta_margin_grad_std': 0.006349037401378155, 'kl/beta': 0.09962602704763412, 'kl/avg_steps': 0.09375, 'epoch': 0.01} 1%|█ | 6/681 [00:16<29:25, 2.62s/it] 1%|█▏ | 7/681 [00:18<28:56, 2.58s/it] {'loss': 1.3862, 'grad_norm': 165.30645751953125, 'learning_rate': 4.347826086956521e-08, 'rewards/chosen': 0.002097110729664564, 'rewards/rejected': 0.0017947049345821142, 'rewards/accuracies': 0.515625, 'rewards/margins': 0.0003024058823939413, 'logps/chosen': -41.3924674987793, 'logps/rejected': -111.000244140625, 'logps/ref_chosen': -41.414642333984375, 'logps/ref_rejected': -111.01716613769531, 'logits/chosen': -3.4681951999664307, 'logits/rejected': -3.4873459339141846, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375, 'epsilon_dpo/beta': 0.09951155632734299, 'epsilon_dpo/loss_margin_mean': 0.00525471568107605, 
'epsilon_dpo/beta_margin_mean': 0.0003023834724444896, 'epsilon_dpo/beta_margin_std': 0.028604645282030106, 'epsilon_dpo/beta_margin_grad_mean': -0.49992430210113525, 'epsilon_dpo/beta_margin_grad_std': 0.007149491459131241, 'kl/beta': 0.09953271597623825, 'kl/avg_steps': 0.03125, 'epoch': 0.01} 1%|█▏ | 7/681 [00:18<28:56, 2.58s/it] 1%|█▎ | 8/681 [00:21<28:51, 2.57s/it] {'loss': 1.3941, 'grad_norm': 170.1224822998047, 'learning_rate': 5.0724637681159424e-08, 'rewards/chosen': -0.0024497562553733587, 'rewards/rejected': 0.005181857850402594, 'rewards/accuracies': 0.328125, 'rewards/margins': -0.007631613872945309, 'logps/chosen': -39.27941131591797, 'logps/rejected': -85.89848327636719, 'logps/ref_chosen': -39.25566482543945, 'logps/ref_rejected': -85.94947814941406, 'logits/chosen': -3.5701963901519775, 'logits/rejected': -3.46230411529541, 'kl/p_epsilon_steps': 0.328125, 'kl/n_epsilon_steps': 0.65625, 'epsilon_dpo/beta': 0.09983792901039124, 'epsilon_dpo/loss_margin_mean': -0.07474762201309204, 'epsilon_dpo/beta_margin_mean': -0.007631635759025812, 'epsilon_dpo/beta_margin_std': 0.02497241646051407, 'epsilon_dpo/beta_margin_grad_mean': -0.5019076466560364, 'epsilon_dpo/beta_margin_grad_std': 0.006241849157959223, 'kl/beta': 0.09950161725282669, 'kl/avg_steps': -0.328125, 'epoch': 0.01} 1%|█▎ | 8/681 [00:21<28:51, 2.57s/it] 1%|█▌ | 9/681 [00:23<29:28, 2.63s/it] {'loss': 1.3867, 'grad_norm': 179.10533142089844, 'learning_rate': 5.797101449275362e-08, 'rewards/chosen': -4.6280911192297935e-07, 'rewards/rejected': 0.00030112662352621555, 'rewards/accuracies': 0.453125, 'rewards/margins': -0.0003015893744304776, 'logps/chosen': -38.925682067871094, 'logps/rejected': -106.42868041992188, 'logps/ref_chosen': -38.9265251159668, 'logps/ref_rejected': -106.43075561523438, 'logits/chosen': -3.526092529296875, 'logits/rejected': -3.532437801361084, 'kl/p_epsilon_steps': 0.5, 'kl/n_epsilon_steps': 0.5, 'epsilon_dpo/beta': 0.09983916580677032, 'epsilon_dpo/loss_margin_mean': 
-0.0012376010417938232, 'epsilon_dpo/beta_margin_mean': -0.0003015802940353751, 'epsilon_dpo/beta_margin_std': 0.02405872382223606, 'epsilon_dpo/beta_margin_grad_mean': -0.5000754594802856, 'epsilon_dpo/beta_margin_grad_std': 0.006013626232743263, 'kl/beta': 0.09982918202877045, 'kl/avg_steps': 0.0, 'epoch': 0.01} 1%|█▌ | 9/681 [00:23<29:28, 2.63s/it] 1%|█▋ | 10/681 [00:26<29:56, 2.68s/it] {'loss': 1.3838, 'grad_norm': 161.84414672851562, 'learning_rate': 6.521739130434782e-08, 'rewards/chosen': 0.0009347056620754302, 'rewards/rejected': -0.0017286810325458646, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.0026633867528289557, 'logps/chosen': -37.931678771972656, 'logps/rejected': -84.44198608398438, 'logps/ref_chosen': -37.94172668457031, 'logps/ref_rejected': -84.4234848022461, 'logits/chosen': -3.487415313720703, 'logits/rejected': -3.4311609268188477, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.4375, 'epsilon_dpo/beta': 0.09972980618476868, 'epsilon_dpo/loss_margin_mean': 0.028553977608680725, 'epsilon_dpo/beta_margin_mean': 0.0026634030509740114, 'epsilon_dpo/beta_margin_std': 0.02231536991894245, 'epsilon_dpo/beta_margin_grad_mean': -0.49933433532714844, 'epsilon_dpo/beta_margin_grad_std': 0.005578126758337021, 'kl/beta': 0.09982918202877045, 'kl/avg_steps': 0.109375, 'epoch': 0.01} 1%|█▋ | 10/681 [00:26<29:56, 2.68s/it] 2%|█▊ | 11/681 [00:29<30:11, 2.70s/it] {'loss': 1.3775, 'grad_norm': 167.265380859375, 'learning_rate': 7.246376811594203e-08, 'rewards/chosen': 0.0010777543066069484, 'rewards/rejected': -0.007868202403187752, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.008945956826210022, 'logps/chosen': -32.73089599609375, 'logps/rejected': -94.19029998779297, 'logps/ref_chosen': -32.742462158203125, 'logps/ref_rejected': -94.11013793945312, 'logits/chosen': -3.4978814125061035, 'logits/rejected': -3.5033230781555176, 'kl/p_epsilon_steps': 0.59375, 'kl/n_epsilon_steps': 0.40625, 'epsilon_dpo/beta': 0.09954308718442917, 
'epsilon_dpo/loss_margin_mean': 0.09173570573329926, 'epsilon_dpo/beta_margin_mean': 0.008945857174694538, 'epsilon_dpo/beta_margin_std': 0.02535552717745304, 'epsilon_dpo/beta_margin_grad_mean': -0.49776411056518555, 'epsilon_dpo/beta_margin_grad_std': 0.006337402854114771, 'kl/beta': 0.09972011297941208, 'kl/avg_steps': 0.1875, 'epoch': 0.02} 2%|█▊ | 11/681 [00:29<30:11, 2.70s/it] 2%|██ | 12/681 [00:32<30:01, 2.69s/it] {'loss': 1.3817, 'grad_norm': 184.79959106445312, 'learning_rate': 7.971014492753623e-08, 'rewards/chosen': -0.0023162683937698603, 'rewards/rejected': -0.007148245815187693, 'rewards/accuracies': 0.625, 'rewards/margins': 0.004831977654248476, 'logps/chosen': -43.87688446044922, 'logps/rejected': -111.80323791503906, 'logps/ref_chosen': -43.85453796386719, 'logps/ref_rejected': -111.72984313964844, 'logits/chosen': -3.5019748210906982, 'logits/rejected': -3.507927417755127, 'kl/p_epsilon_steps': 0.625, 'kl/n_epsilon_steps': 0.375, 'epsilon_dpo/beta': 0.09929458051919937, 'epsilon_dpo/loss_margin_mean': 0.05105578899383545, 'epsilon_dpo/beta_margin_mean': 0.004831877537071705, 'epsilon_dpo/beta_margin_std': 0.03315826877951622, 'epsilon_dpo/beta_margin_grad_mean': -0.49879196286201477, 'epsilon_dpo/beta_margin_grad_std': 0.008286659605801105, 'kl/beta': 0.09953349083662033, 'kl/avg_steps': 0.25, 'epoch': 0.02} 2%|██ | 12/681 [00:32<30:01, 2.69s/it] 2%|██▏ | 13/681 [00:34<30:03, 2.70s/it] {'loss': 1.3832, 'grad_norm': 170.2859344482422, 'learning_rate': 8.695652173913042e-08, 'rewards/chosen': 0.0017041495302692056, 'rewards/rejected': -0.0015271796146407723, 'rewards/accuracies': 0.484375, 'rewards/margins': 0.0032313289120793343, 'logps/chosen': -41.653934478759766, 'logps/rejected': -91.1076889038086, 'logps/ref_chosen': -41.67176818847656, 'logps/ref_rejected': -91.09086608886719, 'logits/chosen': -3.540365219116211, 'logits/rejected': -3.4680285453796387, 'kl/p_epsilon_steps': 0.46875, 'kl/n_epsilon_steps': 0.53125, 'epsilon_dpo/beta': 
0.09935726225376129, 'epsilon_dpo/loss_margin_mean': 0.034654468297958374, 'epsilon_dpo/beta_margin_mean': 0.00323132099583745, 'epsilon_dpo/beta_margin_std': 0.026057543233036995, 'epsilon_dpo/beta_margin_grad_mean': -0.49919238686561584, 'epsilon_dpo/beta_margin_grad_std': 0.006513380445539951, 'kl/beta': 0.09928527474403381, 'kl/avg_steps': -0.0625, 'epoch': 0.02} 2%|██▏ | 13/681 [00:34<30:03, 2.70s/it] 2%|██▎ | 14/681 [00:37<29:32, 2.66s/it] {'loss': 1.3668, 'grad_norm': 191.28602600097656, 'learning_rate': 9.420289855072464e-08, 'rewards/chosen': 0.0063568041659891605, 'rewards/rejected': -0.01342916302382946, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.019785966724157333, 'logps/chosen': -32.60367202758789, 'logps/rejected': -111.98973846435547, 'logps/ref_chosen': -32.668601989746094, 'logps/ref_rejected': -111.8526611328125, 'logits/chosen': -3.5063905715942383, 'logits/rejected': -3.5407614707946777, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'epsilon_dpo/beta': 0.09892261028289795, 'epsilon_dpo/loss_margin_mean': 0.20201179385185242, 'epsilon_dpo/beta_margin_mean': 0.01978609710931778, 'epsilon_dpo/beta_margin_std': 0.03049391135573387, 'epsilon_dpo/beta_margin_grad_mean': -0.49505484104156494, 'epsilon_dpo/beta_margin_grad_std': 0.007620512507855892, 'kl/beta': 0.09934736788272858, 'kl/avg_steps': 0.4375, 'epoch': 0.02} 2%|██▎ | 14/681 [00:37<29:32, 2.66s/it] 2%|██▌ | 15/681 [00:40<29:35, 2.67s/it] {'loss': 1.3699, 'grad_norm': 144.098388671875, 'learning_rate': 1.0144927536231885e-07, 'rewards/chosen': -0.0012117127189412713, 'rewards/rejected': -0.01782957650721073, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.01661786250770092, 'logps/chosen': -41.79482650756836, 'logps/rejected': -86.60433197021484, 'logps/ref_chosen': -41.78295135498047, 'logps/ref_rejected': -86.42213439941406, 'logits/chosen': -3.4299309253692627, 'logits/rejected': -3.4826550483703613, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.25, 
'epsilon_dpo/beta': 0.09844519197940826, 'epsilon_dpo/loss_margin_mean': 0.17031680047512054, 'epsilon_dpo/beta_margin_mean': 0.016617843881249428, 'epsilon_dpo/beta_margin_std': 0.025397835299372673, 'epsilon_dpo/beta_margin_grad_mean': -0.4958465099334717, 'epsilon_dpo/beta_margin_grad_std': 0.006347531918436289, 'kl/beta': 0.09891461580991745, 'kl/avg_steps': 0.484375, 'epoch': 0.02} 2%|██▌ | 15/681 [00:40<29:35, 2.67s/it] 2%|██▋ | 16/681 [00:42<29:29, 2.66s/it] {'loss': 1.37, 'grad_norm': 150.39041137695312, 'learning_rate': 1.0869565217391303e-07, 'rewards/chosen': 0.0014686340000480413, 'rewards/rejected': -0.015126017853617668, 'rewards/accuracies': 0.75, 'rewards/margins': 0.016594652086496353, 'logps/chosen': -40.885353088378906, 'logps/rejected': -89.05500793457031, 'logps/ref_chosen': -40.9011116027832, 'logps/ref_rejected': -88.89961242675781, 'logits/chosen': -3.477813959121704, 'logits/rejected': -3.4168307781219482, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.09795541316270828, 'epsilon_dpo/loss_margin_mean': 0.1711507886648178, 'epsilon_dpo/beta_margin_mean': 0.016594676300883293, 'epsilon_dpo/beta_margin_std': 0.029557235538959503, 'epsilon_dpo/beta_margin_grad_mean': -0.4958525598049164, 'epsilon_dpo/beta_margin_grad_std': 0.007386185694485903, 'kl/beta': 0.09843780845403671, 'kl/avg_steps': 0.5, 'epoch': 0.02} 2%|██▋ | 16/681 [00:42<29:29, 2.66s/it] 2%|██▊ | 17/681 [00:45<29:14, 2.64s/it] {'loss': 1.3536, 'grad_norm': 171.44161987304688, 'learning_rate': 1.1594202898550725e-07, 'rewards/chosen': 0.00957987830042839, 'rewards/rejected': -0.023779571056365967, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.03335944563150406, 'logps/chosen': -38.45612335205078, 'logps/rejected': -91.90687561035156, 'logps/ref_chosen': -38.555274963378906, 'logps/ref_rejected': -91.66143798828125, 'logits/chosen': -3.526141405105591, 'logits/rejected': -3.4046664237976074, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 
0.1875, 'epsilon_dpo/beta': 0.09734562784433365, 'epsilon_dpo/loss_margin_mean': 0.34458500146865845, 'epsilon_dpo/beta_margin_mean': 0.033359427005052567, 'epsilon_dpo/beta_margin_std': 0.037631068378686905, 'epsilon_dpo/beta_margin_grad_mean': -0.49166470766067505, 'epsilon_dpo/beta_margin_grad_std': 0.009398871101439, 'kl/beta': 0.09794806689023972, 'kl/avg_steps': 0.625, 'epoch': 0.02} 2%|██▊ | 17/681 [00:45<29:14, 2.64s/it] 3%|███ | 18/681 [00:47<29:11, 2.64s/it] {'loss': 1.3496, 'grad_norm': 174.4280242919922, 'learning_rate': 1.2318840579710146e-07, 'rewards/chosen': 0.008510860614478588, 'rewards/rejected': -0.028904292732477188, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.03741515427827835, 'logps/chosen': -26.462953567504883, 'logps/rejected': -82.455078125, 'logps/ref_chosen': -26.55130386352539, 'logps/ref_rejected': -82.15496063232422, 'logits/chosen': -3.457716464996338, 'logits/rejected': -3.3878774642944336, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.09655846655368805, 'epsilon_dpo/loss_margin_mean': 0.3884708285331726, 'epsilon_dpo/beta_margin_mean': 0.03741518035531044, 'epsilon_dpo/beta_margin_std': 0.03761782497167587, 'epsilon_dpo/beta_margin_grad_mean': -0.4906516373157501, 'epsilon_dpo/beta_margin_grad_std': 0.009394010528922081, 'kl/beta': 0.0973396971821785, 'kl/avg_steps': 0.8125, 'epoch': 0.03} 3%|███ | 18/681 [00:47<29:11, 2.64s/it] 3%|███▏ | 19/681 [00:50<28:50, 2.61s/it] {'loss': 1.3362, 'grad_norm': 158.5471649169922, 'learning_rate': 1.3043478260869563e-07, 'rewards/chosen': 0.009450599551200867, 'rewards/rejected': -0.04208826646208763, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.0515388660132885, 'logps/chosen': -51.44847869873047, 'logps/rejected': -96.42372131347656, 'logps/ref_chosen': -51.548377990722656, 'logps/ref_rejected': -95.98385620117188, 'logits/chosen': -3.5245118141174316, 'logits/rejected': -3.4201385974884033, 'kl/p_epsilon_steps': 0.8125, 
'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0959613099694252, 'epsilon_dpo/loss_margin_mean': 0.5397706031799316, 'epsilon_dpo/beta_margin_mean': 0.05153882876038551, 'epsilon_dpo/beta_margin_std': 0.05628020688891411, 'epsilon_dpo/beta_margin_grad_mean': -0.4871312379837036, 'epsilon_dpo/beta_margin_grad_std': 0.014039521105587482, 'kl/beta': 0.0965551808476448, 'kl/avg_steps': 0.625, 'epoch': 0.03} 3%|███▏ | 19/681 [00:50<28:50, 2.61s/it] 3%|███▍ | 20/681 [00:53<28:56, 2.63s/it] {'loss': 1.3324, 'grad_norm': 148.6184844970703, 'learning_rate': 1.3768115942028986e-07, 'rewards/chosen': 0.012646486982703209, 'rewards/rejected': -0.042639512568712234, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.055285997688770294, 'logps/chosen': -32.44221878051758, 'logps/rejected': -84.17327880859375, 'logps/ref_chosen': -32.57563781738281, 'logps/ref_rejected': -83.72441101074219, 'logits/chosen': -3.5266153812408447, 'logits/rejected': -3.454347848892212, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.09518533945083618, 'epsilon_dpo/loss_margin_mean': 0.5822827816009521, 'epsilon_dpo/beta_margin_mean': 0.05528602749109268, 'epsilon_dpo/beta_margin_std': 0.05034392327070236, 'epsilon_dpo/beta_margin_grad_mean': -0.4861924350261688, 'epsilon_dpo/beta_margin_grad_std': 0.01256165187805891, 'kl/beta': 0.09595546126365662, 'kl/avg_steps': 0.8125, 'epoch': 0.03} 3%|███▍ | 20/681 [00:53<28:56, 2.63s/it] 3%|███▌ | 21/681 [00:55<28:40, 2.61s/it] {'loss': 1.3294, 'grad_norm': 138.44097900390625, 'learning_rate': 1.4492753623188405e-07, 'rewards/chosen': 0.007934953086078167, 'rewards/rejected': -0.05056622251868248, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.05850117653608322, 'logps/chosen': -36.161651611328125, 'logps/rejected': -83.22535705566406, 'logps/ref_chosen': -36.24628448486328, 'logps/ref_rejected': -82.68882751464844, 'logits/chosen': -3.416736602783203, 'logits/rejected': -3.4210104942321777, 'kl/p_epsilon_steps': 
0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.09435869753360748, 'epsilon_dpo/loss_margin_mean': 0.6211739182472229, 'epsilon_dpo/beta_margin_mean': 0.058501191437244415, 'epsilon_dpo/beta_margin_std': 0.054356660693883896, 'epsilon_dpo/beta_margin_grad_mean': -0.485392689704895, 'epsilon_dpo/beta_margin_grad_std': 0.01355255488306284, 'kl/beta': 0.09518210589885712, 'kl/avg_steps': 0.875, 'epoch': 0.03} 3%|███▌ | 21/681 [00:55<28:40, 2.61s/it] 3%|███▋ | 22/681 [00:58<29:09, 2.66s/it] {'loss': 1.3155, 'grad_norm': 163.26019287109375, 'learning_rate': 1.5217391304347825e-07, 'rewards/chosen': 0.007986144162714481, 'rewards/rejected': -0.06498903036117554, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.07297517359256744, 'logps/chosen': -44.6229248046875, 'logps/rejected': -104.12324523925781, 'logps/ref_chosen': -44.70884704589844, 'logps/ref_rejected': -103.42787170410156, 'logits/chosen': -3.4850616455078125, 'logits/rejected': -3.4672698974609375, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.09351073950529099, 'epsilon_dpo/loss_margin_mean': 0.7812974452972412, 'epsilon_dpo/beta_margin_mean': 0.07297518849372864, 'epsilon_dpo/beta_margin_std': 0.05880572646856308, 'epsilon_dpo/beta_margin_grad_mean': -0.48178407549858093, 'epsilon_dpo/beta_margin_grad_std': 0.01465072762221098, 'kl/beta': 0.0943564921617508, 'kl/avg_steps': 0.90625, 'epoch': 0.03} 3%|███▋ | 22/681 [00:58<29:09, 2.66s/it] 3%|███▉ | 23/681 [01:01<29:18, 2.67s/it] {'loss': 1.3054, 'grad_norm': 153.97116088867188, 'learning_rate': 1.5942028985507245e-07, 'rewards/chosen': 0.013693436048924923, 'rewards/rejected': -0.07020558416843414, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.08389902114868164, 'logps/chosen': -40.120094299316406, 'logps/rejected': -86.60855102539062, 'logps/ref_chosen': -40.26862335205078, 'logps/ref_rejected': -85.85059356689453, 'logits/chosen': -3.4643330574035645, 'logits/rejected': -3.442781686782837, 
'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.09270013123750687, 'epsilon_dpo/loss_margin_mean': 0.9064887762069702, 'epsilon_dpo/beta_margin_mean': 0.08389898389577866, 'epsilon_dpo/beta_margin_std': 0.07160933315753937, 'epsilon_dpo/beta_margin_grad_mean': -0.47907260060310364, 'epsilon_dpo/beta_margin_grad_std': 0.01780831441283226, 'kl/beta': 0.09350906312465668, 'kl/avg_steps': 0.875, 'epoch': 0.03} 3%|███▉ | 23/681 [01:01<29:18, 2.67s/it] 4%|████ | 24/681 [01:03<29:05, 2.66s/it] {'loss': 1.2883, 'grad_norm': 161.09317016601562, 'learning_rate': 1.6666666666666665e-07, 'rewards/chosen': 0.015648098662495613, 'rewards/rejected': -0.08636436611413956, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.10201247036457062, 'logps/chosen': -27.038997650146484, 'logps/rejected': -107.220458984375, 'logps/ref_chosen': -27.20970916748047, 'logps/ref_rejected': -106.27947998046875, 'logits/chosen': -3.4035227298736572, 'logits/rejected': -3.4438557624816895, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0919250100851059, 'epsilon_dpo/loss_margin_mean': 1.1116865873336792, 'epsilon_dpo/beta_margin_mean': 0.10201232135295868, 'epsilon_dpo/beta_margin_std': 0.07649125903844833, 'epsilon_dpo/beta_margin_grad_mean': -0.4745597541332245, 'epsilon_dpo/beta_margin_grad_std': 0.019036216661334038, 'kl/beta': 0.09269795566797256, 'kl/avg_steps': 0.84375, 'epoch': 0.04} 4%|████ | 24/681 [01:03<29:05, 2.66s/it] 4%|████▏ | 25/681 [01:06<29:23, 2.69s/it] {'loss': 1.2937, 'grad_norm': 128.98117065429688, 'learning_rate': 1.7391304347826085e-07, 'rewards/chosen': 0.004450969398021698, 'rewards/rejected': -0.0927439033985138, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.09719488024711609, 'logps/chosen': -36.42045593261719, 'logps/rejected': -94.91465759277344, 'logps/ref_chosen': -36.47064208984375, 'logps/ref_rejected': -93.89593505859375, 'logits/chosen': -3.487727403640747, 'logits/rejected': 
-3.466651439666748, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.09127078950405121, 'epsilon_dpo/loss_margin_mean': 1.0689027309417725, 'epsilon_dpo/beta_margin_mean': 0.09719490259885788, 'epsilon_dpo/beta_margin_std': 0.09393724054098129, 'epsilon_dpo/beta_margin_grad_mean': -0.47578126192092896, 'epsilon_dpo/beta_margin_grad_std': 0.023359432816505432, 'kl/beta': 0.09192235767841339, 'kl/avg_steps': 0.71875, 'epoch': 0.04} 4%|████▏ | 25/681 [01:06<29:23, 2.69s/it] 4%|████▍ | 26/681 [01:08<28:26, 2.61s/it] {'loss': 1.2385, 'grad_norm': 150.258056640625, 'learning_rate': 1.8115942028985507e-07, 'rewards/chosen': 0.010786900296807289, 'rewards/rejected': -0.15054798126220703, 'rewards/accuracies': 0.875, 'rewards/margins': 0.16133487224578857, 'logps/chosen': -39.704803466796875, 'logps/rejected': -110.67877197265625, 'logps/ref_chosen': -39.82624816894531, 'logps/ref_rejected': -109.0130615234375, 'logits/chosen': -3.4571170806884766, 'logits/rejected': -3.5464980602264404, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0905909463763237, 'epsilon_dpo/loss_margin_mean': 1.7871633768081665, 'epsilon_dpo/beta_margin_mean': 0.16133488714694977, 'epsilon_dpo/beta_margin_std': 0.1687132865190506, 'epsilon_dpo/beta_margin_grad_mean': -0.4600984454154968, 'epsilon_dpo/beta_margin_grad_std': 0.04119768738746643, 'kl/beta': 0.09126638621091843, 'kl/avg_steps': 0.75, 'epoch': 0.04} 4%|████▍ | 26/681 [01:08<28:26, 2.61s/it] 4%|████▌ | 27/681 [01:11<28:08, 2.58s/it] {'loss': 1.2104, 'grad_norm': 152.13011169433594, 'learning_rate': 1.8840579710144927e-07, 'rewards/chosen': 0.012909766286611557, 'rewards/rejected': -0.17539237439632416, 'rewards/accuracies': 0.984375, 'rewards/margins': 0.18830212950706482, 'logps/chosen': -19.790260314941406, 'logps/rejected': -109.28077697753906, 'logps/ref_chosen': -19.93426513671875, 'logps/ref_rejected': -107.32525634765625, 'logits/chosen': -3.417879104614258, 
'logits/rejected': -3.4293429851531982, 'kl/p_epsilon_steps': 0.984375, 'kl/n_epsilon_steps': 0.015625, 'epsilon_dpo/beta': 0.08971839398145676, 'epsilon_dpo/loss_margin_mean': 2.099519729614258, 'epsilon_dpo/beta_margin_mean': 0.1883021891117096, 'epsilon_dpo/beta_margin_std': 0.1207134947180748, 'epsilon_dpo/beta_margin_grad_mean': -0.4532409906387329, 'epsilon_dpo/beta_margin_grad_std': 0.02980455383658409, 'kl/beta': 0.09058698266744614, 'kl/avg_steps': 0.96875, 'epoch': 0.04} 4%|████▌ | 27/681 [01:11<28:08, 2.58s/it] 4%|████▋ | 28/681 [01:14<28:14, 2.60s/it] {'loss': 1.2081, 'grad_norm': 144.26730346679688, 'learning_rate': 1.9565217391304347e-07, 'rewards/chosen': 0.02439035288989544, 'rewards/rejected': -0.1676751971244812, 'rewards/accuracies': 0.96875, 'rewards/margins': 0.19206556677818298, 'logps/chosen': -43.32648849487305, 'logps/rejected': -98.03630065917969, 'logps/ref_chosen': -43.6025390625, 'logps/ref_rejected': -96.1494140625, 'logits/chosen': -3.5261058807373047, 'logits/rejected': -3.475046157836914, 'kl/p_epsilon_steps': 0.96875, 'kl/n_epsilon_steps': 0.03125, 'epsilon_dpo/beta': 0.08888562768697739, 'epsilon_dpo/loss_margin_mean': 2.162940263748169, 'epsilon_dpo/beta_margin_mean': 0.1920655220746994, 'epsilon_dpo/beta_margin_std': 0.13740301132202148, 'epsilon_dpo/beta_margin_grad_mean': -0.4523911476135254, 'epsilon_dpo/beta_margin_grad_std': 0.03366325423121452, 'kl/beta': 0.08971784263849258, 'kl/avg_steps': 0.9375, 'epoch': 0.04} 4%|████▋ | 28/681 [01:14<28:14, 2.60s/it] 4%|████▉ | 29/681 [01:16<27:31, 2.53s/it] {'loss': 1.1716, 'grad_norm': 141.24661254882812, 'learning_rate': 2.028985507246377e-07, 'rewards/chosen': 0.03065740317106247, 'rewards/rejected': -0.20474430918693542, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.2354017198085785, 'logps/chosen': -32.09503936767578, 'logps/rejected': -104.30924224853516, 'logps/ref_chosen': -32.44408416748047, 'logps/ref_rejected': -101.98307037353516, 'logits/chosen': 
-3.4727797508239746, 'logits/rejected': -3.4470648765563965, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.08808783441781998, 'epsilon_dpo/loss_margin_mean': 2.675217866897583, 'epsilon_dpo/beta_margin_mean': 0.23540179431438446, 'epsilon_dpo/beta_margin_std': 0.1678141951560974, 'epsilon_dpo/beta_margin_grad_mean': -0.4418832063674927, 'epsilon_dpo/beta_margin_grad_std': 0.04080166295170784, 'kl/beta': 0.08888454735279083, 'kl/avg_steps': 0.90625, 'epoch': 0.04} 4%|████▉ | 29/681 [01:16<27:31, 2.53s/it] 4%|█████ | 30/681 [01:19<27:47, 2.56s/it] {'loss': 1.1744, 'grad_norm': 139.83612060546875, 'learning_rate': 2.1014492753623187e-07, 'rewards/chosen': 0.0018627983517944813, 'rewards/rejected': -0.23203766345977783, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.23390044271945953, 'logps/chosen': -39.259483337402344, 'logps/rejected': -106.5517807006836, 'logps/ref_chosen': -39.2830810546875, 'logps/ref_rejected': -103.8922119140625, 'logits/chosen': -3.5117549896240234, 'logits/rejected': -3.389063835144043, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.08729670941829681, 'epsilon_dpo/loss_margin_mean': 2.6831705570220947, 'epsilon_dpo/beta_margin_mean': 0.23390042781829834, 'epsilon_dpo/beta_margin_std': 0.1849268078804016, 'epsilon_dpo/beta_margin_grad_mean': -0.44231948256492615, 'epsilon_dpo/beta_margin_grad_std': 0.04499894008040428, 'kl/beta': 0.08808626234531403, 'kl/avg_steps': 0.90625, 'epoch': 0.04} 4%|█████ | 30/681 [01:19<27:47, 2.56s/it] 5%|█████▏ | 31/681 [01:21<27:57, 2.58s/it] {'loss': 1.1824, 'grad_norm': 121.26469421386719, 'learning_rate': 2.1739130434782607e-07, 'rewards/chosen': 0.02715596929192543, 'rewards/rejected': -0.19907742738723755, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.22623339295387268, 'logps/chosen': -35.737491607666016, 'logps/rejected': -80.65327453613281, 'logps/ref_chosen': -36.05577850341797, 'logps/ref_rejected': 
-78.35195922851562, 'logits/chosen': -3.4779539108276367, 'logits/rejected': -3.431878089904785, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.08656726032495499, 'epsilon_dpo/loss_margin_mean': 2.6196024417877197, 'epsilon_dpo/beta_margin_mean': 0.22623342275619507, 'epsilon_dpo/beta_margin_std': 0.1985906958580017, 'epsilon_dpo/beta_margin_grad_mean': -0.44433295726776123, 'epsilon_dpo/beta_margin_grad_std': 0.04770767688751221, 'kl/beta': 0.08729515224695206, 'kl/avg_steps': 0.84375, 'epoch': 0.05} 5%|█████▏ | 31/681 [01:21<27:57, 2.58s/it] 5%|█████▍ | 32/681 [01:24<27:43, 2.56s/it] {'loss': 1.1284, 'grad_norm': 131.14877319335938, 'learning_rate': 2.2463768115942027e-07, 'rewards/chosen': 0.023832818493247032, 'rewards/rejected': -0.2682601809501648, 'rewards/accuracies': 0.96875, 'rewards/margins': 0.2920929789543152, 'logps/chosen': -32.82677459716797, 'logps/rejected': -98.60302734375, 'logps/ref_chosen': -33.10527420043945, 'logps/ref_rejected': -95.47318267822266, 'logits/chosen': -3.4760398864746094, 'logits/rejected': -3.5285589694976807, 'kl/p_epsilon_steps': 0.96875, 'kl/n_epsilon_steps': 0.03125, 'epsilon_dpo/beta': 0.08576179295778275, 'epsilon_dpo/loss_margin_mean': 3.4083411693573, 'epsilon_dpo/beta_margin_mean': 0.292092889547348, 'epsilon_dpo/beta_margin_std': 0.2318342626094818, 'epsilon_dpo/beta_margin_grad_mean': -0.42864087224006653, 'epsilon_dpo/beta_margin_grad_std': 0.05508127063512802, 'kl/beta': 0.08656476438045502, 'kl/avg_steps': 0.9375, 'epoch': 0.05} 5%|█████▍ | 32/681 [01:24<27:43, 2.56s/it] 5%|█████▌ | 33/681 [01:26<27:33, 2.55s/it] {'loss': 1.1399, 'grad_norm': 127.04460906982422, 'learning_rate': 2.318840579710145e-07, 'rewards/chosen': 0.00047481246292591095, 'rewards/rejected': -0.2803095579147339, 'rewards/accuracies': 0.96875, 'rewards/margins': 0.28078436851501465, 'logps/chosen': -40.38764953613281, 'logps/rejected': -97.78340148925781, 'logps/ref_chosen': -40.39752960205078, 
'logps/ref_rejected': -94.48348999023438, 'logits/chosen': -3.5147736072540283, 'logits/rejected': -3.4672505855560303, 'kl/p_epsilon_steps': 0.96875, 'kl/n_epsilon_steps': 0.03125, 'epsilon_dpo/beta': 0.08496525138616562, 'epsilon_dpo/loss_margin_mean': 3.309786796569824, 'epsilon_dpo/beta_margin_mean': 0.2807844281196594, 'epsilon_dpo/beta_margin_std': 0.24786221981048584, 'epsilon_dpo/beta_margin_grad_mean': -0.43119847774505615, 'epsilon_dpo/beta_margin_grad_std': 0.058498919010162354, 'kl/beta': 0.08576075732707977, 'kl/avg_steps': 0.9375, 'epoch': 0.05} 5%|█████▌ | 33/681 [01:26<27:33, 2.55s/it] 5%|█████▋ | 34/681 [01:29<27:18, 2.53s/it] {'loss': 1.1152, 'grad_norm': 127.52790069580078, 'learning_rate': 2.391304347826087e-07, 'rewards/chosen': -0.0053707570768892765, 'rewards/rejected': -0.31881266832351685, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.3134418725967407, 'logps/chosen': -35.16386413574219, 'logps/rejected': -110.49327850341797, 'logps/ref_chosen': -35.10262680053711, 'logps/ref_rejected': -106.70514678955078, 'logits/chosen': -3.4469704627990723, 'logits/rejected': -3.5294151306152344, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.08430886268615723, 'epsilon_dpo/loss_margin_mean': 3.726898193359375, 'epsilon_dpo/beta_margin_mean': 0.3134419322013855, 'epsilon_dpo/beta_margin_std': 0.27257490158081055, 'epsilon_dpo/beta_margin_grad_mean': -0.42384782433509827, 'epsilon_dpo/beta_margin_grad_std': 0.06483103334903717, 'kl/beta': 0.08496421575546265, 'kl/avg_steps': 0.78125, 'epoch': 0.05} 5%|█████▋ | 34/681 [01:29<27:18, 2.53s/it] 5%|█████▉ | 35/681 [01:31<27:08, 2.52s/it] {'loss': 1.0629, 'grad_norm': 120.23053741455078, 'learning_rate': 2.463768115942029e-07, 'rewards/chosen': -0.019653314724564552, 'rewards/rejected': -0.4028552770614624, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.3832019567489624, 'logps/chosen': -34.6436767578125, 'logps/rejected': -116.70634460449219, 
'logps/ref_chosen': -34.41180419921875, 'logps/ref_rejected': -111.88399505615234, 'logits/chosen': -3.4881181716918945, 'logits/rejected': -3.5153722763061523, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0835762619972229, 'epsilon_dpo/loss_margin_mean': 4.590485095977783, 'epsilon_dpo/beta_margin_mean': 0.3832019567489624, 'epsilon_dpo/beta_margin_std': 0.3155387043952942, 'epsilon_dpo/beta_margin_grad_mean': -0.4079773724079132, 'epsilon_dpo/beta_margin_grad_std': 0.07277688384056091, 'kl/beta': 0.08430557698011398, 'kl/avg_steps': 0.875, 'epoch': 0.05} 5%|█████▉ | 35/681 [01:31<27:08, 2.52s/it] 5%|██████ | 36/681 [01:34<26:50, 2.50s/it] {'loss': 1.0334, 'grad_norm': 105.79562377929688, 'learning_rate': 2.536231884057971e-07, 'rewards/chosen': 0.012363580986857414, 'rewards/rejected': -0.428993821144104, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.4413573741912842, 'logps/chosen': -32.58708572387695, 'logps/rejected': -95.34486389160156, 'logps/ref_chosen': -32.743473052978516, 'logps/ref_rejected': -90.1633529663086, 'logits/chosen': -3.4250617027282715, 'logits/rejected': -3.506716251373291, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.08292967081069946, 'epsilon_dpo/loss_margin_mean': 5.337898254394531, 'epsilon_dpo/beta_margin_mean': 0.4413573443889618, 'epsilon_dpo/beta_margin_std': 0.41608670353889465, 'epsilon_dpo/beta_margin_grad_mean': -0.3959466516971588, 'epsilon_dpo/beta_margin_grad_std': 0.09478563815355301, 'kl/beta': 0.08357430249452591, 'kl/avg_steps': 0.78125, 'epoch': 0.05} 5%|██████ | 36/681 [01:34<26:50, 2.50s/it] 5%|██████▏ | 37/681 [01:36<26:15, 2.45s/it] {'loss': 1.0351, 'grad_norm': 100.5900650024414, 'learning_rate': 2.6086956521739126e-07, 'rewards/chosen': -0.025536730885505676, 'rewards/rejected': -0.46323513984680176, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.4376984238624573, 'logps/chosen': -40.18016052246094, 'logps/rejected': 
-89.59291076660156, 'logps/ref_chosen': -39.88025665283203, 'logps/ref_rejected': -83.95890808105469, 'logits/chosen': -3.4775028228759766, 'logits/rejected': -3.44918155670166, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.08228681236505508, 'epsilon_dpo/loss_margin_mean': 5.334095478057861, 'epsilon_dpo/beta_margin_mean': 0.4376983940601349, 'epsilon_dpo/beta_margin_std': 0.4152112901210785, 'epsilon_dpo/beta_margin_grad_mean': -0.3971908390522003, 'epsilon_dpo/beta_margin_grad_std': 0.09092327207326889, 'kl/beta': 0.082926444709301, 'kl/avg_steps': 0.78125, 'epoch': 0.05} 5%|██████▏ | 37/681 [01:36<26:15, 2.45s/it] 6%|██████▍ | 38/681 [01:38<26:00, 2.43s/it] {'loss': 0.9294, 'grad_norm': 99.69669342041016, 'learning_rate': 2.681159420289855e-07, 'rewards/chosen': -0.0062013790011405945, 'rewards/rejected': -0.6268683671951294, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.6206669807434082, 'logps/chosen': -33.92012405395508, 'logps/rejected': -112.65040588378906, 'logps/ref_chosen': -33.85154342651367, 'logps/ref_rejected': -104.96053314208984, 'logits/chosen': -3.436084270477295, 'logits/rejected': -3.5204224586486816, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.08167463541030884, 'epsilon_dpo/loss_margin_mean': 7.621293067932129, 'epsilon_dpo/beta_margin_mean': 0.6206669211387634, 'epsilon_dpo/beta_margin_std': 0.5732053518295288, 'epsilon_dpo/beta_margin_grad_mean': -0.36130231618881226, 'epsilon_dpo/beta_margin_grad_std': 0.11581370234489441, 'kl/beta': 0.08228360116481781, 'kl/avg_steps': 0.75, 'epoch': 0.06} 6%|██████▍ | 38/681 [01:38<26:00, 2.43s/it] 6%|██████▌ | 39/681 [01:41<26:02, 2.43s/it] {'loss': 0.8922, 'grad_norm': 102.14472198486328, 'learning_rate': 2.753623188405797e-07, 'rewards/chosen': 0.035918496549129486, 'rewards/rejected': -0.6091995239257812, 'rewards/accuracies': 0.96875, 'rewards/margins': 0.645117998123169, 'logps/chosen': -31.436540603637695, 
'logps/rejected': -91.78189086914062, 'logps/ref_chosen': -31.883747100830078, 'logps/ref_rejected': -84.24908447265625, 'logits/chosen': -3.435272216796875, 'logits/rejected': -3.441884994506836, 'kl/p_epsilon_steps': 0.96875, 'kl/n_epsilon_steps': 0.03125, 'epsilon_dpo/beta': 0.0809134915471077, 'epsilon_dpo/loss_margin_mean': 7.98002290725708, 'epsilon_dpo/beta_margin_mean': 0.6451180577278137, 'epsilon_dpo/beta_margin_std': 0.48367542028427124, 'epsilon_dpo/beta_margin_grad_mean': -0.35245731472969055, 'epsilon_dpo/beta_margin_grad_std': 0.09772588312625885, 'kl/beta': 0.08167106658220291, 'kl/avg_steps': 0.9375, 'epoch': 0.06} 6%|██████▌ | 39/681 [01:41<26:02, 2.43s/it] 6%|██████▊ | 40/681 [01:43<25:58, 2.43s/it] {'loss': 0.9118, 'grad_norm': 94.2869644165039, 'learning_rate': 2.8260869565217386e-07, 'rewards/chosen': -0.018150903284549713, 'rewards/rejected': -0.654141366481781, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.6359904408454895, 'logps/chosen': -36.815399169921875, 'logps/rejected': -89.08628845214844, 'logps/ref_chosen': -36.59412384033203, 'logps/ref_rejected': -80.92609405517578, 'logits/chosen': -3.48494291305542, 'logits/rejected': -3.507589101791382, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.08031369745731354, 'epsilon_dpo/loss_margin_mean': 7.938919544219971, 'epsilon_dpo/beta_margin_mean': 0.6359905004501343, 'epsilon_dpo/beta_margin_std': 0.5488929152488708, 'epsilon_dpo/beta_margin_grad_mean': -0.35732436180114746, 'epsilon_dpo/beta_margin_grad_std': 0.10807473212480545, 'kl/beta': 0.0809125155210495, 'kl/avg_steps': 0.75, 'epoch': 0.06} 6%|██████▊ | 40/681 [01:43<25:58, 2.43s/it] 6%|██████▉ | 41/681 [01:46<25:37, 2.40s/it] {'loss': 0.889, 'grad_norm': 101.93110656738281, 'learning_rate': 2.898550724637681e-07, 'rewards/chosen': -0.10497994720935822, 'rewards/rejected': -0.8672504425048828, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.762270450592041, 'logps/chosen': 
-41.283260345458984, 'logps/rejected': -116.42434692382812, 'logps/ref_chosen': -39.986053466796875, 'logps/ref_rejected': -105.53334045410156, 'logits/chosen': -3.483686923980713, 'logits/rejected': -3.4968438148498535, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.07975336164236069, 'epsilon_dpo/loss_margin_mean': 9.593799591064453, 'epsilon_dpo/beta_margin_mean': 0.762270450592041, 'epsilon_dpo/beta_margin_std': 0.7963690161705017, 'epsilon_dpo/beta_margin_grad_mean': -0.3376636803150177, 'epsilon_dpo/beta_margin_grad_std': 0.1441744863986969, 'kl/beta': 0.08031018823385239, 'kl/avg_steps': 0.703125, 'epoch': 0.06} 6%|██████▉ | 41/681 [01:46<25:37, 2.40s/it] 6%|███████ | 42/681 [01:48<26:24, 2.48s/it] {'loss': 0.8297, 'grad_norm': 98.95552062988281, 'learning_rate': 2.971014492753623e-07, 'rewards/chosen': -0.1441619098186493, 'rewards/rejected': -1.0115324258804321, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.86737060546875, 'logps/chosen': -47.57762145996094, 'logps/rejected': -130.83279418945312, 'logps/ref_chosen': -45.769351959228516, 'logps/ref_rejected': -118.03570556640625, 'logits/chosen': -3.436079502105713, 'logits/rejected': -3.551501989364624, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.07915924489498138, 'epsilon_dpo/loss_margin_mean': 10.988822937011719, 'epsilon_dpo/beta_margin_mean': 0.8673705458641052, 'epsilon_dpo/beta_margin_std': 0.8684898018836975, 'epsilon_dpo/beta_margin_grad_mean': -0.3248225450515747, 'epsilon_dpo/beta_margin_grad_std': 0.14031846821308136, 'kl/beta': 0.0797494500875473, 'kl/avg_steps': 0.75, 'epoch': 0.06} 6%|███████ | 42/681 [01:48<26:24, 2.48s/it] 6%|███████▎ | 43/681 [01:51<26:49, 2.52s/it] {'loss': 0.7552, 'grad_norm': 86.0045166015625, 'learning_rate': 3.043478260869565e-07, 'rewards/chosen': -0.010491969995200634, 'rewards/rejected': -0.9838355779647827, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.9733436703681946, 
'logps/chosen': -36.81052017211914, 'logps/rejected': -117.46549224853516, 'logps/ref_chosen': -36.684478759765625, 'logps/ref_rejected': -104.91730499267578, 'logits/chosen': -3.4531750679016113, 'logits/rejected': -3.502927303314209, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.07852049171924591, 'epsilon_dpo/loss_margin_mean': 12.422151565551758, 'epsilon_dpo/beta_margin_mean': 0.9733436703681946, 'epsilon_dpo/beta_margin_std': 0.7786263227462769, 'epsilon_dpo/beta_margin_grad_mean': -0.29879364371299744, 'epsilon_dpo/beta_margin_grad_std': 0.14372758567333221, 'kl/beta': 0.07915578037500381, 'kl/avg_steps': 0.8125, 'epoch': 0.06} 6%|███████▎ | 43/681 [01:51<26:49, 2.52s/it] 6%|███████▍ | 44/681 [01:54<27:27, 2.59s/it] {'loss': 0.6985, 'grad_norm': 85.54306030273438, 'learning_rate': 3.115942028985507e-07, 'rewards/chosen': -0.02512773871421814, 'rewards/rejected': -1.1819229125976562, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.1567951440811157, 'logps/chosen': -28.09503936767578, 'logps/rejected': -132.7843017578125, 'logps/ref_chosen': -27.785930633544922, 'logps/ref_rejected': -117.58551788330078, 'logits/chosen': -3.3832082748413086, 'logits/rejected': -3.4922008514404297, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.07783857733011246, 'epsilon_dpo/loss_margin_mean': 14.889684677124023, 'epsilon_dpo/beta_margin_mean': 1.1567952632904053, 'epsilon_dpo/beta_margin_std': 0.9610118269920349, 'epsilon_dpo/beta_margin_grad_mean': -0.2732856273651123, 'epsilon_dpo/beta_margin_grad_std': 0.15621285140514374, 'kl/beta': 0.07851782441139221, 'kl/avg_steps': 0.875, 'epoch': 0.06} 6%|███████▍ | 44/681 [01:54<27:27, 2.59s/it] 7%|███████▌ | 45/681 [01:56<26:29, 2.50s/it] {'loss': 0.8642, 'grad_norm': 86.84064483642578, 'learning_rate': 3.188405797101449e-07, 'rewards/chosen': -0.14034625887870789, 'rewards/rejected': -0.9687749147415161, 'rewards/accuracies': 0.875, 'rewards/margins': 
0.8284286260604858, 'logps/chosen': -38.12892150878906, 'logps/rejected': -95.89362335205078, 'logps/ref_chosen': -36.33074951171875, 'logps/ref_rejected': -83.34062194824219, 'logits/chosen': -3.4090023040771484, 'logits/rejected': -3.488529920578003, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.07733368128538132, 'epsilon_dpo/loss_margin_mean': 10.75483226776123, 'epsilon_dpo/beta_margin_mean': 0.8284286260604858, 'epsilon_dpo/beta_margin_std': 0.8817453980445862, 'epsilon_dpo/beta_margin_grad_mean': -0.3332192301750183, 'epsilon_dpo/beta_margin_grad_std': 0.15007154643535614, 'kl/beta': 0.07783675193786621, 'kl/avg_steps': 0.65625, 'epoch': 0.07} 7%|███████▌ | 45/681 [01:56<26:29, 2.50s/it] 7%|███████▊ | 46/681 [01:58<26:23, 2.49s/it] {'loss': 0.7386, 'grad_norm': 96.92198181152344, 'learning_rate': 3.260869565217391e-07, 'rewards/chosen': -0.13517028093338013, 'rewards/rejected': -1.2355304956436157, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.1003601551055908, 'logps/chosen': -37.75640106201172, 'logps/rejected': -117.52153015136719, 'logps/ref_chosen': -36.0171012878418, 'logps/ref_rejected': -101.40194702148438, 'logits/chosen': -3.375389814376831, 'logits/rejected': -3.439635992050171, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.07675698399543762, 'epsilon_dpo/loss_margin_mean': 14.38028621673584, 'epsilon_dpo/beta_margin_mean': 1.1003601551055908, 'epsilon_dpo/beta_margin_std': 1.0318849086761475, 'epsilon_dpo/beta_margin_grad_mean': -0.2891290783882141, 'epsilon_dpo/beta_margin_grad_std': 0.1580805480480194, 'kl/beta': 0.07732927799224854, 'kl/avg_steps': 0.75, 'epoch': 0.07} 7%|███████▊ | 46/681 [01:58<26:23, 2.49s/it] 7%|███████▉ | 47/681 [02:01<26:19, 2.49s/it] {'loss': 0.8604, 'grad_norm': 94.45354461669922, 'learning_rate': 3.333333333333333e-07, 'rewards/chosen': -0.19145357608795166, 'rewards/rejected': -1.1218818426132202, 'rewards/accuracies': 0.890625, 
'rewards/margins': 0.9304282665252686, 'logps/chosen': -44.316593170166016, 'logps/rejected': -116.04118347167969, 'logps/ref_chosen': -41.82904815673828, 'logps/ref_rejected': -101.29283142089844, 'logits/chosen': -3.4388890266418457, 'logits/rejected': -3.5206193923950195, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.07616160064935684, 'epsilon_dpo/loss_margin_mean': 12.260807037353516, 'epsilon_dpo/beta_margin_mean': 0.9304282069206238, 'epsilon_dpo/beta_margin_std': 1.0573234558105469, 'epsilon_dpo/beta_margin_grad_mean': -0.31451866030693054, 'epsilon_dpo/beta_margin_grad_std': 0.16808247566223145, 'kl/beta': 0.07675362378358841, 'kl/avg_steps': 0.78125, 'epoch': 0.07} 7%|███████▉ | 47/681 [02:01<26:19, 2.49s/it] 7%|████████ | 48/681 [02:04<27:16, 2.59s/it] {'loss': 0.7607, 'grad_norm': 90.87052154541016, 'learning_rate': 3.4057971014492755e-07, 'rewards/chosen': -0.19789756834506989, 'rewards/rejected': -1.3635492324829102, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.1656516790390015, 'logps/chosen': -38.7716064453125, 'logps/rejected': -113.47309875488281, 'logps/ref_chosen': -36.18339920043945, 'logps/ref_rejected': -95.41502380371094, 'logits/chosen': -3.463271141052246, 'logits/rejected': -3.4467902183532715, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.07571401447057724, 'epsilon_dpo/loss_margin_mean': 15.469880104064941, 'epsilon_dpo/beta_margin_mean': 1.1656516790390015, 'epsilon_dpo/beta_margin_std': 1.2263123989105225, 'epsilon_dpo/beta_margin_grad_mean': -0.29027602076530457, 'epsilon_dpo/beta_margin_grad_std': 0.17632031440734863, 'kl/beta': 0.07615863531827927, 'kl/avg_steps': 0.59375, 'epoch': 0.07} 7%|████████ | 48/681 [02:04<27:16, 2.59s/it] 7%|████████▎ | 49/681 [02:06<26:45, 2.54s/it] {'loss': 0.6703, 'grad_norm': 81.80361938476562, 'learning_rate': 3.478260869565217e-07, 'rewards/chosen': -0.118332639336586, 'rewards/rejected': -1.4788892269134521, 
'rewards/accuracies': 0.90625, 'rewards/margins': 1.3605566024780273, 'logps/chosen': -37.51390075683594, 'logps/rejected': -111.02618408203125, 'logps/ref_chosen': -35.96546936035156, 'logps/ref_rejected': -91.31820678710938, 'logits/chosen': -3.472126007080078, 'logits/rejected': -3.5396313667297363, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.075148805975914, 'epsilon_dpo/loss_margin_mean': 18.15953826904297, 'epsilon_dpo/beta_margin_mean': 1.3605566024780273, 'epsilon_dpo/beta_margin_std': 1.2057913541793823, 'epsilon_dpo/beta_margin_grad_mean': -0.2587027847766876, 'epsilon_dpo/beta_margin_grad_std': 0.17777220904827118, 'kl/beta': 0.07570911198854446, 'kl/avg_steps': 0.75, 'epoch': 0.07} 7%|████████▎ | 49/681 [02:06<26:45, 2.54s/it] 7%|████████▍ | 50/681 [02:09<26:17, 2.50s/it] {'loss': 0.7726, 'grad_norm': 100.63481903076172, 'learning_rate': 3.5507246376811595e-07, 'rewards/chosen': -0.30415183305740356, 'rewards/rejected': -1.6275885105133057, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.3234367370605469, 'logps/chosen': -46.16408920288086, 'logps/rejected': -122.25482177734375, 'logps/ref_chosen': -42.138206481933594, 'logps/ref_rejected': -100.4173583984375, 'logits/chosen': -3.4349403381347656, 'logits/rejected': -3.533860445022583, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.07468333095312119, 'epsilon_dpo/loss_margin_mean': 17.81157684326172, 'epsilon_dpo/beta_margin_mean': 1.3234366178512573, 'epsilon_dpo/beta_margin_std': 1.4268280267715454, 'epsilon_dpo/beta_margin_grad_mean': -0.27890822291374207, 'epsilon_dpo/beta_margin_grad_std': 0.21263383328914642, 'kl/beta': 0.07514552026987076, 'kl/avg_steps': 0.625, 'epoch': 0.07} 7%|████████▍ | 50/681 [02:09<26:17, 2.50s/it] 7%|████████▌ | 51/681 [02:11<26:21, 2.51s/it] {'loss': 0.7605, 'grad_norm': 91.88065338134766, 'learning_rate': 3.6231884057971015e-07, 'rewards/chosen': -0.3597980737686157, 'rewards/rejected': 
-1.7088831663131714, 'rewards/accuracies': 0.875, 'rewards/margins': 1.3490850925445557, 'logps/chosen': -43.83557891845703, 'logps/rejected': -103.68801879882812, 'logps/ref_chosen': -39.016597747802734, 'logps/ref_rejected': -80.60652160644531, 'logits/chosen': -3.377105236053467, 'logits/rejected': -3.4192728996276855, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.07421945780515671, 'epsilon_dpo/loss_margin_mean': 18.262516021728516, 'epsilon_dpo/beta_margin_mean': 1.3490850925445557, 'epsilon_dpo/beta_margin_std': 1.5637013912200928, 'epsilon_dpo/beta_margin_grad_mean': -0.28537219762802124, 'epsilon_dpo/beta_margin_grad_std': 0.19426609575748444, 'kl/beta': 0.07467877864837646, 'kl/avg_steps': 0.625, 'epoch': 0.07} 7%|████████▌ | 51/681 [02:11<26:21, 2.51s/it] 8%|████████▊ | 52/681 [02:14<26:30, 2.53s/it] {'loss': 0.6001, 'grad_norm': 87.76582336425781, 'learning_rate': 3.695652173913043e-07, 'rewards/chosen': -0.3565700650215149, 'rewards/rejected': -2.0638861656188965, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.7073161602020264, 'logps/chosen': -39.097511291503906, 'logps/rejected': -114.01925659179688, 'logps/ref_chosen': -34.285945892333984, 'logps/ref_rejected': -85.96109008789062, 'logits/chosen': -3.4000444412231445, 'logits/rejected': -3.473278284072876, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.07364249974489212, 'epsilon_dpo/loss_margin_mean': 23.246597290039062, 'epsilon_dpo/beta_margin_mean': 1.7073160409927368, 'epsilon_dpo/beta_margin_std': 1.5912011861801147, 'epsilon_dpo/beta_margin_grad_mean': -0.23005272448062897, 'epsilon_dpo/beta_margin_grad_std': 0.1819276660680771, 'kl/beta': 0.07421493530273438, 'kl/avg_steps': 0.78125, 'epoch': 0.08} 8%|████████▊ | 52/681 [02:14<26:30, 2.53s/it] 8%|████████▉ | 53/681 [02:16<26:43, 2.55s/it] {'loss': 0.558, 'grad_norm': 85.12645721435547, 'learning_rate': 3.7681159420289855e-07, 'rewards/chosen': 
-0.5423527359962463, 'rewards/rejected': -2.3573391437530518, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.8149864673614502, 'logps/chosen': -42.113983154296875, 'logps/rejected': -129.96218872070312, 'logps/ref_chosen': -34.706817626953125, 'logps/ref_rejected': -97.64952087402344, 'logits/chosen': -3.3786673545837402, 'logits/rejected': -3.4282350540161133, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.07307162135839462, 'epsilon_dpo/loss_margin_mean': 24.905494689941406, 'epsilon_dpo/beta_margin_mean': 1.8149864673614502, 'epsilon_dpo/beta_margin_std': 1.6574300527572632, 'epsilon_dpo/beta_margin_grad_mean': -0.21757791936397552, 'epsilon_dpo/beta_margin_grad_std': 0.17687389254570007, 'kl/beta': 0.07363962382078171, 'kl/avg_steps': 0.78125, 'epoch': 0.08} 8%|████████▉ | 53/681 [02:16<26:43, 2.55s/it] 8%|█████████ | 54/681 [02:19<26:08, 2.50s/it] {'loss': 0.6017, 'grad_norm': 83.51258850097656, 'learning_rate': 3.8405797101449274e-07, 'rewards/chosen': -0.46132320165634155, 'rewards/rejected': -2.206540822982788, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.7452175617218018, 'logps/chosen': -46.09777069091797, 'logps/rejected': -129.1588592529297, 'logps/ref_chosen': -39.777854919433594, 'logps/ref_rejected': -98.70614624023438, 'logits/chosen': -3.400343656539917, 'logits/rejected': -3.5012292861938477, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.07252801209688187, 'epsilon_dpo/loss_margin_mean': 24.13280487060547, 'epsilon_dpo/beta_margin_mean': 1.7452175617218018, 'epsilon_dpo/beta_margin_std': 1.7455625534057617, 'epsilon_dpo/beta_margin_grad_mean': -0.23283910751342773, 'epsilon_dpo/beta_margin_grad_std': 0.18114687502384186, 'kl/beta': 0.0730687752366066, 'kl/avg_steps': 0.75, 'epoch': 0.08} 8%|█████████ | 54/681 [02:19<26:08, 2.50s/it] 8%|█████████▎ | 55/681 [02:21<25:34, 2.45s/it] {'loss': 0.7261, 'grad_norm': 104.0904541015625, 'learning_rate': 
3.9130434782608694e-07, 'rewards/chosen': -0.7076321840286255, 'rewards/rejected': -2.3510682582855225, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.643436074256897, 'logps/chosen': -58.18317794799805, 'logps/rejected': -126.38877868652344, 'logps/ref_chosen': -48.42914962768555, 'logps/ref_rejected': -93.72831726074219, 'logits/chosen': -3.390665054321289, 'logits/rejected': -3.4186453819274902, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.07214676588773727, 'epsilon_dpo/loss_margin_mean': 22.906444549560547, 'epsilon_dpo/beta_margin_mean': 1.643436074256897, 'epsilon_dpo/beta_margin_std': 1.8868768215179443, 'epsilon_dpo/beta_margin_grad_mean': -0.2661394476890564, 'epsilon_dpo/beta_margin_grad_std': 0.21455174684524536, 'kl/beta': 0.07252483814954758, 'kl/avg_steps': 0.53125, 'epoch': 0.08} 8%|█████████▎ | 55/681 [02:21<25:34, 2.45s/it] 8%|█████████▍ | 56/681 [02:24<26:06, 2.51s/it] {'loss': 0.713, 'grad_norm': 123.8470458984375, 'learning_rate': 3.9855072463768114e-07, 'rewards/chosen': -0.5154626369476318, 'rewards/rejected': -2.4932093620300293, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.9777467250823975, 'logps/chosen': -38.82612991333008, 'logps/rejected': -135.484375, 'logps/ref_chosen': -31.692344665527344, 'logps/ref_rejected': -100.61968994140625, 'logits/chosen': -3.2844839096069336, 'logits/rejected': -3.3951363563537598, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.07168648391962051, 'epsilon_dpo/loss_margin_mean': 27.73088836669922, 'epsilon_dpo/beta_margin_mean': 1.977746605873108, 'epsilon_dpo/beta_margin_std': 2.0250532627105713, 'epsilon_dpo/beta_margin_grad_mean': -0.2267267107963562, 'epsilon_dpo/beta_margin_grad_std': 0.24484246969223022, 'kl/beta': 0.07214158773422241, 'kl/avg_steps': 0.640625, 'epoch': 0.08} 8%|█████████▍ | 56/681 [02:24<26:06, 2.51s/it] 8%|█████████▋ | 57/681 [02:26<26:18, 2.53s/it] {'loss': 0.6021, 'grad_norm': 
102.96115112304688, 'learning_rate': 4.057971014492754e-07, 'rewards/chosen': -0.5308903455734253, 'rewards/rejected': -2.707615375518799, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.176724910736084, 'logps/chosen': -45.702720642089844, 'logps/rejected': -139.84365844726562, 'logps/ref_chosen': -38.302345275878906, 'logps/ref_rejected': -101.74482727050781, 'logits/chosen': -3.386384963989258, 'logits/rejected': -3.439547538757324, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.07115186750888824, 'epsilon_dpo/loss_margin_mean': 30.698436737060547, 'epsilon_dpo/beta_margin_mean': 2.176724910736084, 'epsilon_dpo/beta_margin_std': 2.1261887550354004, 'epsilon_dpo/beta_margin_grad_mean': -0.21097464859485626, 'epsilon_dpo/beta_margin_grad_std': 0.22179222106933594, 'kl/beta': 0.07168237119913101, 'kl/avg_steps': 0.75, 'epoch': 0.08} 8%|█████████▋ | 57/681 [02:26<26:18, 2.53s/it] 9%|█████████▊ | 58/681 [02:29<25:52, 2.49s/it] {'loss': 0.4979, 'grad_norm': 87.39006805419922, 'learning_rate': 4.1304347826086954e-07, 'rewards/chosen': -0.5473577976226807, 'rewards/rejected': -2.704098701477051, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.15674090385437, 'logps/chosen': -46.135406494140625, 'logps/rejected': -127.86618041992188, 'logps/ref_chosen': -38.44845962524414, 'logps/ref_rejected': -89.55912780761719, 'logits/chosen': -3.3047280311584473, 'logits/rejected': -3.393859386444092, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.07062220573425293, 'epsilon_dpo/loss_margin_mean': 30.6201114654541, 'epsilon_dpo/beta_margin_mean': 2.15674090385437, 'epsilon_dpo/beta_margin_std': 2.0642378330230713, 'epsilon_dpo/beta_margin_grad_mean': -0.1960882991552353, 'epsilon_dpo/beta_margin_grad_std': 0.17183293402194977, 'kl/beta': 0.07114876061677933, 'kl/avg_steps': 0.75, 'epoch': 0.09} 9%|█████████▊ | 58/681 [02:29<25:52, 2.49s/it] 9%|█████████▉ | 59/681 [02:31<25:37, 2.47s/it] {'loss': 0.6172, 
'grad_norm': 87.90422058105469, 'learning_rate': 4.2028985507246374e-07, 'rewards/chosen': -0.5922181010246277, 'rewards/rejected': -2.7694854736328125, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.177267074584961, 'logps/chosen': -46.44553756713867, 'logps/rejected': -133.6805419921875, 'logps/ref_chosen': -38.029998779296875, 'logps/ref_rejected': -94.10072326660156, 'logits/chosen': -3.3299951553344727, 'logits/rejected': -3.408975124359131, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0700632631778717, 'epsilon_dpo/loss_margin_mean': 31.16427993774414, 'epsilon_dpo/beta_margin_mean': 2.177267074584961, 'epsilon_dpo/beta_margin_std': 2.287863254547119, 'epsilon_dpo/beta_margin_grad_mean': -0.21717199683189392, 'epsilon_dpo/beta_margin_grad_std': 0.2148067206144333, 'kl/beta': 0.0706191137433052, 'kl/avg_steps': 0.796875, 'epoch': 0.09} 9%|█████████▉ | 59/681 [02:31<25:37, 2.47s/it] 9%|██████████▏ | 60/681 [02:34<26:17, 2.54s/it] {'loss': 0.9335, 'grad_norm': 167.74856567382812, 'learning_rate': 4.2753623188405794e-07, 'rewards/chosen': -0.9826135635375977, 'rewards/rejected': -2.746914863586426, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.7643013000488281, 'logps/chosen': -62.78964614868164, 'logps/rejected': -130.8575439453125, 'logps/ref_chosen': -48.789947509765625, 'logps/ref_rejected': -91.3543701171875, 'logits/chosen': -3.3077573776245117, 'logits/rejected': -3.2960987091064453, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.06962990015745163, 'epsilon_dpo/loss_margin_mean': 25.50347328186035, 'epsilon_dpo/beta_margin_mean': 1.7643014192581177, 'epsilon_dpo/beta_margin_std': 2.2814714908599854, 'epsilon_dpo/beta_margin_grad_mean': -0.24627114832401276, 'epsilon_dpo/beta_margin_grad_std': 0.2656000852584839, 'kl/beta': 0.07006081938743591, 'kl/avg_steps': 0.625, 'epoch': 0.09} 9%|██████████▏ | 60/681 [02:34<26:17, 2.54s/it] 9%|██████████▎ | 61/681 [02:36<26:18, 
2.55s/it] {'loss': 0.6385, 'grad_norm': 96.6867446899414, 'learning_rate': 4.3478260869565214e-07, 'rewards/chosen': -0.7702499628067017, 'rewards/rejected': -2.9774599075317383, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.207209825515747, 'logps/chosen': -47.05156707763672, 'logps/rejected': -138.587158203125, 'logps/ref_chosen': -35.972103118896484, 'logps/ref_rejected': -95.45098876953125, 'logits/chosen': -3.313138484954834, 'logits/rejected': -3.4350461959838867, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.06911037117242813, 'epsilon_dpo/loss_margin_mean': 32.056705474853516, 'epsilon_dpo/beta_margin_mean': 2.207209825515747, 'epsilon_dpo/beta_margin_std': 2.3618977069854736, 'epsilon_dpo/beta_margin_grad_mean': -0.20471161603927612, 'epsilon_dpo/beta_margin_grad_std': 0.22111034393310547, 'kl/beta': 0.06962565332651138, 'kl/avg_steps': 0.75, 'epoch': 0.09} 9%|██████████▎ | 61/681 [02:36<26:18, 2.55s/it] 9%|██████████▍ | 62/681 [02:39<26:16, 2.55s/it] {'loss': 0.6249, 'grad_norm': 91.73394775390625, 'learning_rate': 4.420289855072464e-07, 'rewards/chosen': -0.6530295610427856, 'rewards/rejected': -2.504047155380249, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.8510175943374634, 'logps/chosen': -45.381500244140625, 'logps/rejected': -119.45492553710938, 'logps/ref_chosen': -35.904327392578125, 'logps/ref_rejected': -82.9093017578125, 'logits/chosen': -3.2718987464904785, 'logits/rejected': -3.3174455165863037, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0686391070485115, 'epsilon_dpo/loss_margin_mean': 27.06845474243164, 'epsilon_dpo/beta_margin_mean': 1.8510175943374634, 'epsilon_dpo/beta_margin_std': 1.8581922054290771, 'epsilon_dpo/beta_margin_grad_mean': -0.2324148714542389, 'epsilon_dpo/beta_margin_grad_std': 0.20171168446540833, 'kl/beta': 0.06910735368728638, 'kl/avg_steps': 0.6875, 'epoch': 0.09} 9%|██████████▍ | 62/681 [02:39<26:16, 2.55s/it] 9%|██████████▋ | 
63/681 [02:41<25:58, 2.52s/it] {'loss': 0.6897, 'grad_norm': 102.05279541015625, 'learning_rate': 4.4927536231884053e-07, 'rewards/chosen': -0.9690430164337158, 'rewards/rejected': -3.1313982009887695, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.1623551845550537, 'logps/chosen': -60.41229248046875, 'logps/rejected': -150.15708923339844, 'logps/ref_chosen': -46.25957107543945, 'logps/ref_rejected': -104.15571594238281, 'logits/chosen': -3.3514161109924316, 'logits/rejected': -3.4100210666656494, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.06817042827606201, 'epsilon_dpo/loss_margin_mean': 31.848644256591797, 'epsilon_dpo/beta_margin_mean': 2.1623551845550537, 'epsilon_dpo/beta_margin_std': 2.539600372314453, 'epsilon_dpo/beta_margin_grad_mean': -0.22793808579444885, 'epsilon_dpo/beta_margin_grad_std': 0.2142634242773056, 'kl/beta': 0.0686354786157608, 'kl/avg_steps': 0.6875, 'epoch': 0.09} 9%|██████████▋ | 63/681 [02:41<25:58, 2.52s/it] 9%|██████████▊ | 64/681 [02:44<25:37, 2.49s/it] {'loss': 0.5255, 'grad_norm': 95.53367614746094, 'learning_rate': 4.5652173913043473e-07, 'rewards/chosen': -0.6991441249847412, 'rewards/rejected': -3.055689811706543, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.3565456867218018, 'logps/chosen': -44.803466796875, 'logps/rejected': -146.42654418945312, 'logps/ref_chosen': -34.512210845947266, 'logps/ref_rejected': -101.21166229248047, 'logits/chosen': -3.3310346603393555, 'logits/rejected': -3.3657891750335693, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.06770496070384979, 'epsilon_dpo/loss_margin_mean': 34.92362594604492, 'epsilon_dpo/beta_margin_mean': 2.3565456867218018, 'epsilon_dpo/beta_margin_std': 2.4017789363861084, 'epsilon_dpo/beta_margin_grad_mean': -0.19847899675369263, 'epsilon_dpo/beta_margin_grad_std': 0.19424547255039215, 'kl/beta': 0.0681668370962143, 'kl/avg_steps': 0.6875, 'epoch': 0.09} 9%|██████████▊ | 64/681 
[02:44<25:37, 2.49s/it] 10%|██████████▉ | 65/681 [02:46<25:44, 2.51s/it] {'loss': 0.5986, 'grad_norm': 93.06818389892578, 'learning_rate': 4.63768115942029e-07, 'rewards/chosen': -0.6990154981613159, 'rewards/rejected': -3.1816627979278564, 'rewards/accuracies': 0.859375, 'rewards/margins': 2.48264741897583, 'logps/chosen': -53.97098159790039, 'logps/rejected': -162.62474060058594, 'logps/ref_chosen': -43.620361328125, 'logps/ref_rejected': -115.21531677246094, 'logits/chosen': -3.3164443969726562, 'logits/rejected': -3.343629837036133, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.06730613857507706, 'epsilon_dpo/loss_margin_mean': 37.058799743652344, 'epsilon_dpo/beta_margin_mean': 2.48264741897583, 'epsilon_dpo/beta_margin_std': 2.5546720027923584, 'epsilon_dpo/beta_margin_grad_mean': -0.21261604130268097, 'epsilon_dpo/beta_margin_grad_std': 0.22306111454963684, 'kl/beta': 0.06770138442516327, 'kl/avg_steps': 0.59375, 'epoch': 0.1} 10%|██████████▉ | 65/681 [02:46<25:44, 2.51s/it] 10%|███████████▏ | 66/681 [02:49<25:51, 2.52s/it] {'loss': 0.7582, 'grad_norm': 116.8039321899414, 'learning_rate': 4.7101449275362313e-07, 'rewards/chosen': -0.7809337973594666, 'rewards/rejected': -2.8556485176086426, 'rewards/accuracies': 0.796875, 'rewards/margins': 2.0747146606445312, 'logps/chosen': -49.09389877319336, 'logps/rejected': -123.10272216796875, 'logps/ref_chosen': -37.514625549316406, 'logps/ref_rejected': -80.34272766113281, 'logits/chosen': -3.2685537338256836, 'logits/rejected': -3.3573737144470215, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0669299066066742, 'epsilon_dpo/loss_margin_mean': 31.18071746826172, 'epsilon_dpo/beta_margin_mean': 2.0747146606445312, 'epsilon_dpo/beta_margin_std': 2.3843464851379395, 'epsilon_dpo/beta_margin_grad_mean': -0.24468040466308594, 'epsilon_dpo/beta_margin_grad_std': 0.2503945231437683, 'kl/beta': 0.06730178743600845, 'kl/avg_steps': 0.5625, 'epoch': 
0.1} 10%|███████████▏ | 66/681 [02:49<25:51, 2.52s/it] 10%|███████████▎ | 67/681 [02:51<25:21, 2.48s/it] {'loss': 0.7213, 'grad_norm': 115.71697998046875, 'learning_rate': 4.782608695652174e-07, 'rewards/chosen': -0.8721305727958679, 'rewards/rejected': -2.741394519805908, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.8692641258239746, 'logps/chosen': -51.816200256347656, 'logps/rejected': -119.65333557128906, 'logps/ref_chosen': -38.82200622558594, 'logps/ref_rejected': -78.41658782958984, 'logits/chosen': -3.2035374641418457, 'logits/rejected': -3.216823101043701, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0665346160531044, 'epsilon_dpo/loss_margin_mean': 28.2425479888916, 'epsilon_dpo/beta_margin_mean': 1.869264006614685, 'epsilon_dpo/beta_margin_std': 2.120335340499878, 'epsilon_dpo/beta_margin_grad_mean': -0.2480592578649521, 'epsilon_dpo/beta_margin_grad_std': 0.23476538062095642, 'kl/beta': 0.06692533195018768, 'kl/avg_steps': 0.59375, 'epoch': 0.1} 10%|███████████▎ | 67/681 [02:51<25:21, 2.48s/it] 10%|███████████▍ | 68/681 [02:54<25:34, 2.50s/it] {'loss': 0.5268, 'grad_norm': 89.41032409667969, 'learning_rate': 4.855072463768116e-07, 'rewards/chosen': -0.7523297071456909, 'rewards/rejected': -2.9893717765808105, 'rewards/accuracies': 0.875, 'rewards/margins': 2.237042188644409, 'logps/chosen': -53.25446319580078, 'logps/rejected': -127.92015075683594, 'logps/ref_chosen': -41.910316467285156, 'logps/ref_rejected': -82.59764862060547, 'logits/chosen': -3.2750229835510254, 'logits/rejected': -3.2742607593536377, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.06605872511863708, 'epsilon_dpo/loss_margin_mean': 33.978355407714844, 'epsilon_dpo/beta_margin_mean': 2.237042188644409, 'epsilon_dpo/beta_margin_std': 1.8940068483352661, 'epsilon_dpo/beta_margin_grad_mean': -0.19077911972999573, 'epsilon_dpo/beta_margin_grad_std': 0.20632153749465942, 'kl/beta': 0.06653030216693878, 
'kl/avg_steps': 0.71875, 'epoch': 0.1} 10%|███████████▍ | 68/681 [02:54<25:34, 2.50s/it] 10%|███████████▋ | 69/681 [02:56<25:59, 2.55s/it] {'loss': 0.5513, 'grad_norm': 85.71179962158203, 'learning_rate': 4.927536231884058e-07, 'rewards/chosen': -0.910045325756073, 'rewards/rejected': -3.3686389923095703, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.4585933685302734, 'logps/chosen': -56.83376693725586, 'logps/rejected': -162.74642944335938, 'logps/ref_chosen': -42.98963165283203, 'logps/ref_rejected': -111.28137969970703, 'logits/chosen': -3.309999704360962, 'logits/rejected': -3.2920103073120117, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.06550473719835281, 'epsilon_dpo/loss_margin_mean': 37.620914459228516, 'epsilon_dpo/beta_margin_mean': 2.4585933685302734, 'epsilon_dpo/beta_margin_std': 2.252986431121826, 'epsilon_dpo/beta_margin_grad_mean': -0.1768295019865036, 'epsilon_dpo/beta_margin_grad_std': 0.19139200448989868, 'kl/beta': 0.066055528819561, 'kl/avg_steps': 0.84375, 'epoch': 0.1} 10%|███████████▋ | 69/681 [02:56<25:59, 2.55s/it] 10%|███████████▊ | 70/681 [02:59<25:41, 2.52s/it] {'loss': 0.5016, 'grad_norm': 88.73102569580078, 'learning_rate': 5e-07, 'rewards/chosen': -0.781528651714325, 'rewards/rejected': -3.0996618270874023, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.3181333541870117, 'logps/chosen': -55.655487060546875, 'logps/rejected': -144.90789794921875, 'logps/ref_chosen': -43.68109130859375, 'logps/ref_rejected': -97.17718505859375, 'logits/chosen': -3.2341670989990234, 'logits/rejected': -3.3183038234710693, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.06497713923454285, 'epsilon_dpo/loss_margin_mean': 35.75630187988281, 'epsilon_dpo/beta_margin_mean': 2.3181331157684326, 'epsilon_dpo/beta_margin_std': 2.1517505645751953, 'epsilon_dpo/beta_margin_grad_mean': -0.1917823851108551, 'epsilon_dpo/beta_margin_grad_std': 0.18596571683883667, 'kl/beta': 
0.06550285220146179, 'kl/avg_steps': 0.8125, 'epoch': 0.1} 10%|███████████▊ | 70/681 [02:59<25:41, 2.52s/it] 10%|███████████▉ | 71/681 [03:01<25:55, 2.55s/it] {'loss': 0.5652, 'grad_norm': 100.56282806396484, 'learning_rate': 4.999967061337492e-07, 'rewards/chosen': -0.8535667657852173, 'rewards/rejected': -3.5020976066589355, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.6485307216644287, 'logps/chosen': -54.0555419921875, 'logps/rejected': -158.8681182861328, 'logps/ref_chosen': -40.898582458496094, 'logps/ref_rejected': -104.50498962402344, 'logits/chosen': -3.228590488433838, 'logits/rejected': -3.3432466983795166, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0644940659403801, 'epsilon_dpo/loss_margin_mean': 41.206172943115234, 'epsilon_dpo/beta_margin_mean': 2.6485307216644287, 'epsilon_dpo/beta_margin_std': 2.504869222640991, 'epsilon_dpo/beta_margin_grad_mean': -0.18719229102134705, 'epsilon_dpo/beta_margin_grad_std': 0.23163102567195892, 'kl/beta': 0.06497492641210556, 'kl/avg_steps': 0.75, 'epoch': 0.1} 10%|███████████▉ | 71/681 [03:02<25:55, 2.55s/it] 11%|████████████▏ | 72/681 [03:04<25:42, 2.53s/it] {'loss': 0.4074, 'grad_norm': 71.43647766113281, 'learning_rate': 4.999868246217933e-07, 'rewards/chosen': -0.8116366863250732, 'rewards/rejected': -3.40474271774292, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.593106269836426, 'logps/chosen': -54.79957580566406, 'logps/rejected': -155.27841186523438, 'logps/ref_chosen': -42.15618896484375, 'logps/ref_rejected': -102.02656555175781, 'logits/chosen': -3.2843093872070312, 'logits/rejected': -3.316100835800171, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.06397364288568497, 'epsilon_dpo/loss_margin_mean': 40.60845184326172, 'epsilon_dpo/beta_margin_mean': 2.5931060314178467, 'epsilon_dpo/beta_margin_std': 2.0323598384857178, 'epsilon_dpo/beta_margin_grad_mean': -0.15650521218776703, 'epsilon_dpo/beta_margin_grad_std': 
0.17222045361995697, 'kl/beta': 0.06449124217033386, 'kl/avg_steps': 0.8125, 'epoch': 0.11} 11%|████████████▏ | 72/681 [03:04<25:42, 2.53s/it] 11%|████████████▎ | 73/681 [03:07<26:25, 2.61s/it] {'loss': 0.5577, 'grad_norm': 75.35091400146484, 'learning_rate': 4.999703557245192e-07, 'rewards/chosen': -0.7197288870811462, 'rewards/rejected': -3.2321105003356934, 'rewards/accuracies': 0.859375, 'rewards/margins': 2.5123815536499023, 'logps/chosen': -55.1505126953125, 'logps/rejected': -147.10968017578125, 'logps/ref_chosen': -43.86912155151367, 'logps/ref_rejected': -96.146728515625, 'logits/chosen': -3.2197837829589844, 'logits/rejected': -3.2296009063720703, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.06355801969766617, 'epsilon_dpo/loss_margin_mean': 39.681556701660156, 'epsilon_dpo/beta_margin_mean': 2.5123815536499023, 'epsilon_dpo/beta_margin_std': 2.4150218963623047, 'epsilon_dpo/beta_margin_grad_mean': -0.2032501995563507, 'epsilon_dpo/beta_margin_grad_std': 0.21837185323238373, 'kl/beta': 0.06397147476673126, 'kl/avg_steps': 0.65625, 'epoch': 0.11} 11%|████████████▎ | 73/681 [03:07<26:25, 2.61s/it] 11%|████████████▍ | 74/681 [03:09<25:47, 2.55s/it] {'loss': 0.4061, 'grad_norm': 65.50677490234375, 'learning_rate': 4.999472998758977e-07, 'rewards/chosen': -0.6169643998146057, 'rewards/rejected': -3.4588942527770996, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.8419299125671387, 'logps/chosen': -38.77723693847656, 'logps/rejected': -157.6680145263672, 'logps/ref_chosen': -29.008399963378906, 'logps/ref_rejected': -102.72833251953125, 'logits/chosen': -3.147824764251709, 'logits/rejected': -3.291006088256836, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.06304432451725006, 'epsilon_dpo/loss_margin_mean': 45.170841217041016, 'epsilon_dpo/beta_margin_mean': 2.8419299125671387, 'epsilon_dpo/beta_margin_std': 2.680656909942627, 'epsilon_dpo/beta_margin_grad_mean': 
-0.15927036106586456, 'epsilon_dpo/beta_margin_grad_std': 0.18017472326755524, 'kl/beta': 0.06355439871549606, 'kl/avg_steps': 0.8125, 'epoch': 0.11} 11%|████████████▍ | 74/681 [03:09<25:47, 2.55s/it] 11%|████████████▋ | 75/681 [03:12<25:24, 2.52s/it] {'loss': 0.3702, 'grad_norm': 62.55946731567383, 'learning_rate': 4.999176576834721e-07, 'rewards/chosen': -0.9213298559188843, 'rewards/rejected': -3.7345452308654785, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.8132152557373047, 'logps/chosen': -45.443458557128906, 'logps/rejected': -177.2479705810547, 'logps/ref_chosen': -30.710708618164062, 'logps/ref_rejected': -117.44107818603516, 'logits/chosen': -3.151859760284424, 'logits/rejected': -3.281860828399658, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.062496818602085114, 'epsilon_dpo/loss_margin_mean': 45.07414245605469, 'epsilon_dpo/beta_margin_mean': 2.8132150173187256, 'epsilon_dpo/beta_margin_std': 2.3551828861236572, 'epsilon_dpo/beta_margin_grad_mean': -0.15049783885478973, 'epsilon_dpo/beta_margin_grad_std': 0.1585647463798523, 'kl/beta': 0.06304218620061874, 'kl/avg_steps': 0.875, 'epoch': 0.11} 11%|████████████▋ | 75/681 [03:12<25:24, 2.52s/it] 11%|████████████▊ | 76/681 [03:14<25:23, 2.52s/it] {'loss': 0.5828, 'grad_norm': 96.22077941894531, 'learning_rate': 4.998814299283415e-07, 'rewards/chosen': -0.8483954668045044, 'rewards/rejected': -3.050872802734375, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.20247745513916, 'logps/chosen': -48.6707763671875, 'logps/rejected': -134.10556030273438, 'logps/ref_chosen': -35.03684997558594, 'logps/ref_rejected': -84.84458923339844, 'logits/chosen': -3.207365036010742, 'logits/rejected': -3.260375499725342, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.062013305723667145, 'epsilon_dpo/loss_margin_mean': 35.627044677734375, 'epsilon_dpo/beta_margin_mean': 2.20247745513916, 'epsilon_dpo/beta_margin_std': 1.9251240491867065, 
'epsilon_dpo/beta_margin_grad_mean': -0.19430118799209595, 'epsilon_dpo/beta_margin_grad_std': 0.2106785923242569, 'kl/beta': 0.06249535083770752, 'kl/avg_steps': 0.78125, 'epoch': 0.11} 11%|████████████▊ | 76/681 [03:14<25:23, 2.52s/it] 11%|█████████████ | 77/681 [03:16<24:51, 2.47s/it] {'loss': 0.5577, 'grad_norm': 109.90064239501953, 'learning_rate': 4.998386175651409e-07, 'rewards/chosen': -0.8574961423873901, 'rewards/rejected': -3.405547618865967, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.548051357269287, 'logps/chosen': -57.969879150390625, 'logps/rejected': -154.14987182617188, 'logps/ref_chosen': -44.09752655029297, 'logps/ref_rejected': -98.75190734863281, 'logits/chosen': -3.159731864929199, 'logits/rejected': -3.2670416831970215, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.06151320040225983, 'epsilon_dpo/loss_margin_mean': 41.525611877441406, 'epsilon_dpo/beta_margin_mean': 2.548051595687866, 'epsilon_dpo/beta_margin_std': 2.222637414932251, 'epsilon_dpo/beta_margin_grad_mean': -0.16589321196079254, 'epsilon_dpo/beta_margin_grad_std': 0.19776158034801483, 'kl/beta': 0.06201088801026344, 'kl/avg_steps': 0.8125, 'epoch': 0.11} 11%|█████████████ | 77/681 [03:17<24:51, 2.47s/it] 11%|█████████████▏ | 78/681 [03:19<25:11, 2.51s/it] {'loss': 0.5417, 'grad_norm': 88.5641860961914, 'learning_rate': 4.997892217220159e-07, 'rewards/chosen': -0.8202856779098511, 'rewards/rejected': -2.9441146850585938, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.123828887939453, 'logps/chosen': -52.061004638671875, 'logps/rejected': -139.2445068359375, 'logps/ref_chosen': -38.710906982421875, 'logps/ref_rejected': -91.00759887695312, 'logits/chosen': -3.2637405395507812, 'logits/rejected': -3.250288963317871, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.06109432876110077, 'epsilon_dpo/loss_margin_mean': 34.886810302734375, 'epsilon_dpo/beta_margin_mean': 2.1238291263580322, 
'epsilon_dpo/beta_margin_std': 1.9207454919815063, 'epsilon_dpo/beta_margin_grad_mean': -0.19816353917121887, 'epsilon_dpo/beta_margin_grad_std': 0.19675467908382416, 'kl/beta': 0.06151111051440239, 'kl/avg_steps': 0.6875, 'epoch': 0.11} 11%|█████████████▏ | 78/681 [03:19<25:11, 2.51s/it] 12%|█████████████▎ | 79/681 [03:22<25:17, 2.52s/it] {'loss': 0.4274, 'grad_norm': 86.2493667602539, 'learning_rate': 4.997332437005931e-07, 'rewards/chosen': -0.6022886633872986, 'rewards/rejected': -3.202016592025757, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.5997278690338135, 'logps/chosen': -42.79743957519531, 'logps/rejected': -148.58889770507812, 'logps/ref_chosen': -32.905845642089844, 'logps/ref_rejected': -95.70394897460938, 'logits/chosen': -3.1998233795166016, 'logits/rejected': -3.2696633338928223, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.060581713914871216, 'epsilon_dpo/loss_margin_mean': 42.99335479736328, 'epsilon_dpo/beta_margin_mean': 2.5997278690338135, 'epsilon_dpo/beta_margin_std': 2.096586227416992, 'epsilon_dpo/beta_margin_grad_mean': -0.1641714721918106, 'epsilon_dpo/beta_margin_grad_std': 0.17812295258045197, 'kl/beta': 0.0610911101102829, 'kl/avg_steps': 0.84375, 'epoch': 0.12} 12%|█████████████▎ | 79/681 [03:22<25:17, 2.52s/it] 12%|█████████████▌ | 80/681 [03:24<25:14, 2.52s/it] {'loss': 0.6202, 'grad_norm': 94.2277603149414, 'learning_rate': 4.996706849759452e-07, 'rewards/chosen': -0.8336386680603027, 'rewards/rejected': -3.2143394947052, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.3807008266448975, 'logps/chosen': -55.867835998535156, 'logps/rejected': -147.4423828125, 'logps/ref_chosen': -42.08654022216797, 'logps/ref_rejected': -93.93815612792969, 'logits/chosen': -3.2733001708984375, 'logits/rejected': -3.3127360343933105, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.06015056371688843, 'epsilon_dpo/loss_margin_mean': 39.722938537597656, 
'epsilon_dpo/beta_margin_mean': 2.3807008266448975, 'epsilon_dpo/beta_margin_std': 2.437171220779419, 'epsilon_dpo/beta_margin_grad_mean': -0.20884603261947632, 'epsilon_dpo/beta_margin_grad_std': 0.22885563969612122, 'kl/beta': 0.06057996675372124, 'kl/avg_steps': 0.71875, 'epoch': 0.12} 12%|█████████████▌ | 80/681 [03:24<25:14, 2.52s/it] 12%|█████████████▋ | 81/681 [03:27<25:59, 2.60s/it] {'loss': 0.4684, 'grad_norm': 70.84126281738281, 'learning_rate': 4.996015471965529e-07, 'rewards/chosen': -0.7309558391571045, 'rewards/rejected': -3.3320164680480957, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.601060628890991, 'logps/chosen': -52.02825927734375, 'logps/rejected': -188.8577117919922, 'logps/ref_chosen': -39.808433532714844, 'logps/ref_rejected': -132.9473876953125, 'logits/chosen': -3.1918153762817383, 'logits/rejected': -3.2612764835357666, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.05966491997241974, 'epsilon_dpo/loss_margin_mean': 43.69050598144531, 'epsilon_dpo/beta_margin_mean': 2.601060628890991, 'epsilon_dpo/beta_margin_std': 2.4161243438720703, 'epsilon_dpo/beta_margin_grad_mean': -0.17548146843910217, 'epsilon_dpo/beta_margin_grad_std': 0.19263611733913422, 'kl/beta': 0.06014765426516533, 'kl/avg_steps': 0.8125, 'epoch': 0.12} 12%|█████████████▋ | 81/681 [03:27<25:59, 2.60s/it] 12%|█████████████▊ | 82/681 [03:29<25:42, 2.57s/it] {'loss': 0.6066, 'grad_norm': 98.73262023925781, 'learning_rate': 4.995258321842611e-07, 'rewards/chosen': -0.8024958372116089, 'rewards/rejected': -3.178739547729492, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.376243829727173, 'logps/chosen': -46.98686218261719, 'logps/rejected': -150.47857666015625, 'logps/ref_chosen': -33.495845794677734, 'logps/ref_rejected': -96.71635437011719, 'logits/chosen': -3.1882076263427734, 'logits/rejected': -3.236912488937378, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.05922134220600128, 
'epsilon_dpo/loss_margin_mean': 40.271209716796875, 'epsilon_dpo/beta_margin_mean': 2.376243829727173, 'epsilon_dpo/beta_margin_std': 2.1569864749908447, 'epsilon_dpo/beta_margin_grad_mean': -0.18756245076656342, 'epsilon_dpo/beta_margin_grad_std': 0.23345990478992462, 'kl/beta': 0.059662893414497375, 'kl/avg_steps': 0.75, 'epoch': 0.12} 12%|█████████████▊ | 82/681 [03:30<25:42, 2.57s/it] 12%|██████████████ | 83/681 [03:32<25:03, 2.51s/it] {'loss': 0.424, 'grad_norm': 67.05672454833984, 'learning_rate': 4.994435419342304e-07, 'rewards/chosen': -0.6363058090209961, 'rewards/rejected': -3.1005678176879883, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.464262008666992, 'logps/chosen': -36.729583740234375, 'logps/rejected': -161.15139770507812, 'logps/ref_chosen': -25.916236877441406, 'logps/ref_rejected': -108.29981994628906, 'logits/chosen': -3.163329839706421, 'logits/rejected': -3.297056198120117, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.05872496962547302, 'epsilon_dpo/loss_margin_mean': 42.03822326660156, 'epsilon_dpo/beta_margin_mean': 2.464262008666992, 'epsilon_dpo/beta_margin_std': 1.7969151735305786, 'epsilon_dpo/beta_margin_grad_mean': -0.16398201882839203, 'epsilon_dpo/beta_margin_grad_std': 0.18266217410564423, 'kl/beta': 0.05921875312924385, 'kl/avg_steps': 0.84375, 'epoch': 0.12} 12%|██████████████ | 83/681 [03:32<25:03, 2.51s/it] 12%|██████████████▏ | 84/681 [03:34<25:19, 2.55s/it] {'loss': 0.4147, 'grad_norm': 68.96977996826172, 'learning_rate': 4.993546786148857e-07, 'rewards/chosen': -0.8531267642974854, 'rewards/rejected': -3.1109256744384766, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.2577991485595703, 'logps/chosen': -51.26725769042969, 'logps/rejected': -148.75570678710938, 'logps/ref_chosen': -36.62953567504883, 'logps/ref_rejected': -95.27814483642578, 'logits/chosen': -3.2171850204467773, 'logits/rejected': -3.2014272212982178, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 
0.0625, 'epsilon_dpo/beta': 0.058215267956256866, 'epsilon_dpo/loss_margin_mean': 38.83984375, 'epsilon_dpo/beta_margin_mean': 2.257798910140991, 'epsilon_dpo/beta_margin_std': 1.5832382440567017, 'epsilon_dpo/beta_margin_grad_mean': -0.1662423461675644, 'epsilon_dpo/beta_margin_grad_std': 0.16180217266082764, 'kl/beta': 0.05872327461838722, 'kl/avg_steps': 0.875, 'epoch': 0.12} 12%|██████████████▏ | 84/681 [03:34<25:19, 2.55s/it] 12%|██████████████▎ | 85/681 [03:37<25:23, 2.56s/it] {'loss': 0.564, 'grad_norm': 81.10045623779297, 'learning_rate': 4.992592445678582e-07, 'rewards/chosen': -0.9453153610229492, 'rewards/rejected': -2.953948497772217, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.0086328983306885, 'logps/chosen': -59.8691291809082, 'logps/rejected': -135.07781982421875, 'logps/ref_chosen': -43.555397033691406, 'logps/ref_rejected': -83.9044418334961, 'logits/chosen': -3.2968966960906982, 'logits/rejected': -3.2394418716430664, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.05781027674674988, 'epsilon_dpo/loss_margin_mean': 34.859657287597656, 'epsilon_dpo/beta_margin_mean': 2.0086328983306885, 'epsilon_dpo/beta_margin_std': 1.7025460004806519, 'epsilon_dpo/beta_margin_grad_mean': -0.20659653842449188, 'epsilon_dpo/beta_margin_grad_std': 0.2083936482667923, 'kl/beta': 0.05821390450000763, 'kl/avg_steps': 0.703125, 'epoch': 0.12} 12%|██████████████▎ | 85/681 [03:37<25:23, 2.56s/it] 13%|██████████████▌ | 86/681 [03:40<25:40, 2.59s/it] {'loss': 0.6967, 'grad_norm': 107.6495132446289, 'learning_rate': 4.991572423079235e-07, 'rewards/chosen': -0.9202961921691895, 'rewards/rejected': -3.0485289096832275, 'rewards/accuracies': 0.875, 'rewards/margins': 2.128232479095459, 'logps/chosen': -54.787696838378906, 'logps/rejected': -145.7591094970703, 'logps/ref_chosen': -38.846839904785156, 'logps/ref_rejected': -92.59671020507812, 'logits/chosen': -3.272704601287842, 'logits/rejected': -3.2372689247131348, 
'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.057433828711509705, 'epsilon_dpo/loss_margin_mean': 37.22154998779297, 'epsilon_dpo/beta_margin_mean': 2.128232479095459, 'epsilon_dpo/beta_margin_std': 2.417414903640747, 'epsilon_dpo/beta_margin_grad_mean': -0.2296195775270462, 'epsilon_dpo/beta_margin_grad_std': 0.22347640991210938, 'kl/beta': 0.05780744552612305, 'kl/avg_steps': 0.65625, 'epoch': 0.13} 13%|██████████████▌ | 86/681 [03:40<25:40, 2.59s/it] 13%|██████████████▋ | 87/681 [03:42<25:18, 2.56s/it] {'loss': 0.4717, 'grad_norm': 59.761165618896484, 'learning_rate': 4.990486745229364e-07, 'rewards/chosen': -0.8040882349014282, 'rewards/rejected': -3.161027193069458, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.3569388389587402, 'logps/chosen': -52.74066162109375, 'logps/rejected': -157.99281311035156, 'logps/ref_chosen': -38.653236389160156, 'logps/ref_rejected': -102.44976043701172, 'logits/chosen': -3.230192184448242, 'logits/rejected': -3.3376736640930176, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.05698757991194725, 'epsilon_dpo/loss_margin_mean': 41.45563507080078, 'epsilon_dpo/beta_margin_mean': 2.3569390773773193, 'epsilon_dpo/beta_margin_std': 1.9608224630355835, 'epsilon_dpo/beta_margin_grad_mean': -0.18504740297794342, 'epsilon_dpo/beta_margin_grad_std': 0.18343256413936615, 'kl/beta': 0.057430557906627655, 'kl/avg_steps': 0.78125, 'epoch': 0.13} 13%|██████████████▋ | 87/681 [03:42<25:18, 2.56s/it] 13%|██████████████▊ | 88/681 [03:45<25:24, 2.57s/it] {'loss': 0.4141, 'grad_norm': 58.7109375, 'learning_rate': 4.989335440737586e-07, 'rewards/chosen': -0.7879468202590942, 'rewards/rejected': -3.3477907180786133, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.5598437786102295, 'logps/chosen': -51.133506774902344, 'logps/rejected': -175.40699768066406, 'logps/ref_chosen': -37.23695373535156, 'logps/ref_rejected': -116.12947082519531, 'logits/chosen': 
-3.27421236038208, 'logits/rejected': -3.3343310356140137, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.05656362697482109, 'epsilon_dpo/loss_margin_mean': 45.38097381591797, 'epsilon_dpo/beta_margin_mean': 2.5598440170288086, 'epsilon_dpo/beta_margin_std': 1.910704493522644, 'epsilon_dpo/beta_margin_grad_mean': -0.1606673151254654, 'epsilon_dpo/beta_margin_grad_std': 0.18233220279216766, 'kl/beta': 0.05698535963892937, 'kl/avg_steps': 0.75, 'epoch': 0.13} 13%|██████████████▊ | 88/681 [03:45<25:24, 2.57s/it] 13%|███████████████ | 89/681 [03:47<25:01, 2.54s/it] {'loss': 0.4951, 'grad_norm': 69.53501892089844, 'learning_rate': 4.988118539941847e-07, 'rewards/chosen': -0.7197608947753906, 'rewards/rejected': -2.946448802947998, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.2266876697540283, 'logps/chosen': -52.141876220703125, 'logps/rejected': -140.5678253173828, 'logps/ref_chosen': -39.35747146606445, 'logps/ref_rejected': -88.01043701171875, 'logits/chosen': -3.2087016105651855, 'logits/rejected': -3.23443603515625, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.05614255368709564, 'epsilon_dpo/loss_margin_mean': 39.772979736328125, 'epsilon_dpo/beta_margin_mean': 2.2266876697540283, 'epsilon_dpo/beta_margin_std': 1.9912763833999634, 'epsilon_dpo/beta_margin_grad_mean': -0.18962043523788452, 'epsilon_dpo/beta_margin_grad_std': 0.18511323630809784, 'kl/beta': 0.056561149656772614, 'kl/avg_steps': 0.75, 'epoch': 0.13} 13%|███████████████ | 89/681 [03:47<25:01, 2.54s/it] 13%|███████████████▏ | 90/681 [03:50<24:38, 2.50s/it] {'loss': 0.5447, 'grad_norm': 91.28634643554688, 'learning_rate': 4.986836074908615e-07, 'rewards/chosen': -0.9231457710266113, 'rewards/rejected': -3.432866096496582, 'rewards/accuracies': 0.875, 'rewards/margins': 2.5097203254699707, 'logps/chosen': -46.801185607910156, 'logps/rejected': -181.02479553222656, 'logps/ref_chosen': -30.30811882019043, 'logps/ref_rejected': 
-119.35741424560547, 'logits/chosen': -3.129134178161621, 'logits/rejected': -3.3651089668273926, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.055777255445718765, 'epsilon_dpo/loss_margin_mean': 45.174312591552734, 'epsilon_dpo/beta_margin_mean': 2.5097200870513916, 'epsilon_dpo/beta_margin_std': 2.4309592247009277, 'epsilon_dpo/beta_margin_grad_mean': -0.19303679466247559, 'epsilon_dpo/beta_margin_grad_std': 0.21628044545650482, 'kl/beta': 0.056140098720788956, 'kl/avg_steps': 0.65625, 'epoch': 0.13} 13%|███████████████▏ | 90/681 [03:50<24:38, 2.50s/it] 13%|███████████████▎ | 91/681 [03:52<24:26, 2.49s/it] {'loss': 0.4213, 'grad_norm': 68.51940155029297, 'learning_rate': 4.985488079432037e-07, 'rewards/chosen': -0.8394383192062378, 'rewards/rejected': -3.1970739364624023, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.357635736465454, 'logps/chosen': -50.61489486694336, 'logps/rejected': -151.99273681640625, 'logps/ref_chosen': -35.484596252441406, 'logps/ref_rejected': -94.16378784179688, 'logits/chosen': -3.2310478687286377, 'logits/rejected': -3.2350800037384033, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.05532645061612129, 'epsilon_dpo/loss_margin_mean': 42.69865036010742, 'epsilon_dpo/beta_margin_mean': 2.357635736465454, 'epsilon_dpo/beta_margin_std': 1.7421871423721313, 'epsilon_dpo/beta_margin_grad_mean': -0.16726556420326233, 'epsilon_dpo/beta_margin_grad_std': 0.17135490477085114, 'kl/beta': 0.055774081498384476, 'kl/avg_steps': 0.8125, 'epoch': 0.13} 13%|███████████████▎ | 91/681 [03:52<24:26, 2.49s/it] 14%|███████████████▌ | 92/681 [03:55<24:29, 2.49s/it] {'loss': 0.5756, 'grad_norm': 87.93189239501953, 'learning_rate': 4.984074589033043e-07, 'rewards/chosen': -0.808807373046875, 'rewards/rejected': -3.0445470809936523, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.2357397079467773, 'logps/chosen': -52.659706115722656, 'logps/rejected': -139.8248291015625, 
'logps/ref_chosen': -37.970062255859375, 'logps/ref_rejected': -84.28839111328125, 'logits/chosen': -3.121596574783325, 'logits/rejected': -3.1575112342834473, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.05489783734083176, 'epsilon_dpo/loss_margin_mean': 40.8467903137207, 'epsilon_dpo/beta_margin_mean': 2.2357397079467773, 'epsilon_dpo/beta_margin_std': 2.0685596466064453, 'epsilon_dpo/beta_margin_grad_mean': -0.2024046629667282, 'epsilon_dpo/beta_margin_grad_std': 0.220162495970726, 'kl/beta': 0.05532456934452057, 'kl/avg_steps': 0.78125, 'epoch': 0.14} 14%|███████████████▌ | 92/681 [03:55<24:29, 2.49s/it] 14%|███████████████▋ | 93/681 [03:57<23:43, 2.42s/it] {'loss': 0.4779, 'grad_norm': 66.98503875732422, 'learning_rate': 4.982595640958425e-07, 'rewards/chosen': -0.88683021068573, 'rewards/rejected': -3.1384482383728027, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.251617908477783, 'logps/chosen': -51.651126861572266, 'logps/rejected': -139.5394287109375, 'logps/ref_chosen': -35.3890266418457, 'logps/ref_rejected': -81.84159851074219, 'logits/chosen': -3.2347559928894043, 'logits/rejected': -3.2382359504699707, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.05447227507829666, 'epsilon_dpo/loss_margin_mean': 41.435726165771484, 'epsilon_dpo/beta_margin_mean': 2.251617908477783, 'epsilon_dpo/beta_margin_std': 1.9179073572158813, 'epsilon_dpo/beta_margin_grad_mean': -0.18667982518672943, 'epsilon_dpo/beta_margin_grad_std': 0.18083734810352325, 'kl/beta': 0.05489569902420044, 'kl/avg_steps': 0.78125, 'epoch': 0.14} 14%|███████████████▋ | 93/681 [03:57<23:43, 2.42s/it] 14%|███████████████▊ | 94/681 [04:00<24:23, 2.49s/it] {'loss': 0.3679, 'grad_norm': 63.9381103515625, 'learning_rate': 4.98105127417984e-07, 'rewards/chosen': -0.7756279706954956, 'rewards/rejected': -3.350586414337158, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.574958324432373, 'logps/chosen': 
-53.30908203125, 'logps/rejected': -168.24609375, 'logps/ref_chosen': -38.974853515625, 'logps/ref_rejected': -106.16789245605469, 'logits/chosen': -3.194207191467285, 'logits/rejected': -3.2492258548736572, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.05399893596768379, 'epsilon_dpo/loss_margin_mean': 47.74397659301758, 'epsilon_dpo/beta_margin_mean': 2.574958562850952, 'epsilon_dpo/beta_margin_std': 1.9459197521209717, 'epsilon_dpo/beta_margin_grad_mean': -0.15303176641464233, 'epsilon_dpo/beta_margin_grad_std': 0.14751987159252167, 'kl/beta': 0.05447014793753624, 'kl/avg_steps': 0.875, 'epoch': 0.14} 14%|███████████████▊ | 94/681 [04:00<24:23, 2.49s/it] 14%|████████████████ | 95/681 [04:02<24:32, 2.51s/it] {'loss': 0.4364, 'grad_norm': 63.63764572143555, 'learning_rate': 4.979441529392784e-07, 'rewards/chosen': -0.6584010124206543, 'rewards/rejected': -2.824158191680908, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.165757417678833, 'logps/chosen': -41.924564361572266, 'logps/rejected': -133.45326232910156, 'logps/ref_chosen': -29.644317626953125, 'logps/ref_rejected': -80.65695190429688, 'logits/chosen': -3.180877208709717, 'logits/rejected': -3.2767512798309326, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.05351366475224495, 'epsilon_dpo/loss_margin_mean': 40.51607131958008, 'epsilon_dpo/beta_margin_mean': 2.165757417678833, 'epsilon_dpo/beta_margin_std': 1.6699368953704834, 'epsilon_dpo/beta_margin_grad_mean': -0.176425963640213, 'epsilon_dpo/beta_margin_grad_std': 0.15948951244354248, 'kl/beta': 0.053997669368982315, 'kl/avg_steps': 0.90625, 'epoch': 0.14} 14%|████████████████ | 95/681 [04:02<24:32, 2.51s/it] 14%|████████████████▏ | 96/681 [04:05<24:39, 2.53s/it] {'loss': 0.4965, 'grad_norm': 85.53929901123047, 'learning_rate': 4.977766449015534e-07, 'rewards/chosen': -0.8906632661819458, 'rewards/rejected': -3.444911479949951, 'rewards/accuracies': 0.90625, 
'rewards/margins': 2.554248094558716, 'logps/chosen': -65.12588500976562, 'logps/rejected': -167.49996948242188, 'logps/ref_chosen': -48.42084884643555, 'logps/ref_rejected': -102.56741333007812, 'logits/chosen': -3.2188198566436768, 'logits/rejected': -3.264829635620117, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.05311667546629906, 'epsilon_dpo/loss_margin_mean': 48.22751998901367, 'epsilon_dpo/beta_margin_mean': 2.554248094558716, 'epsilon_dpo/beta_margin_std': 2.2736546993255615, 'epsilon_dpo/beta_margin_grad_mean': -0.18641680479049683, 'epsilon_dpo/beta_margin_grad_std': 0.20043620467185974, 'kl/beta': 0.05351271107792854, 'kl/avg_steps': 0.75, 'epoch': 0.14} 14%|████████████████▏ | 96/681 [04:05<24:39, 2.53s/it] 14%|████████████████▍ | 97/681 [04:07<25:21, 2.61s/it] {'loss': 0.3851, 'grad_norm': 52.84062194824219, 'learning_rate': 4.976026077188012e-07, 'rewards/chosen': -0.6797794103622437, 'rewards/rejected': -3.2887463569641113, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.6089670658111572, 'logps/chosen': -47.99535369873047, 'logps/rejected': -145.8551025390625, 'logps/ref_chosen': -35.1164436340332, 'logps/ref_rejected': -83.36341857910156, 'logits/chosen': -3.178095817565918, 'logits/rejected': -3.169985294342041, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.05267146974802017, 'epsilon_dpo/loss_margin_mean': 49.61277770996094, 'epsilon_dpo/beta_margin_mean': 2.608966827392578, 'epsilon_dpo/beta_margin_std': 1.9469996690750122, 'epsilon_dpo/beta_margin_grad_mean': -0.15445828437805176, 'epsilon_dpo/beta_margin_grad_std': 0.16520872712135315, 'kl/beta': 0.053114354610443115, 'kl/avg_steps': 0.84375, 'epoch': 0.14} 14%|████████████████▍ | 97/681 [04:07<25:21, 2.61s/it] 14%|████████████████▌ | 98/681 [04:10<24:53, 2.56s/it] {'loss': 0.5783, 'grad_norm': 101.56369018554688, 'learning_rate': 4.974220459770639e-07, 'rewards/chosen': -1.079300880432129, 'rewards/rejected': 
-3.3333208560943604, 'rewards/accuracies': 0.875, 'rewards/margins': 2.2540199756622314, 'logps/chosen': -65.423828125, 'logps/rejected': -165.55812072753906, 'logps/ref_chosen': -44.868499755859375, 'logps/ref_rejected': -101.7425537109375, 'logits/chosen': -3.108753204345703, 'logits/rejected': -3.2454357147216797, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.05228015407919884, 'epsilon_dpo/loss_margin_mean': 43.26023864746094, 'epsilon_dpo/beta_margin_mean': 2.2540199756622314, 'epsilon_dpo/beta_margin_std': 1.9207807779312134, 'epsilon_dpo/beta_margin_grad_mean': -0.19064751267433167, 'epsilon_dpo/beta_margin_grad_std': 0.22201648354530334, 'kl/beta': 0.0526699498295784, 'kl/avg_steps': 0.75, 'epoch': 0.14} 14%|████████████████▌ | 98/681 [04:10<24:53, 2.56s/it] 15%|████████████████▋ | 99/681 [04:12<24:22, 2.51s/it] {'loss': 0.3636, 'grad_norm': 56.95939254760742, 'learning_rate': 4.972349644343108e-07, 'rewards/chosen': -0.7500526905059814, 'rewards/rejected': -3.516953945159912, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.7669014930725098, 'logps/chosen': -40.343666076660156, 'logps/rejected': -158.97531127929688, 'logps/ref_chosen': -25.89197540283203, 'logps/ref_rejected': -91.06307983398438, 'logits/chosen': -3.0614562034606934, 'logits/rejected': -3.1564629077911377, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.051825616508722305, 'epsilon_dpo/loss_margin_mean': 53.46054458618164, 'epsilon_dpo/beta_margin_mean': 2.7669012546539307, 'epsilon_dpo/beta_margin_std': 2.086336135864258, 'epsilon_dpo/beta_margin_grad_mean': -0.1456242799758911, 'epsilon_dpo/beta_margin_grad_std': 0.16527007520198822, 'kl/beta': 0.05227786675095558, 'kl/avg_steps': 0.875, 'epoch': 0.15} 15%|████████████████▋ | 99/681 [04:12<24:22, 2.51s/it] 15%|████████████████▋ | 100/681 [04:15<23:47, 2.46s/it] {'loss': 0.6663, 'grad_norm': 109.34486389160156, 'learning_rate': 4.970413680203148e-07, 'rewards/chosen': 
-0.9696159362792969, 'rewards/rejected': -2.9393372535705566, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.9697211980819702, 'logps/chosen': -60.350921630859375, 'logps/rejected': -136.6974334716797, 'logps/ref_chosen': -41.60627746582031, 'logps/ref_rejected': -79.52035522460938, 'logits/chosen': -3.0767757892608643, 'logits/rejected': -3.0463757514953613, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.05150564759969711, 'epsilon_dpo/loss_margin_mean': 38.432437896728516, 'epsilon_dpo/beta_margin_mean': 1.9697211980819702, 'epsilon_dpo/beta_margin_std': 1.9218071699142456, 'epsilon_dpo/beta_margin_grad_mean': -0.23118621110916138, 'epsilon_dpo/beta_margin_grad_std': 0.23917478322982788, 'kl/beta': 0.05182440206408501, 'kl/avg_steps': 0.625, 'epoch': 0.15} 15%|████████████████▋ | 100/681 [04:15<23:47, 2.46s/it][INFO|trainer.py:4307] 2026-04-18 09:35:37,096 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 09:35:37,096 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 09:35:37,096 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 09:40:25,243 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 09:40:25,243 >> Batch size = 8 0%| | 0/73 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-200 [INFO|configuration_utils.py:419] 2026-04-18 09:41:20,296 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-200/config.json [INFO|configuration_utils.py:911] 2026-04-18 09:41:20,394 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-200/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-18 09:42:14,171 >> The model is bigger than the maximum size 
per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-200/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-18 09:42:14,185 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-200/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-18 09:42:14,190 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-200/special_tokens_map.json 30%|████████████████████████████████▊ | 201/681 [14:16<12:45:53, 95.74s/it] {'loss': 0.4443, 'grad_norm': 77.21038055419922, 'learning_rate': 4.455721242469372e-07, 'rewards/chosen': -1.7362768650054932, 'rewards/rejected': -4.34990119934082, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.613624095916748, 'logps/chosen': -133.12782287597656, 'logps/rejected': -316.4865417480469, 'logps/ref_chosen': -55.676116943359375, 'logps/ref_rejected': -121.86392974853516, 'logits/chosen': -2.2331690788269043, 'logits/rejected': -2.180114269256592, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.02235795184969902, 'epsilon_dpo/loss_margin_mean': 117.17089080810547, 'epsilon_dpo/beta_margin_mean': 2.613624095916748, 'epsilon_dpo/beta_margin_std': 1.999254584312439, 'epsilon_dpo/beta_margin_grad_mean': -0.1633475422859192, 'epsilon_dpo/beta_margin_grad_std': 0.2009790688753128, 'kl/beta': 0.022538842633366585, 'kl/avg_steps': 0.8125, 'epoch': 0.3} 30%|████████████████████████████████▊ | 201/681 [14:16<12:45:53, 95.74s/it] 30%|█████████████████████████████████▏ | 202/681 [14:19<9:01:16, 67.80s/it] {'loss': 0.4009, 'grad_norm': 63.114845275878906, 
'learning_rate': 4.4477014363141755e-07, 'rewards/chosen': -1.8372435569763184, 'rewards/rejected': -4.240664482116699, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.403420925140381, 'logps/chosen': -113.52767181396484, 'logps/rejected': -284.8798828125, 'logps/ref_chosen': -30.73172378540039, 'logps/ref_rejected': -93.48927307128906, 'logits/chosen': -1.9032566547393799, 'logits/rejected': -2.0256400108337402, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.022170767188072205, 'epsilon_dpo/loss_margin_mean': 108.59467315673828, 'epsilon_dpo/beta_margin_mean': 2.40342116355896, 'epsilon_dpo/beta_margin_std': 1.810045599937439, 'epsilon_dpo/beta_margin_grad_mean': -0.16092374920845032, 'epsilon_dpo/beta_margin_grad_std': 0.16366447508335114, 'kl/beta': 0.022357190027832985, 'kl/avg_steps': 0.84375, 'epoch': 0.3} 30%|█████████████████████████████████▏ | 202/681 [14:19<9:01:16, 67.80s/it] 30%|█████████████████████████████████▍ | 203/681 [14:21<6:24:19, 48.24s/it] {'loss': 0.3417, 'grad_norm': 45.17288589477539, 'learning_rate': 4.439630306414758e-07, 'rewards/chosen': -1.5144764184951782, 'rewards/rejected': -4.293052673339844, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.778576612472534, 'logps/chosen': -107.68612670898438, 'logps/rejected': -287.6023864746094, 'logps/ref_chosen': -38.8436393737793, 'logps/ref_rejected': -92.13667297363281, 'logits/chosen': -2.110769748687744, 'logits/rejected': -2.13863468170166, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.021971412003040314, 'epsilon_dpo/loss_margin_mean': 126.62322998046875, 'epsilon_dpo/beta_margin_mean': 2.778576612472534, 'epsilon_dpo/beta_margin_std': 1.7826684713363647, 'epsilon_dpo/beta_margin_grad_mean': -0.12936589121818542, 'epsilon_dpo/beta_margin_grad_std': 0.16617560386657715, 'kl/beta': 0.022170130163431168, 'kl/avg_steps': 0.90625, 'epoch': 0.3} 30%|█████████████████████████████████▍ | 203/681 [14:21<6:24:19, 
48.24s/it] 30%|█████████████████████████████████▌ | 204/681 [14:24<4:35:10, 34.61s/it] {'loss': 0.3937, 'grad_norm': 67.1108169555664, 'learning_rate': 4.431508065452897e-07, 'rewards/chosen': -1.7504637241363525, 'rewards/rejected': -4.320202350616455, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.5697386264801025, 'logps/chosen': -135.9166259765625, 'logps/rejected': -292.0838623046875, 'logps/ref_chosen': -55.713932037353516, 'logps/ref_rejected': -93.70796203613281, 'logits/chosen': -2.281491756439209, 'logits/rejected': -2.114370822906494, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.02179468236863613, 'epsilon_dpo/loss_margin_mean': 118.17318725585938, 'epsilon_dpo/beta_margin_mean': 2.5697386264801025, 'epsilon_dpo/beta_margin_std': 1.7625229358673096, 'epsilon_dpo/beta_margin_grad_mean': -0.14814503490924835, 'epsilon_dpo/beta_margin_grad_std': 0.18509677052497864, 'kl/beta': 0.021971017122268677, 'kl/avg_steps': 0.8125, 'epoch': 0.3} 30%|█████████████████████████████████▌ | 204/681 [14:24<4:35:10, 34.61s/it] 30%|█████████████████████████████████▋ | 205/681 [14:27<3:18:23, 25.01s/it] {'loss': 0.3485, 'grad_norm': 67.83074951171875, 'learning_rate': 4.4233349274571974e-07, 'rewards/chosen': -1.872864007949829, 'rewards/rejected': -4.799254417419434, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.9263906478881836, 'logps/chosen': -121.29883575439453, 'logps/rejected': -314.7345886230469, 'logps/ref_chosen': -34.816200256347656, 'logps/ref_rejected': -92.58261108398438, 'logits/chosen': -2.015986680984497, 'logits/rejected': -1.8826128244400024, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.021612215787172318, 'epsilon_dpo/loss_margin_mean': 135.66934204101562, 'epsilon_dpo/beta_margin_mean': 2.9263906478881836, 'epsilon_dpo/beta_margin_std': 1.8439240455627441, 'epsilon_dpo/beta_margin_grad_mean': -0.12213469296693802, 'epsilon_dpo/beta_margin_grad_std': 
0.18215671181678772, 'kl/beta': 0.021793941035866737, 'kl/avg_steps': 0.84375, 'epoch': 0.3} 30%|█████████████████████████████████▋ | 205/681 [14:27<3:18:23, 25.01s/it] 30%|█████████████████████████████████▉ | 206/681 [14:29<2:24:19, 18.23s/it] {'loss': 0.4266, 'grad_norm': 70.67012023925781, 'learning_rate': 4.415111107797445e-07, 'rewards/chosen': -1.8712209463119507, 'rewards/rejected': -4.5623250007629395, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.6911041736602783, 'logps/chosen': -117.2315673828125, 'logps/rejected': -316.34771728515625, 'logps/ref_chosen': -30.099918365478516, 'logps/ref_rejected': -103.39237976074219, 'logits/chosen': -1.9591203927993774, 'logits/rejected': -1.9519094228744507, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.021431388333439827, 'epsilon_dpo/loss_margin_mean': 125.82368469238281, 'epsilon_dpo/beta_margin_mean': 2.6911041736602783, 'epsilon_dpo/beta_margin_std': 1.9302713871002197, 'epsilon_dpo/beta_margin_grad_mean': -0.14491204917430878, 'epsilon_dpo/beta_margin_grad_std': 0.19017614424228668, 'kl/beta': 0.021611593663692474, 'kl/avg_steps': 0.84375, 'epoch': 0.3} 30%|█████████████████████████████████▉ | 206/681 [14:29<2:24:19, 18.23s/it] 30%|██████████████████████████████████ | 207/681 [14:32<1:46:56, 13.54s/it] {'loss': 0.3788, 'grad_norm': 59.43008041381836, 'learning_rate': 4.4068368231789365e-07, 'rewards/chosen': -1.4902305603027344, 'rewards/rejected': -4.307093143463135, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.8168625831604004, 'logps/chosen': -101.26658630371094, 'logps/rejected': -292.57171630859375, 'logps/ref_chosen': -31.34187889099121, 'logps/ref_rejected': -89.86247253417969, 'logits/chosen': -2.092212200164795, 'logits/rejected': -1.968192458152771, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.02126547135412693, 'epsilon_dpo/loss_margin_mean': 132.78456115722656, 'epsilon_dpo/beta_margin_mean': 
2.8168625831604004, 'epsilon_dpo/beta_margin_std': 1.992667317390442, 'epsilon_dpo/beta_margin_grad_mean': -0.1460837423801422, 'epsilon_dpo/beta_margin_grad_std': 0.1840200126171112, 'kl/beta': 0.02143077179789543, 'kl/avg_steps': 0.78125, 'epoch': 0.3} 30%|██████████████████████████████████ | 207/681 [14:32<1:46:56, 13.54s/it] 31%|██████████████████████████████████▏ | 208/681 [14:34<1:20:45, 10.24s/it] {'loss': 0.2919, 'grad_norm': 52.98038101196289, 'learning_rate': 4.398512291636768e-07, 'rewards/chosen': -1.7668159008026123, 'rewards/rejected': -4.634600639343262, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.8677845001220703, 'logps/chosen': -119.51541137695312, 'logps/rejected': -320.81878662109375, 'logps/ref_chosen': -35.819129943847656, 'logps/ref_rejected': -100.89794921875, 'logits/chosen': -2.082803726196289, 'logits/rejected': -2.1149425506591797, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.02108733169734478, 'epsilon_dpo/loss_margin_mean': 136.2245635986328, 'epsilon_dpo/beta_margin_mean': 2.8677847385406494, 'epsilon_dpo/beta_margin_std': 1.6818532943725586, 'epsilon_dpo/beta_margin_grad_mean': -0.1172553300857544, 'epsilon_dpo/beta_margin_grad_std': 0.15206165611743927, 'kl/beta': 0.021264642477035522, 'kl/avg_steps': 0.84375, 'epoch': 0.31} 31%|██████████████████████████████████▏ | 208/681 [14:34<1:20:45, 10.24s/it] 31%|██████████████████████████████████▎ | 209/681 [14:37<1:02:11, 7.91s/it] {'loss': 0.4441, 'grad_norm': 68.60906219482422, 'learning_rate': 4.3901377325300857e-07, 'rewards/chosen': -1.573714017868042, 'rewards/rejected': -4.144985198974609, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.5712709426879883, 'logps/chosen': -114.03466796875, 'logps/rejected': -285.0487365722656, 'logps/ref_chosen': -38.91720199584961, 'logps/ref_rejected': -86.70390319824219, 'logits/chosen': -2.075434446334839, 'logits/rejected': -1.8750395774841309, 'kl/p_epsilon_steps': 0.90625, 
'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.020917484536767006, 'epsilon_dpo/loss_margin_mean': 123.22737884521484, 'epsilon_dpo/beta_margin_mean': 2.5712711811065674, 'epsilon_dpo/beta_margin_std': 1.814591646194458, 'epsilon_dpo/beta_margin_grad_mean': -0.1551363617181778, 'epsilon_dpo/beta_margin_grad_std': 0.20075276494026184, 'kl/beta': 0.02108672261238098, 'kl/avg_steps': 0.8125, 'epoch': 0.31} 31%|██████████████████████████████████▎ | 209/681 [14:37<1:02:11, 7.91s/it] 31%|███████████████████████████████████▏ | 210/681 [14:39<49:17, 6.28s/it] {'loss': 0.32, 'grad_norm': 44.407615661621094, 'learning_rate': 4.381713366536311e-07, 'rewards/chosen': -1.6338069438934326, 'rewards/rejected': -4.250455856323242, 'rewards/accuracies': 0.984375, 'rewards/margins': 2.6166491508483887, 'logps/chosen': -108.2236099243164, 'logps/rejected': -290.24237060546875, 'logps/ref_chosen': -29.373889923095703, 'logps/ref_rejected': -85.03504943847656, 'logits/chosen': -2.012350082397461, 'logits/rejected': -1.9750981330871582, 'kl/p_epsilon_steps': 0.984375, 'kl/n_epsilon_steps': 0.015625, 'epsilon_dpo/beta': 0.020716214552521706, 'epsilon_dpo/loss_margin_mean': 126.35758972167969, 'epsilon_dpo/beta_margin_mean': 2.6166491508483887, 'epsilon_dpo/beta_margin_std': 1.6349241733551025, 'epsilon_dpo/beta_margin_grad_mean': -0.13410209119319916, 'epsilon_dpo/beta_margin_grad_std': 0.13863912224769592, 'kl/beta': 0.02091677300632, 'kl/avg_steps': 0.96875, 'epoch': 0.31} 31%|███████████████████████████████████▏ | 210/681 [14:39<49:17, 6.28s/it] 31%|███████████████████████████████████▎ | 211/681 [14:42<40:20, 5.15s/it] {'loss': 0.4024, 'grad_norm': 62.367835998535156, 'learning_rate': 4.373239415645323e-07, 'rewards/chosen': -1.7274291515350342, 'rewards/rejected': -4.268545150756836, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.541116237640381, 'logps/chosen': -131.88340759277344, 'logps/rejected': -301.1044921875, 'logps/ref_chosen': -47.88237380981445, 
'logps/ref_rejected': -93.18321228027344, 'logits/chosen': -2.1857690811157227, 'logits/rejected': -2.038249969482422, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.02053687535226345, 'epsilon_dpo/loss_margin_mean': 123.92025756835938, 'epsilon_dpo/beta_margin_mean': 2.54111647605896, 'epsilon_dpo/beta_margin_std': 1.720736026763916, 'epsilon_dpo/beta_margin_grad_mean': -0.1510934829711914, 'epsilon_dpo/beta_margin_grad_std': 0.18279674649238586, 'kl/beta': 0.02071608603000641, 'kl/avg_steps': 0.875, 'epoch': 0.31} 31%|███████████████████████████████████▎ | 211/681 [14:42<40:20, 5.15s/it] 31%|███████████████████████████████████▍ | 212/681 [14:44<34:16, 4.39s/it] {'loss': 0.3267, 'grad_norm': 56.73617935180664, 'learning_rate': 4.3647161031536086e-07, 'rewards/chosen': -1.4194426536560059, 'rewards/rejected': -4.518226623535156, 'rewards/accuracies': 0.96875, 'rewards/margins': 3.0987837314605713, 'logps/chosen': -105.17568969726562, 'logps/rejected': -333.01275634765625, 'logps/ref_chosen': -35.5427360534668, 'logps/ref_rejected': -110.93476867675781, 'logits/chosen': -2.1070785522460938, 'logits/rejected': -2.0808253288269043, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.02035231702029705, 'epsilon_dpo/loss_margin_mean': 152.44503784179688, 'epsilon_dpo/beta_margin_mean': 3.0987837314605713, 'epsilon_dpo/beta_margin_std': 2.040536403656006, 'epsilon_dpo/beta_margin_grad_mean': -0.11501824855804443, 'epsilon_dpo/beta_margin_grad_std': 0.16267718374729156, 'kl/beta': 0.0205363929271698, 'kl/avg_steps': 0.90625, 'epoch': 0.31} 31%|███████████████████████████████████▍ | 212/681 [14:44<34:16, 4.39s/it] 31%|███████████████████████████████████▋ | 213/681 [14:47<30:24, 3.90s/it] {'loss': 0.4116, 'grad_norm': 60.45912551879883, 'learning_rate': 4.3561436536583774e-07, 'rewards/chosen': -1.522735595703125, 'rewards/rejected': -4.057496070861816, 'rewards/accuracies': 0.921875, 'rewards/margins': 
2.5347602367401123, 'logps/chosen': -121.68977355957031, 'logps/rejected': -299.35791015625, 'logps/ref_chosen': -46.382476806640625, 'logps/ref_rejected': -98.22808074951172, 'logits/chosen': -2.2190451622009277, 'logits/rejected': -2.022303581237793, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.02018861286342144, 'epsilon_dpo/loss_margin_mean': 125.82254028320312, 'epsilon_dpo/beta_margin_mean': 2.5347602367401123, 'epsilon_dpo/beta_margin_std': 1.81096613407135, 'epsilon_dpo/beta_margin_grad_mean': -0.1576324999332428, 'epsilon_dpo/beta_margin_grad_std': 0.1839437037706375, 'kl/beta': 0.02035195380449295, 'kl/avg_steps': 0.8125, 'epoch': 0.31} 31%|███████████████████████████████████▋ | 213/681 [14:47<30:24, 3.90s/it] 31%|███████████████████████████████████▊ | 214/681 [14:49<26:53, 3.46s/it] {'loss': 0.5721, 'grad_norm': 80.94749450683594, 'learning_rate': 4.3475222930516473e-07, 'rewards/chosen': -1.683471441268921, 'rewards/rejected': -3.938481330871582, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.255009889602661, 'logps/chosen': -117.653564453125, 'logps/rejected': -280.5225830078125, 'logps/ref_chosen': -33.69921112060547, 'logps/ref_rejected': -83.6459732055664, 'logits/chosen': -1.925289273262024, 'logits/rejected': -1.9057788848876953, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.02001328393816948, 'epsilon_dpo/loss_margin_mean': 112.92225646972656, 'epsilon_dpo/beta_margin_mean': 2.255009889602661, 'epsilon_dpo/beta_margin_std': 1.9002856016159058, 'epsilon_dpo/beta_margin_grad_mean': -0.18691565096378326, 'epsilon_dpo/beta_margin_grad_std': 0.2173331379890442, 'kl/beta': 0.020187927410006523, 'kl/avg_steps': 0.875, 'epoch': 0.31} 31%|███████████████████████████████████▊ | 214/681 [14:50<26:53, 3.46s/it] 32%|███████████████████████████████████▉ | 215/681 [14:52<25:31, 3.29s/it] {'loss': 0.2971, 'grad_norm': 46.67967987060547, 'learning_rate': 4.3388522485142885e-07, 
'rewards/chosen': -1.5396337509155273, 'rewards/rejected': -4.214178562164307, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.6745448112487793, 'logps/chosen': -107.9367446899414, 'logps/rejected': -312.21636962890625, 'logps/ref_chosen': -30.35393714904785, 'logps/ref_rejected': -99.64697265625, 'logits/chosen': -2.086688756942749, 'logits/rejected': -2.0053024291992188, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.019833432510495186, 'epsilon_dpo/loss_margin_mean': 134.98655700683594, 'epsilon_dpo/beta_margin_mean': 2.6745448112487793, 'epsilon_dpo/beta_margin_std': 1.592254638671875, 'epsilon_dpo/beta_margin_grad_mean': -0.12425589561462402, 'epsilon_dpo/beta_margin_grad_std': 0.13742640614509583, 'kl/beta': 0.020012814551591873, 'kl/avg_steps': 0.90625, 'epoch': 0.32} 32%|███████████████████████████████████▉ | 215/681 [14:52<25:31, 3.29s/it] 32%|████████████████████████████████████▏ | 216/681 [14:55<23:48, 3.07s/it] {'loss': 0.4448, 'grad_norm': 60.99223709106445, 'learning_rate': 4.330133748510036e-07, 'rewards/chosen': -1.4557723999023438, 'rewards/rejected': -3.9574337005615234, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.5016613006591797, 'logps/chosen': -102.5655746459961, 'logps/rejected': -284.519287109375, 'logps/ref_chosen': -28.687610626220703, 'logps/ref_rejected': -83.20097351074219, 'logits/chosen': -2.0426881313323975, 'logits/rejected': -1.9267804622650146, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.019673900678753853, 'epsilon_dpo/loss_margin_mean': 127.44037628173828, 'epsilon_dpo/beta_margin_mean': 2.5016613006591797, 'epsilon_dpo/beta_margin_std': 1.8782446384429932, 'epsilon_dpo/beta_margin_grad_mean': -0.1697354018688202, 'epsilon_dpo/beta_margin_grad_std': 0.19105024635791779, 'kl/beta': 0.019833076745271683, 'kl/avg_steps': 0.8125, 'epoch': 0.32} 32%|████████████████████████████████████▏ | 216/681 [14:55<23:48, 3.07s/it] 
32%|████████████████████████████████████▎ | 217/681 [14:58<22:40, 2.93s/it] {'loss': 0.3023, 'grad_norm': 50.90324401855469, 'learning_rate': 4.3213670227794757e-07, 'rewards/chosen': -1.3119699954986572, 'rewards/rejected': -4.179328441619873, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.867358684539795, 'logps/chosen': -98.74182891845703, 'logps/rejected': -321.77667236328125, 'logps/ref_chosen': -31.528701782226562, 'logps/ref_rejected': -107.33251190185547, 'logits/chosen': -2.1286168098449707, 'logits/rejected': -2.087315559387207, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.019503042101860046, 'epsilon_dpo/loss_margin_mean': 147.23101806640625, 'epsilon_dpo/beta_margin_mean': 2.867358446121216, 'epsilon_dpo/beta_margin_std': 1.7631230354309082, 'epsilon_dpo/beta_margin_grad_mean': -0.12006211280822754, 'epsilon_dpo/beta_margin_grad_std': 0.1564583033323288, 'kl/beta': 0.01967323198914528, 'kl/avg_steps': 0.875, 'epoch': 0.32} 32%|████████████████████████████████████▎ | 217/681 [14:58<22:40, 2.93s/it] 32%|████████████████████████████████████▍ | 218/681 [15:00<21:12, 2.75s/it] {'loss': 0.3962, 'grad_norm': 58.98360824584961, 'learning_rate': 4.3125523023339815e-07, 'rewards/chosen': -1.39695405960083, 'rewards/rejected': -3.623495578765869, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.226541519165039, 'logps/chosen': -108.99090576171875, 'logps/rejected': -280.693603515625, 'logps/ref_chosen': -36.7948112487793, 'logps/ref_rejected': -93.1485366821289, 'logits/chosen': -2.162309169769287, 'logits/rejected': -2.1342501640319824, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.01932777464389801, 'epsilon_dpo/loss_margin_mean': 115.3489990234375, 'epsilon_dpo/beta_margin_mean': 2.22654128074646, 'epsilon_dpo/beta_margin_std': 1.3763279914855957, 'epsilon_dpo/beta_margin_grad_mean': -0.15312562882900238, 'epsilon_dpo/beta_margin_grad_std': 0.161267951130867, 'kl/beta': 
0.019502583891153336, 'kl/avg_steps': 0.90625, 'epoch': 0.32} 32%|████████████████████████████████████▍ | 218/681 [15:00<21:12, 2.75s/it] 32%|████████████████████████████████████▋ | 219/681 [15:02<20:45, 2.70s/it] {'loss': 0.4669, 'grad_norm': 61.87991714477539, 'learning_rate': 4.303689819449636e-07, 'rewards/chosen': -1.3612611293792725, 'rewards/rejected': -3.615694046020508, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.2544326782226562, 'logps/chosen': -113.684814453125, 'logps/rejected': -280.8560791015625, 'logps/ref_chosen': -42.875755310058594, 'logps/ref_rejected': -92.20575714111328, 'logits/chosen': -2.2709426879882812, 'logits/rejected': -2.1236162185668945, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.019178351387381554, 'epsilon_dpo/loss_margin_mean': 117.84127807617188, 'epsilon_dpo/beta_margin_mean': 2.2544329166412354, 'epsilon_dpo/beta_margin_std': 1.7746726274490356, 'epsilon_dpo/beta_margin_grad_mean': -0.178512305021286, 'epsilon_dpo/beta_margin_grad_std': 0.184907004237175, 'kl/beta': 0.019327430054545403, 'kl/avg_steps': 0.78125, 'epoch': 0.32} 32%|████████████████████████████████████▋ | 219/681 [15:02<20:45, 2.70s/it] 32%|████████████████████████████████████▊ | 220/681 [15:05<20:22, 2.65s/it] {'loss': 0.4237, 'grad_norm': 57.75115966796875, 'learning_rate': 4.2947798076611047e-07, 'rewards/chosen': -1.4991801977157593, 'rewards/rejected': -3.63541841506958, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.1362380981445312, 'logps/chosen': -121.99055480957031, 'logps/rejected': -286.1322326660156, 'logps/ref_chosen': -43.218231201171875, 'logps/ref_rejected': -94.84095764160156, 'logits/chosen': -2.249175548553467, 'logits/rejected': -2.228762626647949, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.019017694517970085, 'epsilon_dpo/loss_margin_mean': 112.51897430419922, 'epsilon_dpo/beta_margin_mean': 2.1362380981445312, 'epsilon_dpo/beta_margin_std': 
1.426170825958252, 'epsilon_dpo/beta_margin_grad_mean': -0.17227615416049957, 'epsilon_dpo/beta_margin_grad_std': 0.1603304147720337, 'kl/beta': 0.019177604466676712, 'kl/avg_steps': 0.84375, 'epoch': 0.32} 32%|████████████████████████████████████▊ | 220/681 [15:05<20:22, 2.65s/it] 32%|████████████████████████████████████▉ | 221/681 [15:08<20:05, 2.62s/it] {'loss': 0.2732, 'grad_norm': 44.90319061279297, 'learning_rate': 4.285822501755485e-07, 'rewards/chosen': -1.2945091724395752, 'rewards/rejected': -4.154721260070801, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.8602123260498047, 'logps/chosen': -105.45207214355469, 'logps/rejected': -333.3541259765625, 'logps/ref_chosen': -36.884986877441406, 'logps/ref_rejected': -112.87872314453125, 'logits/chosen': -2.0982179641723633, 'logits/rejected': -2.1222004890441895, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.018852630630135536, 'epsilon_dpo/loss_margin_mean': 151.9083251953125, 'epsilon_dpo/beta_margin_mean': 2.8602123260498047, 'epsilon_dpo/beta_margin_std': 1.571157693862915, 'epsilon_dpo/beta_margin_grad_mean': -0.11038573086261749, 'epsilon_dpo/beta_margin_grad_std': 0.14331035315990448, 'kl/beta': 0.01901714690029621, 'kl/avg_steps': 0.875, 'epoch': 0.32} 32%|████████████████████████████████████▉ | 221/681 [15:08<20:05, 2.62s/it] 33%|█████████████████████████████████████▏ | 222/681 [15:10<19:43, 2.58s/it] {'loss': 0.3942, 'grad_norm': 68.18460845947266, 'learning_rate': 4.276818137766118e-07, 'rewards/chosen': -1.4119031429290771, 'rewards/rejected': -3.8390281200408936, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.4271249771118164, 'logps/chosen': -112.70558166503906, 'logps/rejected': -311.8533935546875, 'logps/ref_chosen': -37.27526092529297, 'logps/ref_rejected': -106.37206268310547, 'logits/chosen': -2.1893157958984375, 'logits/rejected': -2.2103257179260254, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 
0.018689103424549103, 'epsilon_dpo/loss_margin_mean': 130.051025390625, 'epsilon_dpo/beta_margin_mean': 2.4271247386932373, 'epsilon_dpo/beta_margin_std': 1.7556877136230469, 'epsilon_dpo/beta_margin_grad_mean': -0.15696805715560913, 'epsilon_dpo/beta_margin_grad_std': 0.16546384990215302, 'kl/beta': 0.018852191045880318, 'kl/avg_steps': 0.875, 'epoch': 0.33} 33%|█████████████████████████████████████▏ | 222/681 [15:10<19:43, 2.58s/it] 33%|█████████████████████████████████████▎ | 223/681 [15:12<18:47, 2.46s/it] {'loss': 0.542, 'grad_norm': 77.07755279541016, 'learning_rate': 4.2677669529663686e-07, 'rewards/chosen': -1.581796407699585, 'rewards/rejected': -3.775017261505127, 'rewards/accuracies': 0.859375, 'rewards/margins': 2.193220615386963, 'logps/chosen': -117.65826416015625, 'logps/rejected': -290.10894775390625, 'logps/ref_chosen': -32.709083557128906, 'logps/ref_rejected': -86.52430725097656, 'logits/chosen': -2.1241304874420166, 'logits/rejected': -2.0650038719177246, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.018556196242570877, 'epsilon_dpo/loss_margin_mean': 118.63549041748047, 'epsilon_dpo/beta_margin_mean': 2.193220615386963, 'epsilon_dpo/beta_margin_std': 1.8057856559753418, 'epsilon_dpo/beta_margin_grad_mean': -0.1877458542585373, 'epsilon_dpo/beta_margin_grad_std': 0.21679145097732544, 'kl/beta': 0.018688665702939034, 'kl/avg_steps': 0.71875, 'epoch': 0.33} 33%|█████████████████████████████████████▎ | 223/681 [15:12<18:47, 2.46s/it] 33%|█████████████████████████████████████▍ | 224/681 [15:14<18:08, 2.38s/it] {'loss': 0.4078, 'grad_norm': 64.31571197509766, 'learning_rate': 4.2586691858633747e-07, 'rewards/chosen': -1.429018497467041, 'rewards/rejected': -3.910590648651123, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.481572389602661, 'logps/chosen': -113.04464721679688, 'logps/rejected': -294.23968505859375, 'logps/ref_chosen': -35.54627990722656, 'logps/ref_rejected': -81.60932922363281, 
'logits/chosen': -2.1345582008361816, 'logits/rejected': -1.9911662340164185, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.018406376242637634, 'epsilon_dpo/loss_margin_mean': 135.1320037841797, 'epsilon_dpo/beta_margin_mean': 2.481572389602661, 'epsilon_dpo/beta_margin_std': 1.7067841291427612, 'epsilon_dpo/beta_margin_grad_mean': -0.15500150620937347, 'epsilon_dpo/beta_margin_grad_std': 0.1830359399318695, 'kl/beta': 0.018555298447608948, 'kl/avg_steps': 0.8125, 'epoch': 0.33} 33%|█████████████████████████████████████▍ | 224/681 [15:14<18:08, 2.38s/it] 33%|█████████████████████████████████████▋ | 225/681 [15:17<18:22, 2.42s/it] {'loss': 0.4097, 'grad_norm': 65.03907775878906, 'learning_rate': 4.249525076191759e-07, 'rewards/chosen': -1.4013426303863525, 'rewards/rejected': -4.040144920349121, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.6388025283813477, 'logps/chosen': -111.3397216796875, 'logps/rejected': -328.50433349609375, 'logps/ref_chosen': -34.65919876098633, 'logps/ref_rejected': -106.95365905761719, 'logits/chosen': -2.1871337890625, 'logits/rejected': -2.1202986240386963, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.018246525898575783, 'epsilon_dpo/loss_margin_mean': 144.87014770507812, 'epsilon_dpo/beta_margin_mean': 2.6388025283813477, 'epsilon_dpo/beta_margin_std': 1.800585150718689, 'epsilon_dpo/beta_margin_grad_mean': -0.1459973156452179, 'epsilon_dpo/beta_margin_grad_std': 0.19097650051116943, 'kl/beta': 0.018405752256512642, 'kl/avg_steps': 0.875, 'epoch': 0.33} 33%|█████████████████████████████████████▋ | 225/681 [15:17<18:22, 2.42s/it] 33%|█████████████████████████████████████▊ | 226/681 [15:20<18:53, 2.49s/it] {'loss': 0.4196, 'grad_norm': 53.89804458618164, 'learning_rate': 4.2403348649073167e-07, 'rewards/chosen': -1.3841276168823242, 'rewards/rejected': -3.5787124633789062, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.194584846496582, 'logps/chosen': 
-120.92872619628906, 'logps/rejected': -281.99481201171875, 'logps/ref_chosen': -44.4660758972168, 'logps/ref_rejected': -84.042724609375, 'logits/chosen': -2.2922072410583496, 'logits/rejected': -2.074962615966797, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.01808825507760048, 'epsilon_dpo/loss_margin_mean': 121.48944854736328, 'epsilon_dpo/beta_margin_mean': 2.194584846496582, 'epsilon_dpo/beta_margin_std': 1.5229578018188477, 'epsilon_dpo/beta_margin_grad_mean': -0.17010191082954407, 'epsilon_dpo/beta_margin_grad_std': 0.16130515933036804, 'kl/beta': 0.01824609935283661, 'kl/avg_steps': 0.875, 'epoch': 0.33} 33%|█████████████████████████████████████▊ | 226/681 [15:20<18:53, 2.49s/it] 33%|██████████████████████████████████████ | 227/681 [15:22<18:37, 2.46s/it] {'loss': 0.4017, 'grad_norm': 62.88763427734375, 'learning_rate': 4.2310987941806615e-07, 'rewards/chosen': -1.3941301107406616, 'rewards/rejected': -4.054275035858154, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.6601450443267822, 'logps/chosen': -111.29264831542969, 'logps/rejected': -332.619384765625, 'logps/ref_chosen': -33.7484245300293, 'logps/ref_rejected': -106.498291015625, 'logits/chosen': -2.217491865158081, 'logits/rejected': -2.0915043354034424, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.017948314547538757, 'epsilon_dpo/loss_margin_mean': 148.5768585205078, 'epsilon_dpo/beta_margin_mean': 2.6601450443267822, 'epsilon_dpo/beta_margin_std': 1.9379621744155884, 'epsilon_dpo/beta_margin_grad_mean': -0.15359929203987122, 'epsilon_dpo/beta_margin_grad_std': 0.18651175498962402, 'kl/beta': 0.018087830394506454, 'kl/avg_steps': 0.78125, 'epoch': 0.33} 33%|██████████████████████████████████████ | 227/681 [15:22<18:37, 2.46s/it] 33%|██████████████████████████████████████▏ | 228/681 [15:25<19:01, 2.52s/it] {'loss': 0.3771, 'grad_norm': 55.945281982421875, 'learning_rate': 4.2218171073908463e-07, 'rewards/chosen': 
-1.473421335220337, 'rewards/rejected': -3.7081234455108643, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.2347021102905273, 'logps/chosen': -128.66030883789062, 'logps/rejected': -304.66015625, 'logps/ref_chosen': -45.935726165771484, 'logps/ref_rejected': -96.16656494140625, 'logits/chosen': -2.300729990005493, 'logits/rejected': -2.2182745933532715, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0177923534065485, 'epsilon_dpo/loss_margin_mean': 125.76899719238281, 'epsilon_dpo/beta_margin_mean': 2.2347021102905273, 'epsilon_dpo/beta_margin_std': 1.3810467720031738, 'epsilon_dpo/beta_margin_grad_mean': -0.15493857860565186, 'epsilon_dpo/beta_margin_grad_std': 0.1513359248638153, 'kl/beta': 0.017947614192962646, 'kl/avg_steps': 0.875, 'epoch': 0.33} 33%|██████████████████████████████████████▏ | 228/681 [15:25<19:01, 2.52s/it] 34%|██████████████████████████████████████▎ | 229/681 [15:27<18:49, 2.50s/it] {'loss': 0.4188, 'grad_norm': 64.01463317871094, 'learning_rate': 4.212490049118951e-07, 'rewards/chosen': -1.1469919681549072, 'rewards/rejected': -3.779086112976074, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.632094383239746, 'logps/chosen': -100.0278091430664, 'logps/rejected': -304.251220703125, 'logps/ref_chosen': -35.16382598876953, 'logps/ref_rejected': -89.91634368896484, 'logits/chosen': -2.2102646827697754, 'logits/rejected': -2.084036350250244, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.01766026020050049, 'epsilon_dpo/loss_margin_mean': 149.47091674804688, 'epsilon_dpo/beta_margin_mean': 2.632094383239746, 'epsilon_dpo/beta_margin_std': 1.839905023574829, 'epsilon_dpo/beta_margin_grad_mean': -0.155641108751297, 'epsilon_dpo/beta_margin_grad_std': 0.19922208786010742, 'kl/beta': 0.017791934311389923, 'kl/avg_steps': 0.75, 'epoch': 0.34} 34%|██████████████████████████████████████▎ | 229/681 [15:27<18:49, 2.50s/it] 34%|██████████████████████████████████████▌ | 230/681 
[15:30<18:51, 2.51s/it] {'loss': 0.4007, 'grad_norm': 57.6724853515625, 'learning_rate': 4.203117865141635e-07, 'rewards/chosen': -1.5095224380493164, 'rewards/rejected': -3.8183393478393555, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.30881667137146, 'logps/chosen': -114.36758422851562, 'logps/rejected': -308.0694274902344, 'logps/ref_chosen': -28.29522705078125, 'logps/ref_rejected': -89.92157745361328, 'logits/chosen': -1.9427410364151, 'logits/rejected': -1.9851582050323486, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.017512237653136253, 'epsilon_dpo/loss_margin_mean': 132.07550048828125, 'epsilon_dpo/beta_margin_mean': 2.30881667137146, 'epsilon_dpo/beta_margin_std': 1.4573560953140259, 'epsilon_dpo/beta_margin_grad_mean': -0.15799804031848907, 'epsilon_dpo/beta_margin_grad_std': 0.1725914478302002, 'kl/beta': 0.017659489065408707, 'kl/avg_steps': 0.84375, 'epoch': 0.34} 34%|██████████████████████████████████████▌ | 230/681 [15:30<18:51, 2.51s/it] 34%|██████████████████████████████████████▋ | 231/681 [15:32<19:06, 2.55s/it] {'loss': 0.3873, 'grad_norm': 58.66816711425781, 'learning_rate': 4.1937008024246625e-07, 'rewards/chosen': -1.4385372400283813, 'rewards/rejected': -3.782843828201294, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.344306468963623, 'logps/chosen': -122.01680755615234, 'logps/rejected': -298.10369873046875, 'logps/ref_chosen': -39.274810791015625, 'logps/ref_rejected': -80.1042709350586, 'logits/chosen': -2.2687087059020996, 'logits/rejected': -2.0110936164855957, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.01736571453511715, 'epsilon_dpo/loss_margin_mean': 135.25741577148438, 'epsilon_dpo/beta_margin_mean': 2.344306468963623, 'epsilon_dpo/beta_margin_std': 1.5224120616912842, 'epsilon_dpo/beta_margin_grad_mean': -0.15104635059833527, 'epsilon_dpo/beta_margin_grad_std': 0.16789193451404572, 'kl/beta': 0.01751173473894596, 'kl/avg_steps': 
0.84375, 'epoch': 0.34} 34%|██████████████████████████████████████▋ | 231/681 [15:32<19:06, 2.55s/it] 34%|██████████████████████████████████████▊ | 232/681 [15:35<18:58, 2.53s/it] {'loss': 0.4288, 'grad_norm': 55.898075103759766, 'learning_rate': 4.1842391091163933e-07, 'rewards/chosen': -1.3723829984664917, 'rewards/rejected': -3.6786396503448486, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.3062567710876465, 'logps/chosen': -121.8990707397461, 'logps/rejected': -302.57232666015625, 'logps/ref_chosen': -42.393104553222656, 'logps/ref_rejected': -88.90144348144531, 'logits/chosen': -2.2190351486206055, 'logits/rejected': -2.2044219970703125, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0172258447855711, 'epsilon_dpo/loss_margin_mean': 134.1649169921875, 'epsilon_dpo/beta_margin_mean': 2.3062567710876465, 'epsilon_dpo/beta_margin_std': 1.650961995124817, 'epsilon_dpo/beta_margin_grad_mean': -0.16703417897224426, 'epsilon_dpo/beta_margin_grad_std': 0.17445211112499237, 'kl/beta': 0.017365215346217155, 'kl/avg_steps': 0.8125, 'epoch': 0.34} 34%|██████████████████████████████████████▊ | 232/681 [15:35<18:58, 2.53s/it] 34%|███████████████████████████████████████ | 233/681 [15:37<19:03, 2.55s/it] {'loss': 0.2959, 'grad_norm': 49.05929183959961, 'learning_rate': 4.174733034541245e-07, 'rewards/chosen': -1.3310226202011108, 'rewards/rejected': -4.204285621643066, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.873262882232666, 'logps/chosen': -108.28323364257812, 'logps/rejected': -361.3421630859375, 'logps/ref_chosen': -30.483036041259766, 'logps/ref_rejected': -115.03839111328125, 'logits/chosen': -2.11576509475708, 'logits/rejected': -2.1665868759155273, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.017076246440410614, 'epsilon_dpo/loss_margin_mean': 168.5035858154297, 'epsilon_dpo/beta_margin_mean': 2.873262882232666, 'epsilon_dpo/beta_margin_std': 1.6749653816223145, 
'epsilon_dpo/beta_margin_grad_mean': -0.11840350925922394, 'epsilon_dpo/beta_margin_grad_std': 0.15504033863544464, 'kl/beta': 0.01722525991499424, 'kl/avg_steps': 0.875, 'epoch': 0.34} 34%|███████████████████████████████████████ | 233/681 [15:37<19:03, 2.55s/it] 34%|███████████████████████████████████████▏ | 234/681 [15:40<19:08, 2.57s/it] {'loss': 0.4328, 'grad_norm': 67.56781005859375, 'learning_rate': 4.165182829193126e-07, 'rewards/chosen': -1.3788411617279053, 'rewards/rejected': -4.01970911026001, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.6408681869506836, 'logps/chosen': -111.21868133544922, 'logps/rejected': -346.22235107421875, 'logps/ref_chosen': -30.016942977905273, 'logps/ref_rejected': -108.75608825683594, 'logits/chosen': -2.0071873664855957, 'logits/rejected': -2.1877129077911377, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.01694413460791111, 'epsilon_dpo/loss_margin_mean': 156.26454162597656, 'epsilon_dpo/beta_margin_mean': 2.6408679485321045, 'epsilon_dpo/beta_margin_std': 2.0294158458709717, 'epsilon_dpo/beta_margin_grad_mean': -0.15992534160614014, 'epsilon_dpo/beta_margin_grad_std': 0.19790522754192352, 'kl/beta': 0.01707584597170353, 'kl/avg_steps': 0.78125, 'epoch': 0.34} 34%|███████████████████████████████████████▏ | 234/681 [15:40<19:08, 2.57s/it] 35%|███████████████████████████████████████▎ | 235/681 [15:42<18:54, 2.54s/it] {'loss': 0.4894, 'grad_norm': 77.5376205444336, 'learning_rate': 4.1555887447288255e-07, 'rewards/chosen': -1.6119556427001953, 'rewards/rejected': -3.8830833435058594, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.271127700805664, 'logps/chosen': -141.7093963623047, 'logps/rejected': -327.0941162109375, 'logps/ref_chosen': -46.081146240234375, 'logps/ref_rejected': -96.02244567871094, 'logits/chosen': -2.111543655395508, 'logits/rejected': -2.081702470779419, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 
0.01681278459727764, 'epsilon_dpo/loss_margin_mean': 135.44342041015625, 'epsilon_dpo/beta_margin_mean': 2.271127700805664, 'epsilon_dpo/beta_margin_std': 1.7841382026672363, 'epsilon_dpo/beta_margin_grad_mean': -0.175105020403862, 'epsilon_dpo/beta_margin_grad_std': 0.1960555464029312, 'kl/beta': 0.016943475231528282, 'kl/avg_steps': 0.78125, 'epoch': 0.35} 35%|███████████████████████████████████████▎ | 235/681 [15:42<18:54, 2.54s/it] 35%|███████████████████████████████████████▌ | 236/681 [15:45<18:40, 2.52s/it] {'loss': 0.4048, 'grad_norm': 56.97810363769531, 'learning_rate': 4.1459510339613946e-07, 'rewards/chosen': -1.3549550771713257, 'rewards/rejected': -3.6831302642822266, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.3281750679016113, 'logps/chosen': -112.40109252929688, 'logps/rejected': -329.66357421875, 'logps/ref_chosen': -31.17489242553711, 'logps/ref_rejected': -108.55508422851562, 'logits/chosen': -2.05222487449646, 'logits/rejected': -2.147634267807007, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.01666669175028801, 'epsilon_dpo/loss_margin_mean': 139.8822784423828, 'epsilon_dpo/beta_margin_mean': 2.3281750679016113, 'epsilon_dpo/beta_margin_std': 1.6591843366622925, 'epsilon_dpo/beta_margin_grad_mean': -0.1637529879808426, 'epsilon_dpo/beta_margin_grad_std': 0.16410239040851593, 'kl/beta': 0.01681213080883026, 'kl/avg_steps': 0.875, 'epoch': 0.35} 35%|███████████████████████████████████████▌ | 236/681 [15:45<18:40, 2.52s/it] 35%|███████████████████████████████████████▋ | 237/681 [15:48<19:04, 2.58s/it] {'loss': 0.3225, 'grad_norm': 47.88125991821289, 'learning_rate': 4.136269950853473e-07, 'rewards/chosen': -1.198293924331665, 'rewards/rejected': -3.849404811859131, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.651111125946045, 'logps/chosen': -99.72041320800781, 'logps/rejected': -334.0084228515625, 'logps/ref_chosen': -27.259479522705078, 'logps/ref_rejected': -100.87033081054688, 
'logits/chosen': -2.0394325256347656, 'logits/rejected': -2.007416248321533, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.01652212254703045, 'epsilon_dpo/loss_margin_mean': 160.67715454101562, 'epsilon_dpo/beta_margin_mean': 2.651111125946045, 'epsilon_dpo/beta_margin_std': 1.6006019115447998, 'epsilon_dpo/beta_margin_grad_mean': -0.13068081438541412, 'epsilon_dpo/beta_margin_grad_std': 0.15401048958301544, 'kl/beta': 0.01666630059480667, 'kl/avg_steps': 0.875, 'epoch': 0.35} 35%|███████████████████████████████████████▋ | 237/681 [15:48<19:04, 2.58s/it] 35%|███████████████████████████████████████▊ | 238/681 [15:50<19:05, 2.59s/it] {'loss': 0.4646, 'grad_norm': 65.86769104003906, 'learning_rate': 4.126545750510605e-07, 'rewards/chosen': -1.598804235458374, 'rewards/rejected': -3.772151231765747, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.173346996307373, 'logps/chosen': -137.6107177734375, 'logps/rejected': -323.8140563964844, 'logps/ref_chosen': -40.190574645996094, 'logps/ref_rejected': -93.49894714355469, 'logits/chosen': -2.0702290534973145, 'logits/rejected': -2.123429298400879, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.016383972018957138, 'epsilon_dpo/loss_margin_mean': 132.8949737548828, 'epsilon_dpo/beta_margin_mean': 2.173346996307373, 'epsilon_dpo/beta_margin_std': 1.7011436223983765, 'epsilon_dpo/beta_margin_grad_mean': -0.1836518496274948, 'epsilon_dpo/beta_margin_grad_std': 0.17506077885627747, 'kl/beta': 0.01652173511683941, 'kl/avg_steps': 0.84375, 'epoch': 0.35} 35%|███████████████████████████████████████▊ | 238/681 [15:50<19:05, 2.59s/it] 35%|████████████████████████████████████████ | 239/681 [15:53<19:02, 2.59s/it] {'loss': 0.4034, 'grad_norm': 56.121368408203125, 'learning_rate': 4.116778689174514e-07, 'rewards/chosen': -1.4062585830688477, 'rewards/rejected': -3.755234956741333, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.3489763736724854, 
'logps/chosen': -124.91606140136719, 'logps/rejected': -331.0585021972656, 'logps/ref_chosen': -38.33943176269531, 'logps/ref_rejected': -99.64225769042969, 'logits/chosen': -2.0602469444274902, 'logits/rejected': -1.9942574501037598, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.016236647963523865, 'epsilon_dpo/loss_margin_mean': 144.83961486816406, 'epsilon_dpo/beta_margin_mean': 2.3489763736724854, 'epsilon_dpo/beta_margin_std': 1.5995773077011108, 'epsilon_dpo/beta_margin_grad_mean': -0.16019320487976074, 'epsilon_dpo/beta_margin_grad_std': 0.1673043966293335, 'kl/beta': 0.016383498907089233, 'kl/avg_steps': 0.90625, 'epoch': 0.35} 35%|████████████████████████████████████████ | 239/681 [15:53<19:02, 2.59s/it] 35%|████████████████████████████████████████▏ | 240/681 [15:55<18:39, 2.54s/it] {'loss': 0.4095, 'grad_norm': 66.01368713378906, 'learning_rate': 4.106969024216348e-07, 'rewards/chosen': -1.2190780639648438, 'rewards/rejected': -3.4683375358581543, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.2492594718933105, 'logps/chosen': -111.68765258789062, 'logps/rejected': -295.5617980957031, 'logps/ref_chosen': -36.1579704284668, 'logps/ref_rejected': -80.07916259765625, 'logits/chosen': -2.0398552417755127, 'logits/rejected': -1.8625662326812744, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.016100972890853882, 'epsilon_dpo/loss_margin_mean': 139.95294189453125, 'epsilon_dpo/beta_margin_mean': 2.2492594718933105, 'epsilon_dpo/beta_margin_std': 1.4863446950912476, 'epsilon_dpo/beta_margin_grad_mean': -0.16005262732505798, 'epsilon_dpo/beta_margin_grad_std': 0.16643419861793518, 'kl/beta': 0.016236357390880585, 'kl/avg_steps': 0.84375, 'epoch': 0.35} 35%|████████████████████████████████████████▏ | 240/681 [15:55<18:39, 2.54s/it] 35%|████████████████████████████████████████▎ | 241/681 [15:58<18:35, 2.53s/it] {'loss': 0.426, 'grad_norm': 74.83716583251953, 'learning_rate': 
4.097117014129903e-07, 'rewards/chosen': -1.3152108192443848, 'rewards/rejected': -3.7722537517547607, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.457043170928955, 'logps/chosen': -126.07571411132812, 'logps/rejected': -329.2753601074219, 'logps/ref_chosen': -44.0040397644043, 'logps/ref_rejected': -93.001220703125, 'logits/chosen': -2.196474313735962, 'logits/rejected': -1.913820505142212, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.015971289947628975, 'epsilon_dpo/loss_margin_mean': 154.20245361328125, 'epsilon_dpo/beta_margin_mean': 2.457042932510376, 'epsilon_dpo/beta_margin_std': 1.8046914339065552, 'epsilon_dpo/beta_margin_grad_mean': -0.16114287078380585, 'epsilon_dpo/beta_margin_grad_std': 0.18923403322696686, 'kl/beta': 0.016100509092211723, 'kl/avg_steps': 0.8125, 'epoch': 0.35} 35%|████████████████████████████████████████▎ | 241/681 [15:58<18:35, 2.53s/it] 36%|████████████████████████████████████████▌ | 242/681 [16:00<18:37, 2.54s/it] {'loss': 0.4175, 'grad_norm': 78.47781372070312, 'learning_rate': 4.087222918524807e-07, 'rewards/chosen': -1.2343041896820068, 'rewards/rejected': -3.691114902496338, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.45681095123291, 'logps/chosen': -109.82485961914062, 'logps/rejected': -324.636962890625, 'logps/ref_chosen': -32.014137268066406, 'logps/ref_rejected': -91.38673400878906, 'logits/chosen': -2.161656379699707, 'logits/rejected': -1.9027358293533325, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.015827594324946404, 'epsilon_dpo/loss_margin_mean': 155.43951416015625, 'epsilon_dpo/beta_margin_mean': 2.45681095123291, 'epsilon_dpo/beta_margin_std': 1.6646445989608765, 'epsilon_dpo/beta_margin_grad_mean': -0.14874504506587982, 'epsilon_dpo/beta_margin_grad_std': 0.17723040282726288, 'kl/beta': 0.0159707460552454, 'kl/avg_steps': 0.90625, 'epoch': 0.36} 36%|████████████████████████████████████████▌ | 242/681 [16:00<18:37, 
2.54s/it] 36%|████████████████████████████████████████▋ | 243/681 [16:03<18:30, 2.53s/it] {'loss': 0.3675, 'grad_norm': 58.3444709777832, 'learning_rate': 4.07728699811968e-07, 'rewards/chosen': -1.282745122909546, 'rewards/rejected': -3.6751694679260254, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.3924245834350586, 'logps/chosen': -119.57041931152344, 'logps/rejected': -316.1097717285156, 'logps/ref_chosen': -37.93376922607422, 'logps/ref_rejected': -81.79078674316406, 'logits/chosen': -2.0235860347747803, 'logits/rejected': -1.7727184295654297, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.01569039188325405, 'epsilon_dpo/loss_margin_mean': 152.6823272705078, 'epsilon_dpo/beta_margin_mean': 2.3924245834350586, 'epsilon_dpo/beta_margin_std': 1.4725757837295532, 'epsilon_dpo/beta_margin_grad_mean': -0.14957940578460693, 'epsilon_dpo/beta_margin_grad_std': 0.1580885350704193, 'kl/beta': 0.01582731120288372, 'kl/avg_steps': 0.875, 'epoch': 0.36} 36%|████████████████████████████████████████▋ | 243/681 [16:03<18:30, 2.53s/it] 36%|████████████████████████████████████████▊ | 244/681 [16:05<18:26, 2.53s/it] {'loss': 0.4834, 'grad_norm': 60.811737060546875, 'learning_rate': 4.067309514735267e-07, 'rewards/chosen': -1.3352429866790771, 'rewards/rejected': -3.4218435287475586, 'rewards/accuracies': 0.890625, 'rewards/margins': 2.0866003036499023, 'logps/chosen': -125.60481262207031, 'logps/rejected': -321.4036865234375, 'logps/ref_chosen': -39.915008544921875, 'logps/ref_rejected': -101.33251953125, 'logits/chosen': -2.1224303245544434, 'logits/rejected': -2.0141685009002686, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.015569004230201244, 'epsilon_dpo/loss_margin_mean': 134.38136291503906, 'epsilon_dpo/beta_margin_mean': 2.0866005420684814, 'epsilon_dpo/beta_margin_std': 1.515892744064331, 'epsilon_dpo/beta_margin_grad_mean': -0.18348611891269684, 'epsilon_dpo/beta_margin_grad_std': 
0.19074872136116028, 'kl/beta': 0.015690024942159653, 'kl/avg_steps': 0.78125, 'epoch': 0.36} 36%|████████████████████████████████████████▊ | 244/681 [16:05<18:26, 2.53s/it] 36%|█████████████████████████████████████████ | 245/681 [16:08<18:21, 2.53s/it] {'loss': 0.538, 'grad_norm': 70.79053497314453, 'learning_rate': 4.057290731287531e-07, 'rewards/chosen': -1.1700770854949951, 'rewards/rejected': -3.351145029067993, 'rewards/accuracies': 0.875, 'rewards/margins': 2.181067943572998, 'logps/chosen': -115.84690856933594, 'logps/rejected': -310.2763671875, 'logps/ref_chosen': -40.404693603515625, 'logps/ref_rejected': -93.24897766113281, 'logits/chosen': -2.187222957611084, 'logits/rejected': -1.977565050125122, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.015458044596016407, 'epsilon_dpo/loss_margin_mean': 141.5851593017578, 'epsilon_dpo/beta_margin_mean': 2.181067943572998, 'epsilon_dpo/beta_margin_std': 1.7734534740447998, 'epsilon_dpo/beta_margin_grad_mean': -0.19498883187770844, 'epsilon_dpo/beta_margin_grad_std': 0.21323725581169128, 'kl/beta': 0.01556839607656002, 'kl/avg_steps': 0.71875, 'epoch': 0.36} 36%|█████████████████████████████████████████ | 245/681 [16:08<18:21, 2.53s/it] 36%|█████████████████████████████████████████▏ | 246/681 [16:10<18:24, 2.54s/it] {'loss': 0.4609, 'grad_norm': 59.1838264465332, 'learning_rate': 4.047230911780736e-07, 'rewards/chosen': -1.3602585792541504, 'rewards/rejected': -3.4846174716949463, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.124358892440796, 'logps/chosen': -134.75564575195312, 'logps/rejected': -317.5809326171875, 'logps/ref_chosen': -46.212432861328125, 'logps/ref_rejected': -90.18721008300781, 'logits/chosen': -2.1199212074279785, 'logits/rejected': -1.9131314754486084, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.015333239920437336, 'epsilon_dpo/loss_margin_mean': 138.85049438476562, 'epsilon_dpo/beta_margin_mean': 
2.124358892440796, 'epsilon_dpo/beta_margin_std': 1.5861018896102905, 'epsilon_dpo/beta_margin_grad_mean': -0.18123117089271545, 'epsilon_dpo/beta_margin_grad_std': 0.17709243297576904, 'kl/beta': 0.01545729674398899, 'kl/avg_steps': 0.8125, 'epoch': 0.36} 36%|█████████████████████████████████████████▏ | 246/681 [16:10<18:24, 2.54s/it] 36%|█████████████████████████████████████████▎ | 247/681 [16:13<17:50, 2.47s/it] {'loss': 0.4948, 'grad_norm': 73.05105590820312, 'learning_rate': 4.0371303213004814e-07, 'rewards/chosen': -1.479392409324646, 'rewards/rejected': -3.8148579597473145, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.335465431213379, 'logps/chosen': -138.3466796875, 'logps/rejected': -362.81231689453125, 'logps/ref_chosen': -41.23990249633789, 'logps/ref_rejected': -111.74742126464844, 'logits/chosen': -2.0222108364105225, 'logits/rejected': -2.00075626373291, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.015209660865366459, 'epsilon_dpo/loss_margin_mean': 153.95811462402344, 'epsilon_dpo/beta_margin_mean': 2.335465669631958, 'epsilon_dpo/beta_margin_std': 1.7916253805160522, 'epsilon_dpo/beta_margin_grad_mean': -0.16919678449630737, 'epsilon_dpo/beta_margin_grad_std': 0.20299823582172394, 'kl/beta': 0.015332718379795551, 'kl/avg_steps': 0.8125, 'epoch': 0.36} 36%|█████████████████████████████████████████▎ | 247/681 [16:13<17:50, 2.47s/it] 36%|█████████████████████████████████████████▌ | 248/681 [16:15<17:44, 2.46s/it] {'loss': 0.2904, 'grad_norm': 40.315460205078125, 'learning_rate': 4.0269892260067197e-07, 'rewards/chosen': -1.1554369926452637, 'rewards/rejected': -3.61720609664917, 'rewards/accuracies': 0.984375, 'rewards/margins': 2.461768865585327, 'logps/chosen': -102.28887939453125, 'logps/rejected': -337.76348876953125, 'logps/ref_chosen': -25.618247985839844, 'logps/ref_rejected': -97.59014892578125, 'logits/chosen': -1.9299895763397217, 'logits/rejected': -2.0007243156433105, 'kl/p_epsilon_steps': 
0.984375, 'kl/n_epsilon_steps': 0.015625, 'epsilon_dpo/beta': 0.015063311904668808, 'epsilon_dpo/loss_margin_mean': 163.502685546875, 'epsilon_dpo/beta_margin_mean': 2.461768865585327, 'epsilon_dpo/beta_margin_std': 1.234181523323059, 'epsilon_dpo/beta_margin_grad_mean': -0.12139809876680374, 'epsilon_dpo/beta_margin_grad_std': 0.12686574459075928, 'kl/beta': 0.015209143981337547, 'kl/avg_steps': 0.96875, 'epoch': 0.36} 36%|█████████████████████████████████████████▌ | 248/681 [16:15<17:44, 2.46s/it] 37%|█████████████████████████████████████████▋ | 249/681 [16:18<18:18, 2.54s/it] {'loss': 0.4484, 'grad_norm': 58.40861129760742, 'learning_rate': 4.0168078931267426e-07, 'rewards/chosen': -1.377791404724121, 'rewards/rejected': -3.506488800048828, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.128697395324707, 'logps/chosen': -134.6277618408203, 'logps/rejected': -322.1723937988281, 'logps/ref_chosen': -42.576805114746094, 'logps/ref_rejected': -87.38154602050781, 'logits/chosen': -2.040287494659424, 'logits/rejected': -1.8837002515792847, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.014947031624615192, 'epsilon_dpo/loss_margin_mean': 142.73989868164062, 'epsilon_dpo/beta_margin_mean': 2.128697395324707, 'epsilon_dpo/beta_margin_std': 1.604970097541809, 'epsilon_dpo/beta_margin_grad_mean': -0.18095283210277557, 'epsilon_dpo/beta_margin_grad_std': 0.164636492729187, 'kl/beta': 0.015063218772411346, 'kl/avg_steps': 0.78125, 'epoch': 0.37} 37%|█████████████████████████████████████████▋ | 249/681 [16:18<18:18, 2.54s/it] 37%|█████████████████████████████████████████▊ | 250/681 [16:20<18:06, 2.52s/it] {'loss': 0.4175, 'grad_norm': 54.212379455566406, 'learning_rate': 4.006586590948141e-07, 'rewards/chosen': -1.2351545095443726, 'rewards/rejected': -3.3480517864227295, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.1128971576690674, 'logps/chosen': -126.69745635986328, 'logps/rejected': -306.03778076171875, 
'logps/ref_chosen': -43.33977508544922, 'logps/ref_rejected': -79.84855651855469, 'logits/chosen': -2.1342391967773438, 'logits/rejected': -1.7132606506347656, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.014812478795647621, 'epsilon_dpo/loss_margin_mean': 142.83154296875, 'epsilon_dpo/beta_margin_mean': 2.1128971576690674, 'epsilon_dpo/beta_margin_std': 1.335999608039856, 'epsilon_dpo/beta_margin_grad_mean': -0.16431747376918793, 'epsilon_dpo/beta_margin_grad_std': 0.16239361464977264, 'kl/beta': 0.014946449548006058, 'kl/avg_steps': 0.90625, 'epoch': 0.37} 37%|█████████████████████████████████████████▊ | 250/681 [16:20<18:06, 2.52s/it] 37%|██████████████████████████████████████████ | 251/681 [16:23<18:02, 2.52s/it] {'loss': 0.4894, 'grad_norm': 58.540199279785156, 'learning_rate': 3.9963255888117325e-07, 'rewards/chosen': -1.332367181777954, 'rewards/rejected': -3.3120296001434326, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.979662299156189, 'logps/chosen': -128.45263671875, 'logps/rejected': -308.31683349609375, 'logps/ref_chosen': -37.8934211730957, 'logps/ref_rejected': -82.71955871582031, 'logits/chosen': -2.064459800720215, 'logits/rejected': -1.8460227251052856, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.014693334698677063, 'epsilon_dpo/loss_margin_mean': 135.0380859375, 'epsilon_dpo/beta_margin_mean': 1.979662299156189, 'epsilon_dpo/beta_margin_std': 1.4761016368865967, 'epsilon_dpo/beta_margin_grad_mean': -0.1908206194639206, 'epsilon_dpo/beta_margin_grad_std': 0.17475423216819763, 'kl/beta': 0.01481221430003643, 'kl/avg_steps': 0.8125, 'epoch': 0.37} 37%|██████████████████████████████████████████ | 251/681 [16:23<18:02, 2.52s/it] 37%|██████████████████████████████████████████▏ | 252/681 [16:26<18:25, 2.58s/it] {'loss': 0.4052, 'grad_norm': 61.02168273925781, 'learning_rate': 3.9860251571044666e-07, 'rewards/chosen': -1.3763771057128906, 'rewards/rejected': 
-3.5437517166137695, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.167374610900879, 'logps/chosen': -143.55337524414062, 'logps/rejected': -335.2132568359375, 'logps/ref_chosen': -49.172019958496094, 'logps/ref_rejected': -91.81843566894531, 'logits/chosen': -2.0972189903259277, 'logits/rejected': -1.8574435710906982, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.014561137184500694, 'epsilon_dpo/loss_margin_mean': 149.01345825195312, 'epsilon_dpo/beta_margin_mean': 2.167374849319458, 'epsilon_dpo/beta_margin_std': 1.4150327444076538, 'epsilon_dpo/beta_margin_grad_mean': -0.16394339501857758, 'epsilon_dpo/beta_margin_grad_std': 0.15933236479759216, 'kl/beta': 0.014692834578454494, 'kl/avg_steps': 0.90625, 'epoch': 0.37} 37%|██████████████████████████████████████████▏ | 252/681 [16:26<18:25, 2.58s/it] 37%|██████████████████████████████████████████▎ | 253/681 [16:28<18:29, 2.59s/it] {'loss': 0.5018, 'grad_norm': 65.06855773925781, 'learning_rate': 3.9756855672522986e-07, 'rewards/chosen': -1.4654229879379272, 'rewards/rejected': -3.4208884239196777, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.95546555519104, 'logps/chosen': -156.42469787597656, 'logps/rejected': -341.0745544433594, 'logps/ref_chosen': -55.18561553955078, 'logps/ref_rejected': -104.15153503417969, 'logits/chosen': -2.0035042762756348, 'logits/rejected': -2.029510021209717, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.014448565430939198, 'epsilon_dpo/loss_margin_mean': 135.68392944335938, 'epsilon_dpo/beta_margin_mean': 1.9554654359817505, 'epsilon_dpo/beta_margin_std': 1.4330785274505615, 'epsilon_dpo/beta_margin_grad_mean': -0.1948602944612503, 'epsilon_dpo/beta_margin_grad_std': 0.1849391907453537, 'kl/beta': 0.014560877345502377, 'kl/avg_steps': 0.78125, 'epoch': 0.37} 37%|██████████████████████████████████████████▎ | 253/681 [16:28<18:29, 2.59s/it] 37%|██████████████████████████████████████████▌ | 
254/681 [16:31<18:25, 2.59s/it] {'loss': 0.5136, 'grad_norm': 57.888275146484375, 'learning_rate': 3.965307091713037e-07, 'rewards/chosen': -1.277719259262085, 'rewards/rejected': -3.309704065322876, 'rewards/accuracies': 0.859375, 'rewards/margins': 2.031984806060791, 'logps/chosen': -133.6912078857422, 'logps/rejected': -324.47515869140625, 'logps/ref_chosen': -44.80467224121094, 'logps/ref_rejected': -93.50021362304688, 'logits/chosen': -2.0362446308135986, 'logits/rejected': -1.8677754402160645, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.014350106939673424, 'epsilon_dpo/loss_margin_mean': 142.08840942382812, 'epsilon_dpo/beta_margin_mean': 2.031984806060791, 'epsilon_dpo/beta_margin_std': 1.5720576047897339, 'epsilon_dpo/beta_margin_grad_mean': -0.19600307941436768, 'epsilon_dpo/beta_margin_grad_std': 0.19512991607189178, 'kl/beta': 0.014448001980781555, 'kl/avg_steps': 0.6875, 'epoch': 0.37} 37%|██████████████████████████████████████████▌ | 254/681 [16:31<18:25, 2.59s/it] 37%|██████████████████████████████████████████▋ | 255/681 [16:33<18:12, 2.57s/it] {'loss': 0.4448, 'grad_norm': 61.83622360229492, 'learning_rate': 3.954890003969163e-07, 'rewards/chosen': -1.3225433826446533, 'rewards/rejected': -3.631333112716675, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.3087897300720215, 'logps/chosen': -130.06533813476562, 'logps/rejected': -352.3484191894531, 'logps/ref_chosen': -37.239234924316406, 'logps/ref_rejected': -96.95054626464844, 'logits/chosen': -1.7996987104415894, 'logits/rejected': -1.752530574798584, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.01423418615013361, 'epsilon_dpo/loss_margin_mean': 162.57176208496094, 'epsilon_dpo/beta_margin_mean': 2.3087897300720215, 'epsilon_dpo/beta_margin_std': 1.6275063753128052, 'epsilon_dpo/beta_margin_grad_mean': -0.16781477630138397, 'epsilon_dpo/beta_margin_grad_std': 0.1848216950893402, 'kl/beta': 0.014349350705742836, 
'kl/avg_steps': 0.8125, 'epoch': 0.37} 37%|██████████████████████████████████████████▋ | 255/681 [16:33<18:12, 2.57s/it] 38%|██████████████████████████████████████████▊ | 256/681 [16:36<17:37, 2.49s/it] {'loss': 0.3517, 'grad_norm': 47.72064208984375, 'learning_rate': 3.944434578520628e-07, 'rewards/chosen': -1.1789929866790771, 'rewards/rejected': -3.5271737575531006, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.3481807708740234, 'logps/chosen': -118.48297119140625, 'logps/rejected': -349.3486328125, 'logps/ref_chosen': -35.025508880615234, 'logps/ref_rejected': -99.25279998779297, 'logits/chosen': -1.8363198041915894, 'logits/rejected': -1.8857228755950928, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.014110567979514599, 'epsilon_dpo/loss_margin_mean': 166.6383819580078, 'epsilon_dpo/beta_margin_mean': 2.3481807708740234, 'epsilon_dpo/beta_margin_std': 1.3911449909210205, 'epsilon_dpo/beta_margin_grad_mean': -0.14517581462860107, 'epsilon_dpo/beta_margin_grad_std': 0.1478438377380371, 'kl/beta': 0.01423370186239481, 'kl/avg_steps': 0.875, 'epoch': 0.38} 38%|██████████████████████████████████████████▊ | 256/681 [16:36<17:37, 2.49s/it] 38%|███████████████████████████████████████████ | 257/681 [16:38<17:50, 2.52s/it] {'loss': 0.4202, 'grad_norm': 58.855709075927734, 'learning_rate': 3.933941090877615e-07, 'rewards/chosen': -1.2731714248657227, 'rewards/rejected': -3.4311952590942383, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.1580240726470947, 'logps/chosen': -125.66242980957031, 'logps/rejected': -329.75701904296875, 'logps/ref_chosen': -34.74375534057617, 'logps/ref_rejected': -84.338134765625, 'logits/chosen': -1.8602499961853027, 'logits/rejected': -1.7255544662475586, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.013992580585181713, 'epsilon_dpo/loss_margin_mean': 154.5001983642578, 'epsilon_dpo/beta_margin_mean': 2.1580240726470947, 'epsilon_dpo/beta_margin_std': 
1.4499742984771729, 'epsilon_dpo/beta_margin_grad_mean': -0.1678382307291031, 'epsilon_dpo/beta_margin_grad_std': 0.16572539508342743, 'kl/beta': 0.014110236428678036, 'kl/avg_steps': 0.84375, 'epoch': 0.38} 38%|███████████████████████████████████████████ | 257/681 [16:38<17:50, 2.52s/it] 38%|███████████████████████████████████████████▏ | 258/681 [16:41<17:31, 2.49s/it] {'loss': 0.4481, 'grad_norm': 56.01180648803711, 'learning_rate': 3.923409817553284e-07, 'rewards/chosen': -1.2534571886062622, 'rewards/rejected': -3.497140884399414, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.2436838150024414, 'logps/chosen': -128.53773498535156, 'logps/rejected': -356.6092529296875, 'logps/ref_chosen': -38.32011032104492, 'logps/ref_rejected': -104.34061431884766, 'logits/chosen': -1.829561710357666, 'logits/rejected': -1.8402704000473022, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.013875506818294525, 'epsilon_dpo/loss_margin_mean': 162.05101013183594, 'epsilon_dpo/beta_margin_mean': 2.2436838150024414, 'epsilon_dpo/beta_margin_std': 1.5077780485153198, 'epsilon_dpo/beta_margin_grad_mean': -0.16494029760360718, 'epsilon_dpo/beta_margin_grad_std': 0.19155991077423096, 'kl/beta': 0.01399217825382948, 'kl/avg_steps': 0.84375, 'epoch': 0.38} 38%|███████████████████████████████████████████▏ | 258/681 [16:41<17:31, 2.49s/it] 38%|███████████████████████████████████████████▎ | 259/681 [16:43<17:29, 2.49s/it] {'loss': 0.4435, 'grad_norm': 53.66272735595703, 'learning_rate': 3.9128410360564793e-07, 'rewards/chosen': -1.3125896453857422, 'rewards/rejected': -3.3289833068847656, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.0163938999176025, 'logps/chosen': -133.28622436523438, 'logps/rejected': -336.7703552246094, 'logps/ref_chosen': -37.96626663208008, 'logps/ref_rejected': -94.62816619873047, 'logits/chosen': -1.766379952430725, 'logits/rejected': -1.7046854496002197, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 
'epsilon_dpo/beta': 0.013755074702203274, 'epsilon_dpo/loss_margin_mean': 146.82223510742188, 'epsilon_dpo/beta_margin_mean': 2.0163936614990234, 'epsilon_dpo/beta_margin_std': 1.3358848094940186, 'epsilon_dpo/beta_margin_grad_mean': -0.17741423845291138, 'epsilon_dpo/beta_margin_grad_std': 0.16296601295471191, 'kl/beta': 0.01387510634958744, 'kl/avg_steps': 0.875, 'epoch': 0.38} 38%|███████████████████████████████████████████▎ | 259/681 [16:43<17:29, 2.49s/it] 38%|███████████████████████████████████████████▌ | 260/681 [16:46<17:39, 2.52s/it] {'loss': 0.4876, 'grad_norm': 71.4144287109375, 'learning_rate': 3.9022350248844246e-07, 'rewards/chosen': -1.3459320068359375, 'rewards/rejected': -3.411945343017578, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.0660133361816406, 'logps/chosen': -128.06361389160156, 'logps/rejected': -351.5653076171875, 'logps/ref_chosen': -29.670434951782227, 'logps/ref_rejected': -101.38003540039062, 'logits/chosen': -1.7605104446411133, 'logits/rejected': -1.844254732131958, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.01364436000585556, 'epsilon_dpo/loss_margin_mean': 151.79209899902344, 'epsilon_dpo/beta_margin_mean': 2.0660133361816406, 'epsilon_dpo/beta_margin_std': 1.5317002534866333, 'epsilon_dpo/beta_margin_grad_mean': -0.18185892701148987, 'epsilon_dpo/beta_margin_grad_std': 0.1839314103126526, 'kl/beta': 0.013754752464592457, 'kl/avg_steps': 0.8125, 'epoch': 0.38} 38%|███████████████████████████████████████████▌ | 260/681 [16:46<17:39, 2.52s/it] 38%|███████████████████████████████████████████▋ | 261/681 [16:48<17:34, 2.51s/it] {'loss': 0.396, 'grad_norm': 66.16899871826172, 'learning_rate': 3.891592063515376e-07, 'rewards/chosen': -1.1082427501678467, 'rewards/rejected': -3.654106855392456, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.5458641052246094, 'logps/chosen': -113.61106872558594, 'logps/rejected': -365.1859436035156, 'logps/ref_chosen': -31.892623901367188, 
'logps/ref_rejected': -94.94419860839844, 'logits/chosen': -1.7042864561080933, 'logits/rejected': -1.5886824131011963, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.013534392230212688, 'epsilon_dpo/loss_margin_mean': 188.52330017089844, 'epsilon_dpo/beta_margin_mean': 2.5458641052246094, 'epsilon_dpo/beta_margin_std': 1.7510801553726196, 'epsilon_dpo/beta_margin_grad_mean': -0.14695821702480316, 'epsilon_dpo/beta_margin_grad_std': 0.183539479970932, 'kl/beta': 0.013643896207213402, 'kl/avg_steps': 0.8125, 'epoch': 0.38} 38%|███████████████████████████████████████████▋ | 261/681 [16:48<17:34, 2.51s/it] 38%|███████████████████████████████████████████▊ | 262/681 [16:51<17:15, 2.47s/it] {'loss': 0.4156, 'grad_norm': 56.77939224243164, 'learning_rate': 3.880912432401264e-07, 'rewards/chosen': -1.348780632019043, 'rewards/rejected': -3.3994228839874268, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.050642251968384, 'logps/chosen': -134.6734619140625, 'logps/rejected': -343.780029296875, 'logps/ref_chosen': -34.26129913330078, 'logps/ref_rejected': -90.29788208007812, 'logits/chosen': -1.6830520629882812, 'logits/rejected': -1.5291244983673096, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.013416852802038193, 'epsilon_dpo/loss_margin_mean': 153.06997680664062, 'epsilon_dpo/beta_margin_mean': 2.050642251968384, 'epsilon_dpo/beta_margin_std': 1.2614741325378418, 'epsilon_dpo/beta_margin_grad_mean': -0.16779224574565887, 'epsilon_dpo/beta_margin_grad_std': 0.15692253410816193, 'kl/beta': 0.013533933088183403, 'kl/avg_steps': 0.875, 'epoch': 0.38} 38%|███████████████████████████████████████████▊ | 262/681 [16:51<17:15, 2.47s/it] 39%|████████████████████████████████████████████ | 263/681 [16:53<17:39, 2.53s/it] {'loss': 0.3749, 'grad_norm': 59.517337799072266, 'learning_rate': 3.870196412960302e-07, 'rewards/chosen': -1.233377456665039, 'rewards/rejected': -3.5795066356658936, 'rewards/accuracies': 
0.921875, 'rewards/margins': 2.3461294174194336, 'logps/chosen': -131.9771728515625, 'logps/rejected': -371.615966796875, 'logps/ref_chosen': -39.4767951965332, 'logps/ref_rejected': -102.44886779785156, 'logits/chosen': -1.8858226537704468, 'logits/rejected': -1.5893621444702148, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.013304665684700012, 'epsilon_dpo/loss_margin_mean': 176.666748046875, 'epsilon_dpo/beta_margin_mean': 2.3461291790008545, 'epsilon_dpo/beta_margin_std': 1.4775606393814087, 'epsilon_dpo/beta_margin_grad_mean': -0.14855755865573883, 'epsilon_dpo/beta_margin_grad_std': 0.16537067294120789, 'kl/beta': 0.013416538015007973, 'kl/avg_steps': 0.84375, 'epoch': 0.39} 39%|████████████████████████████████████████████ | 263/681 [16:53<17:39, 2.53s/it] 39%|████████████████████████████████████████████▏ | 264/681 [16:56<17:53, 2.58s/it] {'loss': 0.3936, 'grad_norm': 60.96957015991211, 'learning_rate': 3.8594442875695665e-07, 'rewards/chosen': -1.1168205738067627, 'rewards/rejected': -3.4131574630737305, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.296337127685547, 'logps/chosen': -123.23445129394531, 'logps/rejected': -359.046875, 'logps/ref_chosen': -38.707645416259766, 'logps/ref_rejected': -100.15180969238281, 'logits/chosen': -1.6896053552627563, 'logits/rejected': -1.7085572481155396, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.01318918913602829, 'epsilon_dpo/loss_margin_mean': 174.36825561523438, 'epsilon_dpo/beta_margin_mean': 2.2963368892669678, 'epsilon_dpo/beta_margin_std': 1.5442652702331543, 'epsilon_dpo/beta_margin_grad_mean': -0.15769457817077637, 'epsilon_dpo/beta_margin_grad_std': 0.15815286338329315, 'kl/beta': 0.013304282911121845, 'kl/avg_steps': 0.875, 'epoch': 0.39} 39%|████████████████████████████████████████████▏ | 264/681 [16:56<17:53, 2.58s/it] 39%|████████████████████████████████████████████▎ | 265/681 [16:58<17:35, 2.54s/it] {'loss': 0.4478, 
'grad_norm': 61.71824645996094, 'learning_rate': 3.848656339557562e-07, 'rewards/chosen': -1.4101543426513672, 'rewards/rejected': -3.5063629150390625, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.0962088108062744, 'logps/chosen': -145.28750610351562, 'logps/rejected': -362.984619140625, 'logps/ref_chosen': -37.51420593261719, 'logps/ref_rejected': -94.66896057128906, 'logits/chosen': -1.6642320156097412, 'logits/rejected': -1.5806505680084229, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.013070664368569851, 'epsilon_dpo/loss_margin_mean': 160.54234313964844, 'epsilon_dpo/beta_margin_mean': 2.0962088108062744, 'epsilon_dpo/beta_margin_std': 1.5082602500915527, 'epsilon_dpo/beta_margin_grad_mean': -0.17950545251369476, 'epsilon_dpo/beta_margin_grad_std': 0.168075293302536, 'kl/beta': 0.013188880868256092, 'kl/avg_steps': 0.90625, 'epoch': 0.39} 39%|████████████████████████████████████████████▎ | 265/681 [16:58<17:35, 2.54s/it] 39%|████████████████████████████████████████████▌ | 266/681 [17:01<17:39, 2.55s/it] {'loss': 0.5442, 'grad_norm': 68.82777404785156, 'learning_rate': 3.8378328531967507e-07, 'rewards/chosen': -1.5156571865081787, 'rewards/rejected': -3.358886241912842, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.843229055404663, 'logps/chosen': -167.84719848632812, 'logps/rejected': -332.4281311035156, 'logps/ref_chosen': -51.23357391357422, 'logps/ref_rejected': -73.29232025146484, 'logits/chosen': -1.6910204887390137, 'logits/rejected': -1.2894883155822754, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.012973698787391186, 'epsilon_dpo/loss_margin_mean': 142.52218627929688, 'epsilon_dpo/beta_margin_mean': 1.843229055404663, 'epsilon_dpo/beta_margin_std': 1.3907108306884766, 'epsilon_dpo/beta_margin_grad_mean': -0.20105549693107605, 'epsilon_dpo/beta_margin_grad_std': 0.19386924803256989, 'kl/beta': 0.013070429675281048, 'kl/avg_steps': 0.75, 'epoch': 0.39} 
39%|████████████████████████████████████████████▌ | 266/681 [17:01<17:39, 2.55s/it] 39%|████████████████████████████████████████████▋ | 267/681 [17:03<17:29, 2.54s/it] {'loss': 0.4993, 'grad_norm': 67.35930633544922, 'learning_rate': 3.8269741136960646e-07, 'rewards/chosen': -1.3415794372558594, 'rewards/rejected': -3.271149158477783, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.9295697212219238, 'logps/chosen': -154.36102294921875, 'logps/rejected': -349.48822021484375, 'logps/ref_chosen': -50.439453125, 'logps/ref_rejected': -95.28913879394531, 'logits/chosen': -1.8555903434753418, 'logits/rejected': -1.507951021194458, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.012881174683570862, 'epsilon_dpo/loss_margin_mean': 150.2775115966797, 'epsilon_dpo/beta_margin_mean': 1.9295697212219238, 'epsilon_dpo/beta_margin_std': 1.4559646844863892, 'epsilon_dpo/beta_margin_grad_mean': -0.19470059871673584, 'epsilon_dpo/beta_margin_grad_std': 0.1797318160533905, 'kl/beta': 0.012973131611943245, 'kl/avg_steps': 0.71875, 'epoch': 0.39} 39%|████████████████████████████████████████████▋ | 267/681 [17:04<17:29, 2.54s/it] 39%|████████████████████████████████████████████▊ | 268/681 [17:06<17:23, 2.53s/it] {'loss': 0.5814, 'grad_norm': 80.12260437011719, 'learning_rate': 3.8160804071933894e-07, 'rewards/chosen': -1.461806058883667, 'rewards/rejected': -3.4406800270080566, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.9788737297058105, 'logps/chosen': -156.72665405273438, 'logps/rejected': -377.4356384277344, 'logps/ref_chosen': -42.84587097167969, 'logps/ref_rejected': -108.26278686523438, 'logits/chosen': -1.6909327507019043, 'logits/rejected': -1.6416935920715332, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.012789253145456314, 'epsilon_dpo/loss_margin_mean': 155.29205322265625, 'epsilon_dpo/beta_margin_mean': 1.9788737297058105, 'epsilon_dpo/beta_margin_std': 1.7514647245407104, 
'epsilon_dpo/beta_margin_grad_mean': -0.21375124156475067, 'epsilon_dpo/beta_margin_grad_std': 0.20960654318332672, 'kl/beta': 0.012880552560091019, 'kl/avg_steps': 0.71875, 'epoch': 0.39} 39%|████████████████████████████████████████████▊ | 268/681 [17:06<17:23, 2.53s/it] 40%|█████████████████████████████████████████████ | 269/681 [17:09<17:20, 2.52s/it] {'loss': 0.3916, 'grad_norm': 59.1228141784668, 'learning_rate': 3.8051520207480204e-07, 'rewards/chosen': -1.253930926322937, 'rewards/rejected': -3.7381930351257324, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.484261989593506, 'logps/chosen': -142.43875122070312, 'logps/rejected': -410.3055419921875, 'logps/ref_chosen': -43.99567413330078, 'logps/ref_rejected': -115.62770080566406, 'logits/chosen': -1.7581638097763062, 'logits/rejected': -1.7033562660217285, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.012697985395789146, 'epsilon_dpo/loss_margin_mean': 196.23475646972656, 'epsilon_dpo/beta_margin_mean': 2.484262228012085, 'epsilon_dpo/beta_margin_std': 1.5870592594146729, 'epsilon_dpo/beta_margin_grad_mean': -0.15136900544166565, 'epsilon_dpo/beta_margin_grad_std': 0.18286660313606262, 'kl/beta': 0.012788633815944195, 'kl/avg_steps': 0.71875, 'epoch': 0.4} 40%|█████████████████████████████████████████████ | 269/681 [17:09<17:20, 2.52s/it] 40%|█████████████████████████████████████████████▏ | 270/681 [17:11<17:29, 2.55s/it] {'loss': 0.512, 'grad_norm': 79.74817657470703, 'learning_rate': 3.794189242333106e-07, 'rewards/chosen': -1.5177865028381348, 'rewards/rejected': -3.590897798538208, 'rewards/accuracies': 0.875, 'rewards/margins': 2.0731112957000732, 'logps/chosen': -164.27764892578125, 'logps/rejected': -401.3154296875, 'logps/ref_chosen': -44.127079010009766, 'logps/ref_rejected': -116.14840698242188, 'logits/chosen': -1.8194841146469116, 'logits/rejected': -1.6801855564117432, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 
0.012599433772265911, 'epsilon_dpo/loss_margin_mean': 165.01644897460938, 'epsilon_dpo/beta_margin_mean': 2.0731112957000732, 'epsilon_dpo/beta_margin_std': 1.5216357707977295, 'epsilon_dpo/beta_margin_grad_mean': -0.18488836288452148, 'epsilon_dpo/beta_margin_grad_std': 0.2021288424730301, 'kl/beta': 0.012697371654212475, 'kl/avg_steps': 0.78125, 'epoch': 0.4} 40%|█████████████████████████████████████████████▏ | 270/681 [17:11<17:29, 2.55s/it] 40%|█████████████████████████████████████████████▎ | 271/681 [17:14<17:13, 2.52s/it] {'loss': 0.5335, 'grad_norm': 58.63235855102539, 'learning_rate': 3.7831923608280514e-07, 'rewards/chosen': -1.219379186630249, 'rewards/rejected': -3.2758355140686035, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.0564560890197754, 'logps/chosen': -133.10488891601562, 'logps/rejected': -360.34832763671875, 'logps/ref_chosen': -35.64381408691406, 'logps/ref_rejected': -97.9295883178711, 'logits/chosen': -1.7537434101104736, 'logits/rejected': -1.6001136302947998, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.012493887916207314, 'epsilon_dpo/loss_margin_mean': 164.9576873779297, 'epsilon_dpo/beta_margin_mean': 2.0564560890197754, 'epsilon_dpo/beta_margin_std': 1.6271862983703613, 'epsilon_dpo/beta_margin_grad_mean': -0.19335998594760895, 'epsilon_dpo/beta_margin_grad_std': 0.19798798859119415, 'kl/beta': 0.012598942033946514, 'kl/avg_steps': 0.84375, 'epoch': 0.4} 40%|█████████████████████████████████████████████▎ | 271/681 [17:14<17:13, 2.52s/it] 40%|█████████████████████████████████████████████▌ | 272/681 [17:16<17:23, 2.55s/it] {'loss': 0.336, 'grad_norm': 43.1225471496582, 'learning_rate': 3.772161666010912e-07, 'rewards/chosen': -1.065723180770874, 'rewards/rejected': -3.6643643379211426, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.5986409187316895, 'logps/chosen': -112.50562286376953, 'logps/rejected': -408.69091796875, 'logps/ref_chosen': -26.52655792236328, 'logps/ref_rejected': 
-112.5738525390625, 'logits/chosen': -1.6689045429229736, 'logits/rejected': -1.6010162830352783, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.012385448440909386, 'epsilon_dpo/loss_margin_mean': 210.13800048828125, 'epsilon_dpo/beta_margin_mean': 2.5986411571502686, 'epsilon_dpo/beta_margin_std': 1.48728609085083, 'epsilon_dpo/beta_margin_grad_mean': -0.12951304018497467, 'epsilon_dpo/beta_margin_grad_std': 0.1683792918920517, 'kl/beta': 0.012493528425693512, 'kl/avg_steps': 0.875, 'epoch': 0.4} 40%|█████████████████████████████████████████████▌ | 272/681 [17:16<17:23, 2.55s/it] 40%|█████████████████████████████████████████████▋ | 273/681 [17:19<17:18, 2.54s/it] {'loss': 0.3461, 'grad_norm': 50.07746505737305, 'learning_rate': 3.761097448550755e-07, 'rewards/chosen': -1.2572829723358154, 'rewards/rejected': -3.6104252338409424, 'rewards/accuracies': 0.984375, 'rewards/margins': 2.353142261505127, 'logps/chosen': -137.0894317626953, 'logps/rejected': -391.7444152832031, 'logps/ref_chosen': -34.63496017456055, 'logps/ref_rejected': -97.36636352539062, 'logits/chosen': -1.71974515914917, 'logits/rejected': -1.6216449737548828, 'kl/p_epsilon_steps': 0.984375, 'kl/n_epsilon_steps': 0.015625, 'epsilon_dpo/beta': 0.012266403064131737, 'epsilon_dpo/loss_margin_mean': 191.923583984375, 'epsilon_dpo/beta_margin_mean': 2.353142261505127, 'epsilon_dpo/beta_margin_std': 1.3905528783798218, 'epsilon_dpo/beta_margin_grad_mean': -0.14163939654827118, 'epsilon_dpo/beta_margin_grad_std': 0.14621266722679138, 'kl/beta': 0.012385157868266106, 'kl/avg_steps': 0.96875, 'epoch': 0.4} 40%|█████████████████████████████████████████████▋ | 273/681 [17:19<17:18, 2.54s/it] 40%|█████████████████████████████████████████████▊ | 274/681 [17:21<17:41, 2.61s/it] {'loss': 0.4745, 'grad_norm': 58.21686553955078, 'learning_rate': 3.75e-07, 'rewards/chosen': -1.4374797344207764, 'rewards/rejected': -3.5061697959899902, 'rewards/accuracies': 0.921875, 
'rewards/margins': 2.0686898231506348, 'logps/chosen': -150.64378356933594, 'logps/rejected': -371.6086730957031, 'logps/ref_chosen': -32.576805114746094, 'logps/ref_rejected': -83.13209533691406, 'logits/chosen': -1.5487819910049438, 'logits/rejected': -1.4224430322647095, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.012164046056568623, 'epsilon_dpo/loss_margin_mean': 170.40960693359375, 'epsilon_dpo/beta_margin_mean': 2.068690061569214, 'epsilon_dpo/beta_margin_std': 1.446864366531372, 'epsilon_dpo/beta_margin_grad_mean': -0.1828116923570633, 'epsilon_dpo/beta_margin_grad_std': 0.1835782825946808, 'kl/beta': 0.012266327627003193, 'kl/avg_steps': 0.84375, 'epoch': 0.4} 40%|█████████████████████████████████████████████▊ | 274/681 [17:22<17:41, 2.61s/it] 40%|██████████████████████████████████████████████ | 275/681 [17:24<17:25, 2.58s/it] {'loss': 0.5343, 'grad_norm': 58.6602783203125, 'learning_rate': 3.738869612786737e-07, 'rewards/chosen': -1.4348502159118652, 'rewards/rejected': -3.4032628536224365, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.9684127569198608, 'logps/chosen': -152.82989501953125, 'logps/rejected': -380.90191650390625, 'logps/ref_chosen': -34.1890983581543, 'logps/ref_rejected': -98.71599578857422, 'logits/chosen': -1.6947541236877441, 'logits/rejected': -1.5999658107757568, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.012066072784364223, 'epsilon_dpo/loss_margin_mean': 163.5451202392578, 'epsilon_dpo/beta_margin_mean': 1.9684127569198608, 'epsilon_dpo/beta_margin_std': 1.5511200428009033, 'epsilon_dpo/beta_margin_grad_mean': -0.1988518238067627, 'epsilon_dpo/beta_margin_grad_std': 0.19736242294311523, 'kl/beta': 0.012163696810603142, 'kl/avg_steps': 0.8125, 'epoch': 0.4} 40%|██████████████████████████████████████████████ | 275/681 [17:24<17:25, 2.58s/it] 41%|██████████████████████████████████████████████▏ | 276/681 [17:27<17:19, 2.57s/it] {'loss': 0.537, 
'grad_norm': 62.463478088378906, 'learning_rate': 3.7277065802070204e-07, 'rewards/chosen': -1.6980412006378174, 'rewards/rejected': -3.442594289779663, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.7445529699325562, 'logps/chosen': -188.8147735595703, 'logps/rejected': -364.2151794433594, 'logps/ref_chosen': -47.06272888183594, 'logps/ref_rejected': -76.37776947021484, 'logits/chosen': -1.75152587890625, 'logits/rejected': -1.3374929428100586, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.011965055949985981, 'epsilon_dpo/loss_margin_mean': 146.08535766601562, 'epsilon_dpo/beta_margin_mean': 1.7445530891418457, 'epsilon_dpo/beta_margin_std': 1.3083961009979248, 'epsilon_dpo/beta_margin_grad_mean': -0.2083517462015152, 'epsilon_dpo/beta_margin_grad_std': 0.1770058423280716, 'kl/beta': 0.012065663002431393, 'kl/avg_steps': 0.84375, 'epoch': 0.41} 41%|██████████████████████████████████████████████▏ | 276/681 [17:27<17:19, 2.57s/it] 41%|██████████████████████████████████████████████▎ | 277/681 [17:29<16:51, 2.50s/it] {'loss': 0.4258, 'grad_norm': 52.0661735534668, 'learning_rate': 3.71651119641714e-07, 'rewards/chosen': -1.4755170345306396, 'rewards/rejected': -3.613009452819824, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.1374926567077637, 'logps/chosen': -159.37274169921875, 'logps/rejected': -404.24322509765625, 'logps/ref_chosen': -35.24298095703125, 'logps/ref_rejected': -99.66352844238281, 'logits/chosen': -1.622573971748352, 'logits/rejected': -1.471212387084961, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.01186494529247284, 'epsilon_dpo/loss_margin_mean': 180.449951171875, 'epsilon_dpo/beta_margin_mean': 2.1374924182891846, 'epsilon_dpo/beta_margin_std': 1.4325969219207764, 'epsilon_dpo/beta_margin_grad_mean': -0.17211416363716125, 'epsilon_dpo/beta_margin_grad_std': 0.16370359063148499, 'kl/beta': 0.011964711360633373, 'kl/avg_steps': 0.84375, 'epoch': 0.41} 
41%|██████████████████████████████████████████████▎ | 277/681 [17:29<16:51, 2.50s/it] 41%|██████████████████████████████████████████████▌ | 278/681 [17:31<16:57, 2.52s/it] {'loss': 0.4658, 'grad_norm': 54.76714324951172, 'learning_rate': 3.705283756425872e-07, 'rewards/chosen': -1.5532996654510498, 'rewards/rejected': -3.5797531604766846, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.0264534950256348, 'logps/chosen': -169.8358154296875, 'logps/rejected': -400.4996337890625, 'logps/ref_chosen': -38.0869140625, 'logps/ref_rejected': -96.20486450195312, 'logits/chosen': -1.7193242311477661, 'logits/rejected': -1.5687487125396729, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.011769380420446396, 'epsilon_dpo/loss_margin_mean': 172.5458526611328, 'epsilon_dpo/beta_margin_mean': 2.0264534950256348, 'epsilon_dpo/beta_margin_std': 1.5373589992523193, 'epsilon_dpo/beta_margin_grad_mean': -0.18432171642780304, 'epsilon_dpo/beta_margin_grad_std': 0.17036166787147522, 'kl/beta': 0.011864603497087955, 'kl/avg_steps': 0.8125, 'epoch': 0.41} 41%|██████████████████████████████████████████████▌ | 278/681 [17:31<16:57, 2.52s/it] 41%|██████████████████████████████████████████████▋ | 279/681 [17:34<16:29, 2.46s/it] {'loss': 0.4424, 'grad_norm': 56.205360412597656, 'learning_rate': 3.6940245560867e-07, 'rewards/chosen': -1.4212896823883057, 'rewards/rejected': -3.5274763107299805, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.106186628341675, 'logps/chosen': -151.58514404296875, 'logps/rejected': -395.55865478515625, 'logps/ref_chosen': -29.908262252807617, 'logps/ref_rejected': -93.1087646484375, 'logits/chosen': -1.4648022651672363, 'logits/rejected': -1.4414713382720947, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.011667169630527496, 'epsilon_dpo/loss_margin_mean': 180.77297973632812, 'epsilon_dpo/beta_margin_mean': 2.106186628341675, 'epsilon_dpo/beta_margin_std': 1.458591341972351, 
'epsilon_dpo/beta_margin_grad_mean': -0.17751751840114594, 'epsilon_dpo/beta_margin_grad_std': 0.16590261459350586, 'kl/beta': 0.011768980883061886, 'kl/avg_steps': 0.875, 'epoch': 0.41} 41%|██████████████████████████████████████████████▋ | 279/681 [17:34<16:29, 2.46s/it] 41%|██████████████████████████████████████████████▊ | 280/681 [17:36<16:24, 2.46s/it] {'loss': 0.417, 'grad_norm': 52.6536865234375, 'learning_rate': 3.6827338920900253e-07, 'rewards/chosen': -1.4572311639785767, 'rewards/rejected': -3.623271942138672, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.1660408973693848, 'logps/chosen': -165.87985229492188, 'logps/rejected': -418.8723449707031, 'logps/ref_chosen': -40.1248664855957, 'logps/ref_rejected': -105.53138732910156, 'logits/chosen': -1.6275845766067505, 'logits/rejected': -1.6037421226501465, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.01156961265951395, 'epsilon_dpo/loss_margin_mean': 187.58596801757812, 'epsilon_dpo/beta_margin_mean': 2.1660408973693848, 'epsilon_dpo/beta_margin_std': 1.4027117490768433, 'epsilon_dpo/beta_margin_grad_mean': -0.16350157558918, 'epsilon_dpo/beta_margin_grad_std': 0.16755622625350952, 'kl/beta': 0.011666894890367985, 'kl/avg_steps': 0.84375, 'epoch': 0.41} 41%|██████████████████████████████████████████████▊ | 280/681 [17:36<16:24, 2.46s/it] 41%|███████████████████████████████████████████████ | 281/681 [17:39<16:09, 2.42s/it] {'loss': 0.45, 'grad_norm': 54.2619743347168, 'learning_rate': 3.6714120619553435e-07, 'rewards/chosen': -1.4972405433654785, 'rewards/rejected': -3.5504188537597656, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.053178548812866, 'logps/chosen': -169.83157348632812, 'logps/rejected': -396.84686279296875, 'logps/ref_chosen': -39.501502990722656, 'logps/ref_rejected': -87.23008728027344, 'logits/chosen': -1.6719520092010498, 'logits/rejected': -1.367035150527954, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 
0.011469194665551186, 'epsilon_dpo/loss_margin_mean': 179.28672790527344, 'epsilon_dpo/beta_margin_mean': 2.053178548812866, 'epsilon_dpo/beta_margin_std': 1.504626750946045, 'epsilon_dpo/beta_margin_grad_mean': -0.18068785965442657, 'epsilon_dpo/beta_margin_grad_std': 0.1649240106344223, 'kl/beta': 0.011569279246032238, 'kl/avg_steps': 0.875, 'epoch': 0.41} 41%|███████████████████████████████████████████████ | 281/681 [17:39<16:09, 2.42s/it] 41%|███████████████████████████████████████████████▏ | 282/681 [17:41<16:16, 2.45s/it] {'loss': 0.4842, 'grad_norm': 54.99889373779297, 'learning_rate': 3.660059364023408e-07, 'rewards/chosen': -1.4642865657806396, 'rewards/rejected': -3.4173216819763184, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.9530349969863892, 'logps/chosen': -174.48223876953125, 'logps/rejected': -402.43865966796875, 'logps/ref_chosen': -46.00492858886719, 'logps/ref_rejected': -101.88005828857422, 'logits/chosen': -1.6971466541290283, 'logits/rejected': -1.5857012271881104, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.011376879177987576, 'epsilon_dpo/loss_margin_mean': 172.08128356933594, 'epsilon_dpo/beta_margin_mean': 1.9530349969863892, 'epsilon_dpo/beta_margin_std': 1.3995331525802612, 'epsilon_dpo/beta_margin_grad_mean': -0.18481247127056122, 'epsilon_dpo/beta_margin_grad_std': 0.17733608186244965, 'kl/beta': 0.011468926444649696, 'kl/avg_steps': 0.8125, 'epoch': 0.41} 41%|███████████████████████████████████████████████▏ | 282/681 [17:41<16:16, 2.45s/it] 42%|███████████████████████████████████████████████▎ | 283/681 [17:43<16:00, 2.41s/it] {'loss': 0.3582, 'grad_norm': 47.481971740722656, 'learning_rate': 3.6486760974483685e-07, 'rewards/chosen': -1.330566644668579, 'rewards/rejected': -3.6710352897644043, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.340468645095825, 'logps/chosen': -153.2581329345703, 'logps/rejected': -428.9803466796875, 'logps/ref_chosen': -35.3682861328125, 
'logps/ref_rejected': -103.27058410644531, 'logits/chosen': -1.6741735935211182, 'logits/rejected': -1.502656102180481, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.01127807516604662, 'epsilon_dpo/loss_margin_mean': 207.81991577148438, 'epsilon_dpo/beta_margin_mean': 2.340468645095825, 'epsilon_dpo/beta_margin_std': 1.3866698741912842, 'epsilon_dpo/beta_margin_grad_mean': -0.1464729756116867, 'epsilon_dpo/beta_margin_grad_std': 0.15092232823371887, 'kl/beta': 0.011376491747796535, 'kl/avg_steps': 0.875, 'epoch': 0.42} 42%|███████████████████████████████████████████████▎ | 283/681 [17:43<16:00, 2.41s/it] 42%|███████████████████████████████████████████████▌ | 284/681 [17:46<16:11, 2.45s/it] {'loss': 0.4919, 'grad_norm': 64.37258911132812, 'learning_rate': 3.6372625621898863e-07, 'rewards/chosen': -1.4272245168685913, 'rewards/rejected': -3.583281993865967, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.156057357788086, 'logps/chosen': -168.32333374023438, 'logps/rejected': -419.96551513671875, 'logps/ref_chosen': -41.10857391357422, 'logps/ref_rejected': -99.55363464355469, 'logits/chosen': -1.67313551902771, 'logits/rejected': -1.5210282802581787, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.011197871528565884, 'epsilon_dpo/loss_margin_mean': 193.1970977783203, 'epsilon_dpo/beta_margin_mean': 2.156057596206665, 'epsilon_dpo/beta_margin_std': 1.6328465938568115, 'epsilon_dpo/beta_margin_grad_mean': -0.18323473632335663, 'epsilon_dpo/beta_margin_grad_std': 0.1965819150209427, 'kl/beta': 0.011277811601758003, 'kl/avg_steps': 0.71875, 'epoch': 0.42} 42%|███████████████████████████████████████████████▌ | 284/681 [17:46<16:11, 2.45s/it] 42%|███████████████████████████████████████████████▋ | 285/681 [17:49<16:28, 2.50s/it] {'loss': 0.4048, 'grad_norm': 60.48746871948242, 'learning_rate': 3.625819059005228e-07, 'rewards/chosen': -1.4201911687850952, 'rewards/rejected': -3.7171897888183594, 
'rewards/accuracies': 0.953125, 'rewards/margins': 2.2969985008239746, 'logps/chosen': -163.57217407226562, 'logps/rejected': -443.7559814453125, 'logps/ref_chosen': -35.757301330566406, 'logps/ref_rejected': -108.66427612304688, 'logits/chosen': -1.64085054397583, 'logits/rejected': -1.4849854707717896, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.011096964590251446, 'epsilon_dpo/loss_margin_mean': 207.27685546875, 'epsilon_dpo/beta_margin_mean': 2.2969985008239746, 'epsilon_dpo/beta_margin_std': 1.3974249362945557, 'epsilon_dpo/beta_margin_grad_mean': -0.14801117777824402, 'epsilon_dpo/beta_margin_grad_std': 0.17136132717132568, 'kl/beta': 0.011197330430150032, 'kl/avg_steps': 0.90625, 'epoch': 0.42} 42%|███████████████████████████████████████████████▋ | 285/681 [17:49<16:28, 2.50s/it] 42%|███████████████████████████████████████████████▉ | 286/681 [17:52<17:27, 2.65s/it] {'loss': 0.4342, 'grad_norm': 60.148529052734375, 'learning_rate': 3.614345889441346e-07, 'rewards/chosen': -1.4335534572601318, 'rewards/rejected': -3.6558303833007812, 'rewards/accuracies': 0.9375, 'rewards/margins': 2.2222766876220703, 'logps/chosen': -175.14715576171875, 'logps/rejected': -428.2286376953125, 'logps/ref_chosen': -45.06391525268555, 'logps/ref_rejected': -95.78263854980469, 'logits/chosen': -1.7381117343902588, 'logits/rejected': -1.4690245389938354, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.011004237458109856, 'epsilon_dpo/loss_margin_mean': 202.3627471923828, 'epsilon_dpo/beta_margin_mean': 2.2222766876220703, 'epsilon_dpo/beta_margin_std': 1.464618444442749, 'epsilon_dpo/beta_margin_grad_mean': -0.16391685605049133, 'epsilon_dpo/beta_margin_grad_std': 0.18253783881664276, 'kl/beta': 0.011096766218543053, 'kl/avg_steps': 0.84375, 'epoch': 0.42} 42%|███████████████████████████████████████████████▉ | 286/681 [17:52<17:27, 2.65s/it] 42%|████████████████████████████████████████████████ | 287/681 
[17:54<17:01, 2.59s/it] {'loss': 0.611, 'grad_norm': 71.68363189697266, 'learning_rate': 3.6028433558269275e-07, 'rewards/chosen': -1.551443099975586, 'rewards/rejected': -3.22538423538208, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.6739410161972046, 'logps/chosen': -186.29312133789062, 'logps/rejected': -378.27142333984375, 'logps/ref_chosen': -44.68206787109375, 'logps/ref_rejected': -82.90010070800781, 'logits/chosen': -1.7347445487976074, 'logits/rejected': -1.4353196620941162, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.010929361917078495, 'epsilon_dpo/loss_margin_mean': 153.76026916503906, 'epsilon_dpo/beta_margin_mean': 1.6739410161972046, 'epsilon_dpo/beta_margin_std': 1.4721882343292236, 'epsilon_dpo/beta_margin_grad_mean': -0.23106051981449127, 'epsilon_dpo/beta_margin_grad_std': 0.19667471945285797, 'kl/beta': 0.011003920808434486, 'kl/avg_steps': 0.6875, 'epoch': 0.42} 42%|████████████████████████████████████████████████ | 287/681 [17:54<17:01, 2.59s/it] 42%|████████████████████████████████████████████████▏ | 288/681 [17:57<16:52, 2.58s/it] {'loss': 0.3983, 'grad_norm': 56.249759674072266, 'learning_rate': 3.5913117612644327e-07, 'rewards/chosen': -1.3546841144561768, 'rewards/rejected': -3.5305793285369873, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.1758952140808105, 'logps/chosen': -160.89662170410156, 'logps/rejected': -418.419921875, 'logps/ref_chosen': -35.92053985595703, 'logps/ref_rejected': -92.28993225097656, 'logits/chosen': -1.6811950206756592, 'logits/rejected': -1.4292771816253662, 'kl/p_epsilon_steps': 0.96875, 'kl/n_epsilon_steps': 0.03125, 'epsilon_dpo/beta': 0.010827410966157913, 'epsilon_dpo/loss_margin_mean': 201.15391540527344, 'epsilon_dpo/beta_margin_mean': 2.1758952140808105, 'epsilon_dpo/beta_margin_std': 1.2892285585403442, 'epsilon_dpo/beta_margin_grad_mean': -0.15691792964935303, 'epsilon_dpo/beta_margin_grad_std': 0.16194118559360504, 'kl/beta': 0.010928785428404808, 
'kl/avg_steps': 0.9375, 'epoch': 0.42} 42%|████████████████████████████████████████████████▏ | 288/681 [17:57<16:52, 2.58s/it] 42%|████████████████████████████████████████████████▍ | 289/681 [17:59<16:36, 2.54s/it] {'loss': 0.5158, 'grad_norm': 53.693756103515625, 'learning_rate': 3.5797514096221024e-07, 'rewards/chosen': -1.222680687904358, 'rewards/rejected': -3.2015156745910645, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.978835105895996, 'logps/chosen': -151.41265869140625, 'logps/rejected': -390.99652099609375, 'logps/ref_chosen': -37.65406036376953, 'logps/ref_rejected': -92.58161163330078, 'logits/chosen': -1.7554616928100586, 'logits/rejected': -1.5511056184768677, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.010733614675700665, 'epsilon_dpo/loss_margin_mean': 184.6563262939453, 'epsilon_dpo/beta_margin_mean': 1.978835105895996, 'epsilon_dpo/beta_margin_std': 1.5322097539901733, 'epsilon_dpo/beta_margin_grad_mean': -0.19883452355861664, 'epsilon_dpo/beta_margin_grad_std': 0.18830937147140503, 'kl/beta': 0.010827279649674892, 'kl/avg_steps': 0.875, 'epoch': 0.42} 42%|████████████████████████████████████████████████▍ | 289/681 [17:59<16:36, 2.54s/it] 43%|████████████████████████████████████████████████▌ | 290/681 [18:01<16:16, 2.50s/it] {'loss': 0.4168, 'grad_norm': 51.92376708984375, 'learning_rate': 3.568162605525952e-07, 'rewards/chosen': -1.2994060516357422, 'rewards/rejected': -3.3938629627227783, 'rewards/accuracies': 0.953125, 'rewards/margins': 2.094456911087036, 'logps/chosen': -165.3158416748047, 'logps/rejected': -442.55718994140625, 'logps/ref_chosen': -43.256103515625, 'logps/ref_rejected': -123.40228271484375, 'logits/chosen': -1.6529500484466553, 'logits/rejected': -1.7434873580932617, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.01063715573400259, 'epsilon_dpo/loss_margin_mean': 197.09515380859375, 'epsilon_dpo/beta_margin_mean': 2.094456911087036, 
'epsilon_dpo/beta_margin_std': 1.3732136487960815, 'epsilon_dpo/beta_margin_grad_mean': -0.17113091051578522, 'epsilon_dpo/beta_margin_grad_std': 0.15478722751140594, 'kl/beta': 0.010733362287282944, 'kl/avg_steps': 0.90625, 'epoch': 0.43} 43%|████████████████████████████████████████████████▌ | 290/681 [18:01<16:16, 2.50s/it] 43%|████████████████████████████████████████████████▋ | 291/681 [18:04<16:29, 2.54s/it] {'loss': 0.5064, 'grad_norm': 62.94866943359375, 'learning_rate': 3.5565456543517485e-07, 'rewards/chosen': -1.2878754138946533, 'rewards/rejected': -3.207477569580078, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.9196021556854248, 'logps/chosen': -165.5596466064453, 'logps/rejected': -398.55230712890625, 'logps/ref_chosen': -43.823760986328125, 'logps/ref_rejected': -94.49006652832031, 'logits/chosen': -1.8479794263839722, 'logits/rejected': -1.6821517944335938, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.010556564666330814, 'epsilon_dpo/loss_margin_mean': 182.32635498046875, 'epsilon_dpo/beta_margin_mean': 1.9196021556854248, 'epsilon_dpo/beta_margin_std': 1.422903060913086, 'epsilon_dpo/beta_margin_grad_mean': -0.1979316920042038, 'epsilon_dpo/beta_margin_grad_std': 0.18217770755290985, 'kl/beta': 0.010636964812874794, 'kl/avg_steps': 0.765625, 'epoch': 0.43} 43%|████████████████████████████████████████████████▋ | 291/681 [18:04<16:29, 2.54s/it] 43%|████████████████████████████████████████████████▉ | 292/681 [18:06<16:09, 2.49s/it] {'loss': 0.4426, 'grad_norm': 62.74113464355469, 'learning_rate': 3.5449008622169583e-07, 'rewards/chosen': -1.1173462867736816, 'rewards/rejected': -3.1657166481018066, 'rewards/accuracies': 0.921875, 'rewards/margins': 2.048370361328125, 'logps/chosen': -141.83103942871094, 'logps/rejected': -397.3466796875, 'logps/ref_chosen': -35.38202667236328, 'logps/ref_rejected': -94.85586547851562, 'logits/chosen': -1.7095630168914795, 'logits/rejected': -1.6350113153457642, 
'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.010471423156559467, 'epsilon_dpo/loss_margin_mean': 196.04180908203125, 'epsilon_dpo/beta_margin_mean': 2.048370122909546, 'epsilon_dpo/beta_margin_std': 1.3248964548110962, 'epsilon_dpo/beta_margin_grad_mean': -0.1748674064874649, 'epsilon_dpo/beta_margin_grad_std': 0.17317946255207062, 'kl/beta': 0.010556144639849663, 'kl/avg_steps': 0.8125, 'epoch': 0.43} 43%|████████████████████████████████████████████████▉ | 292/681 [18:06<16:09, 2.49s/it] 43%|█████████████████████████████████████████████████ | 293/681 [18:09<16:20, 2.53s/it] {'loss': 0.6168, 'grad_norm': 59.1533088684082, 'learning_rate': 3.5332285359726846e-07, 'rewards/chosen': -1.193084478378296, 'rewards/rejected': -2.81472110748291, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.6216366291046143, 'logps/chosen': -150.8321075439453, 'logps/rejected': -354.53131103515625, 'logps/ref_chosen': -36.39015197753906, 'logps/ref_rejected': -83.55977630615234, 'logits/chosen': -1.8089414834976196, 'logits/rejected': -1.613842248916626, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.010403390973806381, 'epsilon_dpo/loss_margin_mean': 156.5295867919922, 'epsilon_dpo/beta_margin_mean': 1.6216365098953247, 'epsilon_dpo/beta_margin_std': 1.3715225458145142, 'epsilon_dpo/beta_margin_grad_mean': -0.2334311306476593, 'epsilon_dpo/beta_margin_grad_std': 0.19696100056171417, 'kl/beta': 0.010471067391335964, 'kl/avg_steps': 0.65625, 'epoch': 0.43} 43%|█████████████████████████████████████████████████ | 293/681 [18:09<16:20, 2.53s/it] 43%|█████████████████████████████████████████████████▏ | 294/681 [18:11<16:11, 2.51s/it] {'loss': 0.5324, 'grad_norm': 53.15205001831055, 'learning_rate': 3.5215289831955786e-07, 'rewards/chosen': -1.1122854948043823, 'rewards/rejected': -2.79593563079834, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.6836501359939575, 'logps/chosen': -139.99928283691406, 
'logps/rejected': -358.48431396484375, 'logps/ref_chosen': -32.32667541503906, 'logps/ref_rejected': -87.2591552734375, 'logits/chosen': -1.6690821647644043, 'logits/rejected': -1.641129732131958, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.01031605713069439, 'epsilon_dpo/loss_margin_mean': 163.5525360107422, 'epsilon_dpo/beta_margin_mean': 1.6836501359939575, 'epsilon_dpo/beta_margin_std': 1.2057768106460571, 'epsilon_dpo/beta_margin_grad_mean': -0.20596943795681, 'epsilon_dpo/beta_margin_grad_std': 0.16830576956272125, 'kl/beta': 0.010402798652648926, 'kl/avg_steps': 0.84375, 'epoch': 0.43} 43%|█████████████████████████████████████████████████▏ | 294/681 [18:12<16:11, 2.51s/it] 43%|█████████████████████████████████████████████████▍ | 295/681 [18:14<15:59, 2.49s/it] {'loss': 0.5393, 'grad_norm': 73.17021179199219, 'learning_rate': 3.509802512179737e-07, 'rewards/chosen': -1.1549984216690063, 'rewards/rejected': -3.0700550079345703, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.9150567054748535, 'logps/chosen': -145.589599609375, 'logps/rejected': -396.48590087890625, 'logps/ref_chosen': -32.976951599121094, 'logps/ref_rejected': -96.25013732910156, 'logits/chosen': -1.7157707214355469, 'logits/rejected': -1.6699479818344116, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.010229743085801601, 'epsilon_dpo/loss_margin_mean': 187.62313842773438, 'epsilon_dpo/beta_margin_mean': 1.9150567054748535, 'epsilon_dpo/beta_margin_std': 1.5144377946853638, 'epsilon_dpo/beta_margin_grad_mean': -0.2001977413892746, 'epsilon_dpo/beta_margin_grad_std': 0.19301089644432068, 'kl/beta': 0.010315759107470512, 'kl/avg_steps': 0.84375, 'epoch': 0.43} 43%|█████████████████████████████████████████████████▍ | 295/681 [18:14<15:59, 2.49s/it] 43%|█████████████████████████████████████████████████▌ | 296/681 [18:16<15:48, 2.46s/it] {'loss': 0.5562, 'grad_norm': 54.07697296142578, 'learning_rate': 
3.498049431928577e-07, 'rewards/chosen': -1.1971094608306885, 'rewards/rejected': -2.926664352416992, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.7295548915863037, 'logps/chosen': -159.61997985839844, 'logps/rejected': -388.03179931640625, 'logps/ref_chosen': -41.81062316894531, 'logps/ref_rejected': -99.36541748046875, 'logits/chosen': -1.944000005722046, 'logits/rejected': -1.7023565769195557, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.010147349908947945, 'epsilon_dpo/loss_margin_mean': 170.8570556640625, 'epsilon_dpo/beta_margin_mean': 1.7295547723770142, 'epsilon_dpo/beta_margin_std': 1.38618004322052, 'epsilon_dpo/beta_margin_grad_mean': -0.21665216982364655, 'epsilon_dpo/beta_margin_grad_std': 0.17736758291721344, 'kl/beta': 0.010229448787868023, 'kl/avg_steps': 0.8125, 'epoch': 0.43} 43%|█████████████████████████████████████████████████▌ | 296/681 [18:16<15:48, 2.46s/it] 44%|█████████████████████████████████████████████████▋ | 297/681 [18:19<15:44, 2.46s/it] {'loss': 0.469, 'grad_norm': 52.08464431762695, 'learning_rate': 3.486270052146694e-07, 'rewards/chosen': -1.112633466720581, 'rewards/rejected': -2.900785446166992, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.7881522178649902, 'logps/chosen': -146.10855102539062, 'logps/rejected': -389.8145751953125, 'logps/ref_chosen': -35.64509582519531, 'logps/ref_rejected': -101.33485412597656, 'logits/chosen': -1.8551688194274902, 'logits/rejected': -1.8467998504638672, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.010062395595014095, 'epsilon_dpo/loss_margin_mean': 178.01626586914062, 'epsilon_dpo/beta_margin_mean': 1.7881520986557007, 'epsilon_dpo/beta_margin_std': 1.1349061727523804, 'epsilon_dpo/beta_margin_grad_mean': -0.19078318774700165, 'epsilon_dpo/beta_margin_grad_std': 0.15408045053482056, 'kl/beta': 0.010147004388272762, 'kl/avg_steps': 0.84375, 'epoch': 0.44} 
44%|█████████████████████████████████████████████████▋ | 297/681 [18:19<15:44, 2.46s/it] 44%|█████████████████████████████████████████████████▉ | 298/681 [18:22<16:12, 2.54s/it] {'loss': 0.4176, 'grad_norm': 47.372406005859375, 'learning_rate': 3.474464683231698e-07, 'rewards/chosen': -0.9769392013549805, 'rewards/rejected': -3.023324966430664, 'rewards/accuracies': 0.96875, 'rewards/margins': 2.0463857650756836, 'logps/chosen': -136.94308471679688, 'logps/rejected': -428.3704833984375, 'logps/ref_chosen': -39.13259506225586, 'logps/ref_rejected': -125.148193359375, 'logits/chosen': -1.854588508605957, 'logits/rejected': -1.974242091178894, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.009971914812922478, 'epsilon_dpo/loss_margin_mean': 205.4117889404297, 'epsilon_dpo/beta_margin_mean': 2.0463857650756836, 'epsilon_dpo/beta_margin_std': 1.3721272945404053, 'epsilon_dpo/beta_margin_grad_mean': -0.17282237112522125, 'epsilon_dpo/beta_margin_grad_std': 0.14743012189865112, 'kl/beta': 0.010062105022370815, 'kl/avg_steps': 0.90625, 'epoch': 0.44} 44%|█████████████████████████████████████████████████▉ | 298/681 [18:22<16:12, 2.54s/it] 44%|██████████████████████████████████████████████████ | 299/681 [18:24<16:27, 2.58s/it] {'loss': 0.4348, 'grad_norm': 50.078041076660156, 'learning_rate': 3.462633636266041e-07, 'rewards/chosen': -0.933772087097168, 'rewards/rejected': -2.8413944244384766, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.9076223373413086, 'logps/chosen': -122.997314453125, 'logps/rejected': -375.3733215332031, 'logps/ref_chosen': -28.626670837402344, 'logps/ref_rejected': -87.74382781982422, 'logits/chosen': -1.7307016849517822, 'logits/rejected': -1.819953203201294, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.009882355108857155, 'epsilon_dpo/loss_margin_mean': 193.25885009765625, 'epsilon_dpo/beta_margin_mean': 1.9076223373413086, 'epsilon_dpo/beta_margin_std': 
1.1600240468978882, 'epsilon_dpo/beta_margin_grad_mean': -0.17700159549713135, 'epsilon_dpo/beta_margin_grad_std': 0.15176741778850555, 'kl/beta': 0.009971735998988152, 'kl/avg_steps': 0.90625, 'epoch': 0.44} 44%|██████████████████████████████████████████████████ | 299/681 [18:24<16:27, 2.58s/it] 44%|██████████████████████████████████████████████████▏ | 300/681 [18:27<16:15, 2.56s/it] {'loss': 0.4996, 'grad_norm': 58.53467559814453, 'learning_rate': 3.4507772230088147e-07, 'rewards/chosen': -1.0862467288970947, 'rewards/rejected': -3.1269874572753906, 'rewards/accuracies': 0.90625, 'rewards/margins': 2.040740728378296, 'logps/chosen': -144.44647216796875, 'logps/rejected': -423.14324951171875, 'logps/ref_chosen': -33.894203186035156, 'logps/ref_rejected': -103.88007354736328, 'logits/chosen': -1.7490193843841553, 'logits/rejected': -1.7381911277770996, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.009802866727113724, 'epsilon_dpo/loss_margin_mean': 208.71090698242188, 'epsilon_dpo/beta_margin_mean': 2.040740728378296, 'epsilon_dpo/beta_margin_std': 1.4987692832946777, 'epsilon_dpo/beta_margin_grad_mean': -0.1854911744594574, 'epsilon_dpo/beta_margin_grad_std': 0.19396579265594482, 'kl/beta': 0.009882179088890553, 'kl/avg_steps': 0.8125, 'epoch': 0.44} 44%|██████████████████████████████████████████████████▏ | 300/681 [18:27<16:15, 2.56s/it][INFO|trainer.py:4307] 2026-04-18 09:49:49,167 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 09:49:49,167 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 09:49:49,167 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 09:54:37,937 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 09:54:37,937 >> Batch size = 8 0%| | 0/73 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-400 
[INFO|configuration_utils.py:419] 2026-04-18 09:55:32,405 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-400/config.json [INFO|configuration_utils.py:911] 2026-04-18 09:55:32,414 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-400/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-18 09:56:24,111 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-400/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-18 09:56:24,151 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-400/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-18 09:56:24,160 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-400/special_tokens_map.json 59%|█████████████████████████████████████████████████████████████████▉ | 401/681 [28:18<7:12:09, 92.60s/it] {'loss': 0.7457, 'grad_norm': 55.85029983520508, 'learning_rate': 2.1800473436235136e-07, 'rewards/chosen': -0.7069000005722046, 'rewards/rejected': -1.8603973388671875, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.153497338294983, 'logps/chosen': -206.0033721923828, 'logps/rejected': -530.2316284179688, 'logps/ref_chosen': -39.69450378417969, 'logps/ref_rejected': -90.86283874511719, 'logits/chosen': -2.2151618003845215, 'logits/rejected': -2.1000635623931885, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 
'epsilon_dpo/beta': 0.0042383247055113316, 'epsilon_dpo/loss_margin_mean': 273.0599365234375, 'epsilon_dpo/beta_margin_mean': 1.153497338294983, 'epsilon_dpo/beta_margin_std': 1.093888521194458, 'epsilon_dpo/beta_margin_grad_mean': -0.2859787940979004, 'epsilon_dpo/beta_margin_grad_std': 0.17702849209308624, 'kl/beta': 0.004269925411790609, 'kl/avg_steps': 0.75, 'epoch': 0.59} 59%|█████████████████████████████████████████████████████████████████▉ | 401/681 [28:18<7:12:09, 92.60s/it] 59%|██████████████████████████████████████████████████████████████████ | 402/681 [28:21<5:04:44, 65.53s/it] {'loss': 0.6286, 'grad_norm': 44.19596862792969, 'learning_rate': 2.1673238449588665e-07, 'rewards/chosen': -0.6281874179840088, 'rewards/rejected': -1.8827701807022095, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.2545827627182007, 'logps/chosen': -187.8063201904297, 'logps/rejected': -534.6738891601562, 'logps/ref_chosen': -38.76295852661133, 'logps/ref_rejected': -86.50106811523438, 'logits/chosen': -2.216102123260498, 'logits/rejected': -2.0876975059509277, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.004204125143587589, 'epsilon_dpo/loss_margin_mean': 299.1294860839844, 'epsilon_dpo/beta_margin_mean': 1.2545827627182007, 'epsilon_dpo/beta_margin_std': 0.8603823184967041, 'epsilon_dpo/beta_margin_grad_mean': -0.251901239156723, 'epsilon_dpo/beta_margin_grad_std': 0.14963571727275848, 'kl/beta': 0.004238139372318983, 'kl/avg_steps': 0.8125, 'epoch': 0.59} 59%|██████████████████████████████████████████████████████████████████ | 402/681 [28:21<5:04:44, 65.53s/it] 59%|██████████████████████████████████████████████████████████████████▎ | 403/681 [28:23<3:36:01, 46.62s/it] {'loss': 0.6468, 'grad_norm': 45.329769134521484, 'learning_rate': 2.154609112620295e-07, 'rewards/chosen': -0.5560154914855957, 'rewards/rejected': -1.8019505739212036, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.2459352016448975, 'logps/chosen': 
-162.46539306640625, 'logps/rejected': -515.2846069335938, 'logps/ref_chosen': -29.60453224182129, 'logps/ref_rejected': -82.97395324707031, 'logits/chosen': -2.254790782928467, 'logits/rejected': -2.070467472076416, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.004170241765677929, 'epsilon_dpo/loss_margin_mean': 299.4497985839844, 'epsilon_dpo/beta_margin_mean': 1.245935082435608, 'epsilon_dpo/beta_margin_std': 0.9069526791572571, 'epsilon_dpo/beta_margin_grad_mean': -0.2570376694202423, 'epsilon_dpo/beta_margin_grad_std': 0.15749378502368927, 'kl/beta': 0.004203982185572386, 'kl/avg_steps': 0.8125, 'epoch': 0.59} 59%|██████████████████████████████████████████████████████████████████▎ | 403/681 [28:23<3:36:01, 46.62s/it] 59%|██████████████████████████████████████████████████████████████████▍ | 404/681 [28:26<2:34:06, 33.38s/it] {'loss': 0.7084, 'grad_norm': 52.8735466003418, 'learning_rate': 2.1419034816528218e-07, 'rewards/chosen': -0.6246769428253174, 'rewards/rejected': -1.7911345958709717, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.1664576530456543, 'logps/chosen': -182.46783447265625, 'logps/rejected': -517.0897216796875, 'logps/ref_chosen': -32.369415283203125, 'logps/ref_rejected': -84.27439880371094, 'logits/chosen': -2.253624439239502, 'logits/rejected': -2.092195987701416, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.004143148194998503, 'epsilon_dpo/loss_margin_mean': 282.7168884277344, 'epsilon_dpo/beta_margin_mean': 1.1664576530456543, 'epsilon_dpo/beta_margin_std': 0.9855588674545288, 'epsilon_dpo/beta_margin_grad_mean': -0.2758185565471649, 'epsilon_dpo/beta_margin_grad_std': 0.16760212182998657, 'kl/beta': 0.004170100204646587, 'kl/avg_steps': 0.65625, 'epoch': 0.59} 59%|██████████████████████████████████████████████████████████████████▍ | 404/681 [28:26<2:34:06, 33.38s/it] 59%|██████████████████████████████████████████████████████████████████▌ | 405/681 
[28:28<1:51:01, 24.13s/it] {'loss': 0.6822, 'grad_norm': 45.45560836791992, 'learning_rate': 2.129207286861638e-07, 'rewards/chosen': -0.6669937372207642, 'rewards/rejected': -1.8423340320587158, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.1753404140472412, 'logps/chosen': -207.39706420898438, 'logps/rejected': -542.759033203125, 'logps/ref_chosen': -45.16187286376953, 'logps/ref_rejected': -93.87014770507812, 'logits/chosen': -2.2662670612335205, 'logits/rejected': -2.1083085536956787, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.004105777945369482, 'epsilon_dpo/loss_margin_mean': 286.6537170410156, 'epsilon_dpo/beta_margin_mean': 1.1753402948379517, 'epsilon_dpo/beta_margin_std': 0.9294463396072388, 'epsilon_dpo/beta_margin_grad_mean': -0.27076855301856995, 'epsilon_dpo/beta_margin_grad_std': 0.15409667789936066, 'kl/beta': 0.004142912104725838, 'kl/avg_steps': 0.90625, 'epoch': 0.59} 59%|██████████████████████████████████████████████████████████████████▌ | 405/681 [28:28<1:51:01, 24.13s/it] 60%|██████████████████████████████████████████████████████████████████▊ | 406/681 [28:31<1:21:08, 17.70s/it] {'loss': 0.612, 'grad_norm': 44.63689422607422, 'learning_rate': 2.1165208628032861e-07, 'rewards/chosen': -0.5877770185470581, 'rewards/rejected': -1.8649117946624756, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.2771347761154175, 'logps/chosen': -175.9896240234375, 'logps/rejected': -557.4638671875, 'logps/ref_chosen': -32.108238220214844, 'logps/ref_rejected': -99.3056640625, 'logits/chosen': -2.2001147270202637, 'logits/rejected': -2.1016151905059814, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.004071469884365797, 'epsilon_dpo/loss_margin_mean': 314.2767639160156, 'epsilon_dpo/beta_margin_mean': 1.2771347761154175, 'epsilon_dpo/beta_margin_std': 0.8445379137992859, 'epsilon_dpo/beta_margin_grad_mean': -0.24695971608161926, 'epsilon_dpo/beta_margin_grad_std': 
0.14529289305210114, 'kl/beta': 0.004105704370886087, 'kl/avg_steps': 0.84375, 'epoch': 0.6} 60%|██████████████████████████████████████████████████████████████████▊ | 406/681 [28:31<1:21:08, 17.70s/it] 60%|██████████████████████████████████████████████████████████████████▉ | 407/681 [28:33<1:00:03, 13.15s/it] {'loss': 0.7582, 'grad_norm': 52.973270416259766, 'learning_rate': 2.1038445437768375e-07, 'rewards/chosen': -0.5955917835235596, 'rewards/rejected': -1.6564050912857056, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.060813307762146, 'logps/chosen': -181.23577880859375, 'logps/rejected': -493.41815185546875, 'logps/ref_chosen': -34.63081359863281, 'logps/ref_rejected': -83.40669250488281, 'logits/chosen': -2.2982513904571533, 'logits/rejected': -2.1020822525024414, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0040450384840369225, 'epsilon_dpo/loss_margin_mean': 263.4065246582031, 'epsilon_dpo/beta_margin_mean': 1.060813307762146, 'epsilon_dpo/beta_margin_std': 0.9306771159172058, 'epsilon_dpo/beta_margin_grad_mean': -0.2901814579963684, 'epsilon_dpo/beta_margin_grad_std': 0.17480535805225372, 'kl/beta': 0.004071352072060108, 'kl/avg_steps': 0.65625, 'epoch': 0.6} 60%|██████████████████████████████████████████████████████████████████▉ | 407/681 [28:33<1:00:03, 13.15s/it] 60%|████████████████████████████████████████████████████████████████████▎ | 408/681 [28:36<45:21, 9.97s/it] {'loss': 0.682, 'grad_norm': 44.9481201171875, 'learning_rate': 2.0911786638150872e-07, 'rewards/chosen': -0.6118342280387878, 'rewards/rejected': -1.7125052213668823, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.1006710529327393, 'logps/chosen': -198.23095703125, 'logps/rejected': -523.3738403320312, 'logps/ref_chosen': -46.1392822265625, 'logps/ref_rejected': -96.3233642578125, 'logits/chosen': -2.361016273498535, 'logits/rejected': -2.139800548553467, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 
'epsilon_dpo/beta': 0.004012344870716333, 'epsilon_dpo/loss_margin_mean': 274.95880126953125, 'epsilon_dpo/beta_margin_mean': 1.1006709337234497, 'epsilon_dpo/beta_margin_std': 0.758716344833374, 'epsilon_dpo/beta_margin_grad_mean': -0.2729249894618988, 'epsilon_dpo/beta_margin_grad_std': 0.14225371181964874, 'kl/beta': 0.004044807981699705, 'kl/avg_steps': 0.8125, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████▎ | 408/681 [28:36<45:21, 9.97s/it] 60%|████████████████████████████████████████████████████████████████████▍ | 409/681 [28:39<35:10, 7.76s/it] {'loss': 0.7654, 'grad_norm': 45.330448150634766, 'learning_rate': 2.0785235566757517e-07, 'rewards/chosen': -0.6647806167602539, 'rewards/rejected': -1.6724402904510498, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.007659673690796, 'logps/chosen': -214.66848754882812, 'logps/rejected': -508.6387939453125, 'logps/ref_chosen': -48.41924285888672, 'logps/ref_rejected': -88.46084594726562, 'logits/chosen': -2.2953333854675293, 'logits/rejected': -2.1765618324279785, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.003982515539973974, 'epsilon_dpo/loss_margin_mean': 253.92869567871094, 'epsilon_dpo/beta_margin_mean': 1.007659673690796, 'epsilon_dpo/beta_margin_std': 0.8669053316116333, 'epsilon_dpo/beta_margin_grad_mean': -0.294739693403244, 'epsilon_dpo/beta_margin_grad_std': 0.16220302879810333, 'kl/beta': 0.0040122088976204395, 'kl/avg_steps': 0.75, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████▍ | 409/681 [28:39<35:10, 7.76s/it] 60%|████████████████████████████████████████████████████████████████████▋ | 410/681 [28:41<27:56, 6.19s/it] {'loss': 0.6897, 'grad_norm': 39.14712905883789, 'learning_rate': 2.065879555832674e-07, 'rewards/chosen': -0.5501143336296082, 'rewards/rejected': -1.676256775856018, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.1261425018310547, 'logps/chosen': 
-171.04891967773438, 'logps/rejected': -515.6138916015625, 'logps/ref_chosen': -32.20702362060547, 'logps/ref_rejected': -90.97166442871094, 'logits/chosen': -2.258359909057617, 'logits/rejected': -2.2356643676757812, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.003950380254536867, 'epsilon_dpo/loss_margin_mean': 285.80035400390625, 'epsilon_dpo/beta_margin_mean': 1.1261425018310547, 'epsilon_dpo/beta_margin_std': 0.8273832201957703, 'epsilon_dpo/beta_margin_grad_mean': -0.27121517062187195, 'epsilon_dpo/beta_margin_grad_std': 0.15460430085659027, 'kl/beta': 0.003982341382652521, 'kl/avg_steps': 0.8125, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████▋ | 410/681 [28:41<27:56, 6.19s/it] 60%|████████████████████████████████████████████████████████████████████▊ | 411/681 [28:44<22:43, 5.05s/it] {'loss': 0.6557, 'grad_norm': 38.4173698425293, 'learning_rate': 2.0532469944670343e-07, 'rewards/chosen': -0.5305934548377991, 'rewards/rejected': -1.6526423692703247, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.1220488548278809, 'logps/chosen': -163.12564086914062, 'logps/rejected': -509.96392822265625, 'logps/ref_chosen': -27.866039276123047, 'logps/ref_rejected': -87.75438690185547, 'logits/chosen': -2.2671849727630615, 'logits/rejected': -2.18176007270813, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0039160726591944695, 'epsilon_dpo/loss_margin_mean': 286.949951171875, 'epsilon_dpo/beta_margin_mean': 1.1220488548278809, 'epsilon_dpo/beta_margin_std': 0.7217124104499817, 'epsilon_dpo/beta_margin_grad_mean': -0.2674121856689453, 'epsilon_dpo/beta_margin_grad_std': 0.12682108581066132, 'kl/beta': 0.003950245678424835, 'kl/avg_steps': 0.875, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████▊ | 411/681 [28:44<22:43, 5.05s/it] 60%|████████████████████████████████████████████████████████████████████▉ | 412/681 [28:46<19:03, 
4.25s/it] {'loss': 0.6232, 'grad_norm': 39.3294563293457, 'learning_rate': 2.0406262054585738e-07, 'rewards/chosen': -0.4781460464000702, 'rewards/rejected': -1.7018874883651733, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.2237414121627808, 'logps/chosen': -154.23080444335938, 'logps/rejected': -545.803466796875, 'logps/ref_chosen': -31.307266235351562, 'logps/ref_rejected': -107.21038818359375, 'logits/chosen': -2.2864794731140137, 'logits/rejected': -2.262669563293457, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0038821042980998755, 'epsilon_dpo/loss_margin_mean': 315.66949462890625, 'epsilon_dpo/beta_margin_mean': 1.2237414121627808, 'epsilon_dpo/beta_margin_std': 0.7976933121681213, 'epsilon_dpo/beta_margin_grad_mean': -0.254239022731781, 'epsilon_dpo/beta_margin_grad_std': 0.13515068590641022, 'kl/beta': 0.00391598092392087, 'kl/avg_steps': 0.875, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████▉ | 412/681 [28:46<19:03, 4.25s/it] 61%|█████████████████████████████████████████████████████████████████████▏ | 413/681 [28:48<16:34, 3.71s/it] {'loss': 0.6576, 'grad_norm': 38.31694412231445, 'learning_rate': 2.0280175213768205e-07, 'rewards/chosen': -0.5653949975967407, 'rewards/rejected': -1.7393302917480469, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.1739351749420166, 'logps/chosen': -189.5083465576172, 'logps/rejected': -558.2633666992188, 'logps/ref_chosen': -43.01777648925781, 'logps/ref_rejected': -106.10704040527344, 'logits/chosen': -2.353483200073242, 'logits/rejected': -2.2761778831481934, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.003849643748253584, 'epsilon_dpo/loss_margin_mean': 305.6657409667969, 'epsilon_dpo/beta_margin_mean': 1.1739351749420166, 'epsilon_dpo/beta_margin_std': 0.7997394800186157, 'epsilon_dpo/beta_margin_grad_mean': -0.2595198452472687, 'epsilon_dpo/beta_margin_grad_std': 0.14788804948329926, 
'kl/beta': 0.003882013261318207, 'kl/avg_steps': 0.84375, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▏ | 413/681 [28:48<16:34, 3.71s/it] 61%|█████████████████████████████████████████████████████████████████████▎ | 414/681 [28:51<15:00, 3.37s/it] {'loss': 0.6491, 'grad_norm': 46.121341705322266, 'learning_rate': 2.0154212744723247e-07, 'rewards/chosen': -0.4871448874473572, 'rewards/rejected': -1.6691176891326904, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.181972861289978, 'logps/chosen': -161.10008239746094, 'logps/rejected': -531.1929931640625, 'logps/ref_chosen': -33.92742919921875, 'logps/ref_rejected': -93.77487182617188, 'logits/chosen': -2.2362513542175293, 'logits/rejected': -2.2514965534210205, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0038186372257769108, 'epsilon_dpo/loss_margin_mean': 310.2454833984375, 'epsilon_dpo/beta_margin_mean': 1.181972861289978, 'epsilon_dpo/beta_margin_std': 0.8064230680465698, 'epsilon_dpo/beta_margin_grad_mean': -0.2614062428474426, 'epsilon_dpo/beta_margin_grad_std': 0.14156121015548706, 'kl/beta': 0.003849532688036561, 'kl/avg_steps': 0.8125, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▎ | 414/681 [28:51<15:00, 3.37s/it] 61%|█████████████████████████████████████████████████████████████████████▍ | 415/681 [28:54<13:58, 3.15s/it] {'loss': 0.7054, 'grad_norm': 42.91494369506836, 'learning_rate': 2.002837796667909e-07, 'rewards/chosen': -0.5699288845062256, 'rewards/rejected': -1.6337485313415527, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.0638197660446167, 'logps/chosen': -195.28121948242188, 'logps/rejected': -539.98828125, 'logps/ref_chosen': -45.32706832885742, 'logps/ref_rejected': -108.54624938964844, 'logits/chosen': -2.451190948486328, 'logits/rejected': -2.4177732467651367, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 
0.0037890539970248938, 'epsilon_dpo/loss_margin_mean': 281.4879150390625, 'epsilon_dpo/beta_margin_mean': 1.0638197660446167, 'epsilon_dpo/beta_margin_std': 0.7809419631958008, 'epsilon_dpo/beta_margin_grad_mean': -0.281541645526886, 'epsilon_dpo/beta_margin_grad_std': 0.14272256195545197, 'kl/beta': 0.003818507306277752, 'kl/avg_steps': 0.78125, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▍ | 415/681 [28:54<13:58, 3.15s/it] 61%|█████████████████████████████████████████████████████████████████████▋ | 416/681 [28:56<13:07, 2.97s/it] {'loss': 0.6203, 'grad_norm': 41.69633483886719, 'learning_rate': 1.990267419549914e-07, 'rewards/chosen': -0.4858553111553192, 'rewards/rejected': -1.71781587600708, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.2319605350494385, 'logps/chosen': -167.67477416992188, 'logps/rejected': -555.6788330078125, 'logps/ref_chosen': -38.674949645996094, 'logps/ref_rejected': -98.17755889892578, 'logits/chosen': -2.335287094116211, 'logits/rejected': -2.2950291633605957, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.00375731335952878, 'epsilon_dpo/loss_margin_mean': 328.50146484375, 'epsilon_dpo/beta_margin_mean': 1.231960654258728, 'epsilon_dpo/beta_margin_std': 0.7866246700286865, 'epsilon_dpo/beta_margin_grad_mean': -0.2511499524116516, 'epsilon_dpo/beta_margin_grad_std': 0.1389520764350891, 'kl/beta': 0.003788906615227461, 'kl/avg_steps': 0.84375, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▋ | 416/681 [28:56<13:07, 2.97s/it] 61%|█████████████████████████████████████████████████████████████████████▊ | 417/681 [28:59<12:28, 2.84s/it] {'loss': 0.6858, 'grad_norm': 43.83847427368164, 'learning_rate': 1.9777104743594686e-07, 'rewards/chosen': -0.4534587562084198, 'rewards/rejected': -1.5868133306503296, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.1333545446395874, 'logps/chosen': -147.36915588378906, 
'logps/rejected': -500.62115478515625, 'logps/ref_chosen': -26.17800521850586, 'logps/ref_rejected': -74.53215026855469, 'logits/chosen': -2.3649911880493164, 'logits/rejected': -2.098527431488037, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0037270504981279373, 'epsilon_dpo/loss_margin_mean': 304.8978271484375, 'epsilon_dpo/beta_margin_mean': 1.1333545446395874, 'epsilon_dpo/beta_margin_std': 0.8189650774002075, 'epsilon_dpo/beta_margin_grad_mean': -0.26951104402542114, 'epsilon_dpo/beta_margin_grad_std': 0.15652604401111603, 'kl/beta': 0.003757205093279481, 'kl/avg_steps': 0.8125, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▊ | 417/681 [28:59<12:28, 2.84s/it] 61%|█████████████████████████████████████████████████████████████████████▉ | 418/681 [29:01<12:23, 2.83s/it] {'loss': 0.6258, 'grad_norm': 42.56073760986328, 'learning_rate': 1.965167291983757e-07, 'rewards/chosen': -0.47949230670928955, 'rewards/rejected': -1.6942154169082642, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.2147231101989746, 'logps/chosen': -180.94284057617188, 'logps/rejected': -571.2081298828125, 'logps/ref_chosen': -51.68669128417969, 'logps/ref_rejected': -112.65290832519531, 'logits/chosen': -2.3986239433288574, 'logits/rejected': -2.3206000328063965, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0036970123182982206, 'epsilon_dpo/loss_margin_mean': 329.29913330078125, 'epsilon_dpo/beta_margin_mean': 1.2147231101989746, 'epsilon_dpo/beta_margin_std': 0.7788877487182617, 'epsilon_dpo/beta_margin_grad_mean': -0.25392869114875793, 'epsilon_dpo/beta_margin_grad_std': 0.1371525526046753, 'kl/beta': 0.0037269238382577896, 'kl/avg_steps': 0.8125, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▉ | 418/681 [29:01<12:23, 2.83s/it] 62%|██████████████████████████████████████████████████████████████████████▏ | 419/681 [29:04<11:51, 
2.71s/it] {'loss': 0.6311, 'grad_norm': 44.2584228515625, 'learning_rate': 1.9526382029472988e-07, 'rewards/chosen': -0.4797900915145874, 'rewards/rejected': -1.664259910583496, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.1844696998596191, 'logps/chosen': -165.10519409179688, 'logps/rejected': -552.49658203125, 'logps/ref_chosen': -34.45082473754883, 'logps/ref_rejected': -98.03851318359375, 'logits/chosen': -2.3918709754943848, 'logits/rejected': -2.2970142364501953, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.003663749899715185, 'epsilon_dpo/loss_margin_mean': 323.8036804199219, 'epsilon_dpo/beta_margin_mean': 1.1844698190689087, 'epsilon_dpo/beta_margin_std': 0.7216576337814331, 'epsilon_dpo/beta_margin_grad_mean': -0.25385352969169617, 'epsilon_dpo/beta_margin_grad_std': 0.13596518337726593, 'kl/beta': 0.0036968865897506475, 'kl/avg_steps': 0.90625, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▏ | 419/681 [29:04<11:51, 2.71s/it] 62%|██████████████████████████████████████████████████████████████████████▎ | 420/681 [29:06<11:37, 2.67s/it] {'loss': 0.6883, 'grad_norm': 52.36576843261719, 'learning_rate': 1.9401235374032425e-07, 'rewards/chosen': -0.499639630317688, 'rewards/rejected': -1.5953891277313232, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.0957496166229248, 'logps/chosen': -174.64231872558594, 'logps/rejected': -515.6880493164062, 'logps/ref_chosen': -37.82621765136719, 'logps/ref_rejected': -76.69117736816406, 'logits/chosen': -2.424182891845703, 'logits/rejected': -2.128852605819702, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0036365704145282507, 'epsilon_dpo/loss_margin_mean': 302.1807556152344, 'epsilon_dpo/beta_margin_mean': 1.0957496166229248, 'epsilon_dpo/beta_margin_std': 0.781441330909729, 'epsilon_dpo/beta_margin_grad_mean': -0.27526789903640747, 'epsilon_dpo/beta_margin_grad_std': 0.14264513552188873, 
'kl/beta': 0.0036636844743043184, 'kl/avg_steps': 0.75, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▎ | 420/681 [29:07<11:37, 2.67s/it] 62%|██████████████████████████████████████████████████████████████████████▍ | 421/681 [29:09<11:20, 2.62s/it] {'loss': 0.6841, 'grad_norm': 42.273597717285156, 'learning_rate': 1.9276236251246653e-07, 'rewards/chosen': -0.49425339698791504, 'rewards/rejected': -1.5823595523834229, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.0881061553955078, 'logps/chosen': -170.40980529785156, 'logps/rejected': -533.166748046875, 'logps/ref_chosen': -33.550575256347656, 'logps/ref_rejected': -93.99878692626953, 'logits/chosen': -2.356675863265991, 'logits/rejected': -2.3096842765808105, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.003604953410103917, 'epsilon_dpo/loss_margin_mean': 302.30877685546875, 'epsilon_dpo/beta_margin_mean': 1.0881061553955078, 'epsilon_dpo/beta_margin_std': 0.751617968082428, 'epsilon_dpo/beta_margin_grad_mean': -0.27535223960876465, 'epsilon_dpo/beta_margin_grad_std': 0.13682620227336884, 'kl/beta': 0.003636411391198635, 'kl/avg_steps': 0.875, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▍ | 421/681 [29:09<11:20, 2.62s/it] 62%|██████████████████████████████████████████████████████████████████████▋ | 422/681 [29:11<11:10, 2.59s/it] {'loss': 0.7896, 'grad_norm': 46.02511215209961, 'learning_rate': 1.9151387954958792e-07, 'rewards/chosen': -0.6386697292327881, 'rewards/rejected': -1.6290339231491089, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9903641939163208, 'logps/chosen': -231.19760131835938, 'logps/rejected': -547.3240356445312, 'logps/ref_chosen': -53.665977478027344, 'logps/ref_rejected': -91.93289947509766, 'logits/chosen': -2.373016357421875, 'logits/rejected': -2.3339807987213135, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 
0.0035804433282464743, 'epsilon_dpo/loss_margin_mean': 277.8594970703125, 'epsilon_dpo/beta_margin_mean': 0.9903641939163208, 'epsilon_dpo/beta_margin_std': 0.89506596326828, 'epsilon_dpo/beta_margin_grad_mean': -0.29801541566848755, 'epsilon_dpo/beta_margin_grad_std': 0.17275013029575348, 'kl/beta': 0.0036048688925802708, 'kl/avg_steps': 0.6875, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▋ | 422/681 [29:12<11:10, 2.59s/it] 62%|██████████████████████████████████████████████████████████████████████▊ | 423/681 [29:14<10:50, 2.52s/it] {'loss': 0.6702, 'grad_norm': 43.717838287353516, 'learning_rate': 1.902669377503756e-07, 'rewards/chosen': -0.5520890951156616, 'rewards/rejected': -1.6618356704711914, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.1097465753555298, 'logps/chosen': -190.4375762939453, 'logps/rejected': -560.4072265625, 'logps/ref_chosen': -35.00245666503906, 'logps/ref_rejected': -91.82589721679688, 'logits/chosen': -2.299365758895874, 'logits/rejected': -2.2865123748779297, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.003549282206222415, 'epsilon_dpo/loss_margin_mean': 313.14617919921875, 'epsilon_dpo/beta_margin_mean': 1.1097465753555298, 'epsilon_dpo/beta_margin_std': 0.7358165383338928, 'epsilon_dpo/beta_margin_grad_mean': -0.2701137661933899, 'epsilon_dpo/beta_margin_grad_std': 0.13685345649719238, 'kl/beta': 0.003580254502594471, 'kl/avg_steps': 0.875, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▊ | 423/681 [29:14<10:50, 2.52s/it] 62%|██████████████████████████████████████████████████████████████████████▉ | 424/681 [29:16<10:52, 2.54s/it] {'loss': 0.7749, 'grad_norm': 42.60936737060547, 'learning_rate': 1.890215699729057e-07, 'rewards/chosen': -0.543306291103363, 'rewards/rejected': -1.4932714700698853, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9499651789665222, 'logps/chosen': -187.34298706054688, 
'logps/rejected': -494.61981201171875, 'logps/ref_chosen': -33.510772705078125, 'logps/ref_rejected': -70.15070343017578, 'logits/chosen': -2.3963804244995117, 'logits/rejected': -2.1616666316986084, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0035229322966188192, 'epsilon_dpo/loss_margin_mean': 270.63690185546875, 'epsilon_dpo/beta_margin_mean': 0.9499651789665222, 'epsilon_dpo/beta_margin_std': 0.7828956842422485, 'epsilon_dpo/beta_margin_grad_mean': -0.30060890316963196, 'epsilon_dpo/beta_margin_grad_std': 0.15238063037395477, 'kl/beta': 0.0035491990856826305, 'kl/avg_steps': 0.75, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▉ | 424/681 [29:16<10:52, 2.54s/it] 62%|███████████████████████████████████████████████████████████████████████▏ | 425/681 [29:19<10:43, 2.51s/it] {'loss': 0.7154, 'grad_norm': 47.06675338745117, 'learning_rate': 1.8777780903377732e-07, 'rewards/chosen': -0.47508615255355835, 'rewards/rejected': -1.5563328266143799, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.0812466144561768, 'logps/chosen': -167.17559814453125, 'logps/rejected': -549.517333984375, 'logps/ref_chosen': -31.619510650634766, 'logps/ref_rejected': -103.71711730957031, 'logits/chosen': -2.3238930702209473, 'logits/rejected': -2.3823156356811523, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0034945050720125437, 'epsilon_dpo/loss_margin_mean': 310.2441101074219, 'epsilon_dpo/beta_margin_mean': 1.0812467336654663, 'epsilon_dpo/beta_margin_std': 0.8364912867546082, 'epsilon_dpo/beta_margin_grad_mean': -0.27929848432540894, 'epsilon_dpo/beta_margin_grad_std': 0.15515829622745514, 'kl/beta': 0.0035227781627327204, 'kl/avg_steps': 0.8125, 'epoch': 0.62} 62%|███████████████████████████████████████████████████████████████████████▏ | 425/681 [29:19<10:43, 2.51s/it] 63%|███████████████████████████████████████████████████████████████████████▎ | 426/681 
[29:21<10:34, 2.49s/it] {'loss': 0.7381, 'grad_norm': 66.66638946533203, 'learning_rate': 1.8653568770724803e-07, 'rewards/chosen': -0.48144960403442383, 'rewards/rejected': -1.4888947010040283, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.007444977760315, 'logps/chosen': -187.71844482421875, 'logps/rejected': -515.9197387695312, 'logps/ref_chosen': -49.42609405517578, 'logps/ref_rejected': -86.20869445800781, 'logits/chosen': -2.42264986038208, 'logits/rejected': -2.2819976806640625, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.003468525130301714, 'epsilon_dpo/loss_margin_mean': 291.418701171875, 'epsilon_dpo/beta_margin_mean': 1.007444977760315, 'epsilon_dpo/beta_margin_std': 0.7751356959342957, 'epsilon_dpo/beta_margin_grad_mean': -0.28968408703804016, 'epsilon_dpo/beta_margin_grad_std': 0.1473844200372696, 'kl/beta': 0.00349438632838428, 'kl/avg_steps': 0.75, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▎ | 426/681 [29:21<10:34, 2.49s/it] 63%|███████████████████████████████████████████████████████████████████████▍ | 427/681 [29:24<10:30, 2.48s/it] {'loss': 0.8242, 'grad_norm': 60.45020294189453, 'learning_rate': 1.8529523872436977e-07, 'rewards/chosen': -0.5330853462219238, 'rewards/rejected': -1.3945105075836182, 'rewards/accuracies': 0.875, 'rewards/margins': 0.8614251613616943, 'logps/chosen': -194.640869140625, 'logps/rejected': -489.7284240722656, 'logps/ref_chosen': -40.53302001953125, 'logps/ref_rejected': -84.44095611572266, 'logits/chosen': -2.4879813194274902, 'logits/rejected': -2.38423490524292, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0034443254116922617, 'epsilon_dpo/loss_margin_mean': 251.1796112060547, 'epsilon_dpo/beta_margin_mean': 0.8614251017570496, 'epsilon_dpo/beta_margin_std': 0.760391354560852, 'epsilon_dpo/beta_margin_grad_mean': -0.3163089156150818, 'epsilon_dpo/beta_margin_grad_std': 0.1538737267255783, 
'kl/beta': 0.0034683735575526953, 'kl/avg_steps': 0.703125, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▍ | 427/681 [29:24<10:30, 2.48s/it] 63%|███████████████████████████████████████████████████████████████████████▋ | 428/681 [29:26<10:40, 2.53s/it] {'loss': 0.7038, 'grad_norm': 48.36793899536133, 'learning_rate': 1.8405649477212697e-07, 'rewards/chosen': -0.5489051938056946, 'rewards/rejected': -1.6179664134979248, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.069061040878296, 'logps/chosen': -196.82138061523438, 'logps/rejected': -585.9830322265625, 'logps/ref_chosen': -36.43036651611328, 'logps/ref_rejected': -111.91488647460938, 'logits/chosen': -2.412050724029541, 'logits/rejected': -2.416749954223633, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0034143617376685143, 'epsilon_dpo/loss_margin_mean': 313.6771545410156, 'epsilon_dpo/beta_margin_mean': 1.0690611600875854, 'epsilon_dpo/beta_margin_std': 0.7979580163955688, 'epsilon_dpo/beta_margin_grad_mean': -0.2803729176521301, 'epsilon_dpo/beta_margin_grad_std': 0.14126239717006683, 'kl/beta': 0.0034441568423062563, 'kl/avg_steps': 0.875, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▋ | 428/681 [29:26<10:40, 2.53s/it] 63%|███████████████████████████████████████████████████████████████████████▊ | 429/681 [29:29<10:43, 2.55s/it] {'loss': 0.6748, 'grad_norm': 45.01988220214844, 'learning_rate': 1.828194884925749e-07, 'rewards/chosen': -0.5519835352897644, 'rewards/rejected': -1.6421988010406494, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.0902152061462402, 'logps/chosen': -211.84674072265625, 'logps/rejected': -583.9923095703125, 'logps/ref_chosen': -49.30812072753906, 'logps/ref_rejected': -98.94145965576172, 'logits/chosen': -2.5120902061462402, 'logits/rejected': -2.3635692596435547, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 
0.0033868795726448298, 'epsilon_dpo/loss_margin_mean': 322.5122375488281, 'epsilon_dpo/beta_margin_mean': 1.0902152061462402, 'epsilon_dpo/beta_margin_std': 0.7191058397293091, 'epsilon_dpo/beta_margin_grad_mean': -0.2734263241291046, 'epsilon_dpo/beta_margin_grad_std': 0.13189740478992462, 'kl/beta': 0.003414281876757741, 'kl/avg_steps': 0.8125, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▊ | 429/681 [29:29<10:43, 2.55s/it] 63%|███████████████████████████████████████████████████████████████████████▉ | 430/681 [29:32<11:02, 2.64s/it] {'loss': 0.7561, 'grad_norm': 43.82745361328125, 'learning_rate': 1.8158425248197928e-07, 'rewards/chosen': -0.5032367706298828, 'rewards/rejected': -1.4752440452575684, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9720072746276855, 'logps/chosen': -194.5563507080078, 'logps/rejected': -549.539306640625, 'logps/ref_chosen': -45.3841438293457, 'logps/ref_rejected': -110.27545928955078, 'logits/chosen': -2.4356656074523926, 'logits/rejected': -2.5032784938812256, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0033638167660683393, 'epsilon_dpo/loss_margin_mean': 290.09161376953125, 'epsilon_dpo/beta_margin_mean': 0.9720072150230408, 'epsilon_dpo/beta_margin_std': 0.7672375440597534, 'epsilon_dpo/beta_margin_grad_mean': -0.2967393100261688, 'epsilon_dpo/beta_margin_grad_std': 0.14850302040576935, 'kl/beta': 0.003386764321476221, 'kl/avg_steps': 0.6875, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▉ | 430/681 [29:32<11:02, 2.64s/it] 63%|████████████████████████████████████████████████████████████████████████▏ | 431/681 [29:34<10:43, 2.57s/it] {'loss': 0.6656, 'grad_norm': 37.97023391723633, 'learning_rate': 1.8035081928995788e-07, 'rewards/chosen': -0.4194945991039276, 'rewards/rejected': -1.5508911609649658, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.1313966512680054, 'logps/chosen': 
-159.66354370117188, 'logps/rejected': -563.6588134765625, 'logps/ref_chosen': -34.30770492553711, 'logps/ref_rejected': -98.43756866455078, 'logits/chosen': -2.4460361003875732, 'logits/rejected': -2.4488284587860107, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.00333664333447814, 'epsilon_dpo/loss_margin_mean': 339.86541748046875, 'epsilon_dpo/beta_margin_mean': 1.1313966512680054, 'epsilon_dpo/beta_margin_std': 0.7779242396354675, 'epsilon_dpo/beta_margin_grad_mean': -0.2679252624511719, 'epsilon_dpo/beta_margin_grad_std': 0.13725493848323822, 'kl/beta': 0.003363639349117875, 'kl/avg_steps': 0.8125, 'epoch': 0.63} 63%|████████████████████████████████████████████████████████████████████████▏ | 431/681 [29:34<10:43, 2.57s/it] 63%|████████████████████████████████████████████████████████████████████████▎ | 432/681 [29:37<10:46, 2.60s/it] {'loss': 0.7004, 'grad_norm': 44.22251892089844, 'learning_rate': 1.791192214186223e-07, 'rewards/chosen': -0.4821677803993225, 'rewards/rejected': -1.5151182413101196, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.0329504013061523, 'logps/chosen': -184.24755859375, 'logps/rejected': -563.4796142578125, 'logps/ref_chosen': -38.85405349731445, 'logps/ref_rejected': -105.27049255371094, 'logits/chosen': -2.466712236404419, 'logits/rejected': -2.4444732666015625, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0033087090123444796, 'epsilon_dpo/loss_margin_mean': 312.8155822753906, 'epsilon_dpo/beta_margin_mean': 1.032950520515442, 'epsilon_dpo/beta_margin_std': 0.699227511882782, 'epsilon_dpo/beta_margin_grad_mean': -0.28203877806663513, 'epsilon_dpo/beta_margin_grad_std': 0.12971222400665283, 'kl/beta': 0.003336530178785324, 'kl/avg_steps': 0.84375, 'epoch': 0.63} 63%|████████████████████████████████████████████████████████████████████████▎ | 432/681 [29:37<10:46, 2.60s/it] 64%|████████████████████████████████████████████████████████████████████████▍ | 
433/681 [29:39<10:34, 2.56s/it] {'loss': 0.6991, 'grad_norm': 45.43602752685547, 'learning_rate': 1.7788949132172193e-07, 'rewards/chosen': -0.5854209661483765, 'rewards/rejected': -1.6637802124023438, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.0783591270446777, 'logps/chosen': -213.7806396484375, 'logps/rejected': -610.318359375, 'logps/ref_chosen': -35.81452178955078, 'logps/ref_rejected': -103.11997985839844, 'logits/chosen': -2.3216044902801514, 'logits/rejected': -2.3884963989257812, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0032830932177603245, 'epsilon_dpo/loss_margin_mean': 329.2322692871094, 'epsilon_dpo/beta_margin_mean': 1.0783592462539673, 'epsilon_dpo/beta_margin_std': 0.7789259552955627, 'epsilon_dpo/beta_margin_grad_mean': -0.2790597379207611, 'epsilon_dpo/beta_margin_grad_std': 0.14551861584186554, 'kl/beta': 0.0033086135517805815, 'kl/avg_steps': 0.78125, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▍ | 433/681 [29:39<10:34, 2.56s/it] 64%|████████████████████████████████████████████████████████████████████████▋ | 434/681 [29:42<10:25, 2.53s/it] {'loss': 0.7919, 'grad_norm': 43.43961715698242, 'learning_rate': 1.7666166140378853e-07, 'rewards/chosen': -0.6385765075683594, 'rewards/rejected': -1.5220519304275513, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8834754228591919, 'logps/chosen': -245.33560180664062, 'logps/rejected': -550.40966796875, 'logps/ref_chosen': -49.93016052246094, 'logps/ref_rejected': -83.06277465820312, 'logits/chosen': -2.4602980613708496, 'logits/rejected': -2.3824265003204346, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.00325866905041039, 'epsilon_dpo/loss_margin_mean': 271.9414367675781, 'epsilon_dpo/beta_margin_mean': 0.8834754228591919, 'epsilon_dpo/beta_margin_std': 0.7322390079498291, 'epsilon_dpo/beta_margin_grad_mean': -0.31264179944992065, 'epsilon_dpo/beta_margin_grad_std': 
0.1338089406490326, 'kl/beta': 0.0032829653937369585, 'kl/avg_steps': 0.75, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▋ | 434/681 [29:42<10:25, 2.53s/it] 64%|████████████████████████████████████████████████████████████████████████▊ | 435/681 [29:44<10:07, 2.47s/it] {'loss': 0.6928, 'grad_norm': 42.341400146484375, 'learning_rate': 1.7543576401928218e-07, 'rewards/chosen': -0.4249953627586365, 'rewards/rejected': -1.4922096729278564, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.0672142505645752, 'logps/chosen': -165.0016326904297, 'logps/rejected': -555.97021484375, 'logps/ref_chosen': -33.67361831665039, 'logps/ref_rejected': -93.73330688476562, 'logits/chosen': -2.4340808391571045, 'logits/rejected': -2.4357552528381348, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.003229319117963314, 'epsilon_dpo/loss_margin_mean': 330.9088439941406, 'epsilon_dpo/beta_margin_mean': 1.0672143697738647, 'epsilon_dpo/beta_margin_std': 0.7533097863197327, 'epsilon_dpo/beta_margin_grad_mean': -0.2789570689201355, 'epsilon_dpo/beta_margin_grad_std': 0.13298888504505157, 'kl/beta': 0.0032585265580564737, 'kl/avg_steps': 0.90625, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▊ | 435/681 [29:44<10:07, 2.47s/it] 64%|████████████████████████████████████████████████████████████████████████▉ | 436/681 [29:47<10:17, 2.52s/it] {'loss': 0.7008, 'grad_norm': 51.16184616088867, 'learning_rate': 1.742118314717391e-07, 'rewards/chosen': -0.4683375954627991, 'rewards/rejected': -1.468498945236206, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.0001612901687622, 'logps/chosen': -188.12747192382812, 'logps/rejected': -547.4158325195312, 'logps/ref_chosen': -42.048667907714844, 'logps/ref_rejected': -88.41279602050781, 'logits/chosen': -2.5231666564941406, 'logits/rejected': -2.4640743732452393, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 
'epsilon_dpo/beta': 0.0032013256568461657, 'epsilon_dpo/loss_margin_mean': 312.92425537109375, 'epsilon_dpo/beta_margin_mean': 1.0001612901687622, 'epsilon_dpo/beta_margin_std': 0.6187871694564819, 'epsilon_dpo/beta_margin_grad_mean': -0.2841736078262329, 'epsilon_dpo/beta_margin_grad_std': 0.11989546567201614, 'kl/beta': 0.003229261375963688, 'kl/avg_steps': 0.875, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▉ | 436/681 [29:47<10:17, 2.52s/it] 64%|█████████████████████████████████████████████████████████████████████████▏ | 437/681 [29:49<10:15, 2.52s/it] {'loss': 0.7736, 'grad_norm': 45.89207458496094, 'learning_rate': 1.7298989601292036e-07, 'rewards/chosen': -0.5791641473770142, 'rewards/rejected': -1.4944398403167725, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9152756929397583, 'logps/chosen': -226.48953247070312, 'logps/rejected': -557.265625, 'logps/ref_chosen': -44.77692413330078, 'logps/ref_rejected': -86.48928833007812, 'logits/chosen': -2.4697906970977783, 'logits/rejected': -2.390430450439453, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0031775590032339096, 'epsilon_dpo/loss_margin_mean': 289.0637512207031, 'epsilon_dpo/beta_margin_mean': 0.9152756929397583, 'epsilon_dpo/beta_margin_std': 0.6979488134384155, 'epsilon_dpo/beta_margin_grad_mean': -0.30295178294181824, 'epsilon_dpo/beta_margin_grad_std': 0.14245736598968506, 'kl/beta': 0.0032012504525482655, 'kl/avg_steps': 0.75, 'epoch': 0.64} 64%|█████████████████████████████████████████████████████████████████████████▏ | 437/681 [29:49<10:15, 2.52s/it] 64%|█████████████████████████████████████████████████████████████████████████▎ | 438/681 [29:52<09:59, 2.47s/it] {'loss': 0.6943, 'grad_norm': 40.396175384521484, 'learning_rate': 1.7176998984196144e-07, 'rewards/chosen': -0.41470038890838623, 'rewards/rejected': -1.4498802423477173, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.035179853439331, 
'logps/chosen': -164.937744140625, 'logps/rejected': -549.906005859375, 'logps/ref_chosen': -33.662109375, 'logps/ref_rejected': -89.52166748046875, 'logits/chosen': -2.473397731781006, 'logits/rejected': -2.3974690437316895, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0031519182957708836, 'epsilon_dpo/loss_margin_mean': 329.1087341308594, 'epsilon_dpo/beta_margin_mean': 1.035179853439331, 'epsilon_dpo/beta_margin_std': 0.6745886206626892, 'epsilon_dpo/beta_margin_grad_mean': -0.2808303236961365, 'epsilon_dpo/beta_margin_grad_std': 0.12779134511947632, 'kl/beta': 0.0031774197705090046, 'kl/avg_steps': 0.8125, 'epoch': 0.64} 64%|█████████████████████████████████████████████████████████████████████████▎ | 438/681 [29:52<09:59, 2.47s/it] 64%|█████████████████████████████████████████████████████████████████████████▍ | 439/681 [29:54<10:06, 2.50s/it] {'loss': 0.7563, 'grad_norm': 45.90419006347656, 'learning_rate': 1.7055214510452458e-07, 'rewards/chosen': -0.5056626200675964, 'rewards/rejected': -1.4195419549942017, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.91387939453125, 'logps/chosen': -196.35198974609375, 'logps/rejected': -541.820556640625, 'logps/ref_chosen': -34.986392974853516, 'logps/ref_rejected': -87.497314453125, 'logits/chosen': -2.472698926925659, 'logits/rejected': -2.5124192237854004, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.003126515308395028, 'epsilon_dpo/loss_margin_mean': 292.9576721191406, 'epsilon_dpo/beta_margin_mean': 0.91387939453125, 'epsilon_dpo/beta_margin_std': 0.6512243747711182, 'epsilon_dpo/beta_margin_grad_mean': -0.3024454414844513, 'epsilon_dpo/beta_margin_grad_std': 0.1240093857049942, 'kl/beta': 0.0031518111936748028, 'kl/avg_steps': 0.8125, 'epoch': 0.64} 64%|█████████████████████████████████████████████████████████████████████████▍ | 439/681 [29:54<10:06, 2.50s/it] 
65%|█████████████████████████████████████████████████████████████████████████▋ | 440/681 [29:57<10:18, 2.56s/it] {'loss': 0.761, 'grad_norm': 51.92120361328125, 'learning_rate': 1.6933639389195134e-07, 'rewards/chosen': -0.5756343603134155, 'rewards/rejected': -1.509265661239624, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9336312413215637, 'logps/chosen': -238.4813232421875, 'logps/rejected': -591.0408935546875, 'logps/ref_chosen': -53.56586837768555, 'logps/ref_rejected': -104.3643569946289, 'logits/chosen': -2.5072128772735596, 'logits/rejected': -2.5406949520111084, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.003103271359577775, 'epsilon_dpo/loss_margin_mean': 301.76104736328125, 'epsilon_dpo/beta_margin_mean': 0.9336312413215637, 'epsilon_dpo/beta_margin_std': 0.7152769565582275, 'epsilon_dpo/beta_margin_grad_mean': -0.3023149073123932, 'epsilon_dpo/beta_margin_grad_std': 0.134578138589859, 'kl/beta': 0.003126409137621522, 'kl/avg_steps': 0.75, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▋ | 440/681 [29:57<10:18, 2.56s/it] 65%|█████████████████████████████████████████████████████████████████████████▊ | 441/681 [30:00<10:35, 2.65s/it] {'loss': 0.7438, 'grad_norm': 40.502708435058594, 'learning_rate': 1.681227682404166e-07, 'rewards/chosen': -0.49797898530960083, 'rewards/rejected': -1.4459290504455566, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.947950005531311, 'logps/chosen': -200.94158935546875, 'logps/rejected': -573.1942138671875, 'logps/ref_chosen': -39.209449768066406, 'logps/ref_rejected': -102.78851318359375, 'logits/chosen': -2.5055270195007324, 'logits/rejected': -2.469489812850952, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.0030753209721297026, 'epsilon_dpo/loss_margin_mean': 308.67352294921875, 'epsilon_dpo/beta_margin_mean': 0.9479500651359558, 'epsilon_dpo/beta_margin_std': 0.6735605597496033, 
'epsilon_dpo/beta_margin_grad_mean': -0.2959996163845062, 'epsilon_dpo/beta_margin_grad_std': 0.12961319088935852, 'kl/beta': 0.003103135619312525, 'kl/avg_steps': 0.90625, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▊ | 441/681 [30:00<10:35, 2.65s/it] 65%|█████████████████████████████████████████████████████████████████████████▉ | 442/681 [30:02<10:11, 2.56s/it] {'loss': 0.675, 'grad_norm': 37.710105895996094, 'learning_rate': 1.669113001300851e-07, 'rewards/chosen': -0.39891505241394043, 'rewards/rejected': -1.444936990737915, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.046021819114685, 'logps/chosen': -159.370849609375, 'logps/rejected': -557.7063598632812, 'logps/ref_chosen': -29.0069580078125, 'logps/ref_rejected': -83.71453857421875, 'logits/chosen': -2.491802930831909, 'logits/rejected': -2.457737922668457, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.003050584578886628, 'epsilon_dpo/loss_margin_mean': 343.6279602050781, 'epsilon_dpo/beta_margin_mean': 1.046021819114685, 'epsilon_dpo/beta_margin_std': 0.6180484294891357, 'epsilon_dpo/beta_margin_grad_mean': -0.2755725681781769, 'epsilon_dpo/beta_margin_grad_std': 0.11769750714302063, 'kl/beta': 0.0030752660240978003, 'kl/avg_steps': 0.8125, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▉ | 442/681 [30:02<10:11, 2.56s/it] 65%|██████████████████████████████████████████████████████████████████████████▏ | 443/681 [30:05<09:59, 2.52s/it] {'loss': 0.841, 'grad_norm': 46.755645751953125, 'learning_rate': 1.6570202148426815e-07, 'rewards/chosen': -0.549048900604248, 'rewards/rejected': -1.3913331031799316, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8422842025756836, 'logps/chosen': -233.57965087890625, 'logps/rejected': -553.4365234375, 'logps/ref_chosen': -52.68767166137695, 'logps/ref_rejected': -93.39274597167969, 'logits/chosen': -2.5168378353118896, 
'logits/rejected': -2.499556541442871, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0030279052443802357, 'epsilon_dpo/loss_margin_mean': 279.1518249511719, 'epsilon_dpo/beta_margin_mean': 0.8422841429710388, 'epsilon_dpo/beta_margin_std': 0.7962964177131653, 'epsilon_dpo/beta_margin_grad_mean': -0.3231920897960663, 'epsilon_dpo/beta_margin_grad_std': 0.15162836015224457, 'kl/beta': 0.003050480969250202, 'kl/avg_steps': 0.75, 'epoch': 0.65} 65%|██████████████████████████████████████████████████████████████████████████▏ | 443/681 [30:05<09:59, 2.52s/it] 65%|██████████████████████████████████████████████████████████████████████████▎ | 444/681 [30:07<09:58, 2.52s/it] {'loss': 0.6502, 'grad_norm': 38.41836929321289, 'learning_rate': 1.6449496416858282e-07, 'rewards/chosen': -0.4290057420730591, 'rewards/rejected': -1.5417864322662354, 'rewards/accuracies': 0.984375, 'rewards/margins': 1.1127806901931763, 'logps/chosen': -178.10812377929688, 'logps/rejected': -618.6232299804688, 'logps/ref_chosen': -35.20741271972656, 'logps/ref_rejected': -104.47367858886719, 'logits/chosen': -2.462678909301758, 'logits/rejected': -2.5427470207214355, 'kl/p_epsilon_steps': 0.96875, 'kl/n_epsilon_steps': 0.03125, 'epsilon_dpo/beta': 0.0029996871016919613, 'epsilon_dpo/loss_margin_mean': 371.2488708496094, 'epsilon_dpo/beta_margin_mean': 1.1127806901931763, 'epsilon_dpo/beta_margin_std': 0.6654544472694397, 'epsilon_dpo/beta_margin_grad_mean': -0.26602280139923096, 'epsilon_dpo/beta_margin_grad_std': 0.12350434064865112, 'kl/beta': 0.003027772530913353, 'kl/avg_steps': 0.9375, 'epoch': 0.65} 65%|██████████████████████████████████████████████████████████████████████████▎ | 444/681 [30:07<09:58, 2.52s/it] 65%|██████████████████████████████████████████████████████████████████████████▍ | 445/681 [30:10<09:59, 2.54s/it] {'loss': 0.8334, 'grad_norm': 48.016639709472656, 'learning_rate': 1.6329015999011182e-07, 'rewards/chosen': -0.5069864392280579, 
'rewards/rejected': -1.3884000778198242, 'rewards/accuracies': 0.875, 'rewards/margins': 0.8814135789871216, 'logps/chosen': -216.25303649902344, 'logps/rejected': -565.91015625, 'logps/ref_chosen': -46.78347396850586, 'logps/ref_rejected': -99.2047119140625, 'logits/chosen': -2.5741279125213623, 'logits/rejected': -2.5458946228027344, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0029783889185637236, 'epsilon_dpo/loss_margin_mean': 297.23583984375, 'epsilon_dpo/beta_margin_mean': 0.8814136385917664, 'epsilon_dpo/beta_margin_std': 0.8328824639320374, 'epsilon_dpo/beta_margin_grad_mean': -0.31596365571022034, 'epsilon_dpo/beta_margin_grad_std': 0.1648440957069397, 'kl/beta': 0.002999651012942195, 'kl/avg_steps': 0.71875, 'epoch': 0.65} 65%|██████████████████████████████████████████████████████████████████████████▍ | 445/681 [30:10<09:59, 2.54s/it] 65%|██████████████████████████████████████████████████████████████████████████▋ | 446/681 [30:12<10:00, 2.55s/it] {'loss': 0.6534, 'grad_norm': 50.814979553222656, 'learning_rate': 1.6208764069656578e-07, 'rewards/chosen': -0.37178486585617065, 'rewards/rejected': -1.4675066471099854, 'rewards/accuracies': 0.96875, 'rewards/margins': 1.0957218408584595, 'logps/chosen': -160.44252014160156, 'logps/rejected': -607.453857421875, 'logps/ref_chosen': -34.56015396118164, 'logps/ref_rejected': -109.97004699707031, 'logits/chosen': -2.528761863708496, 'logits/rejected': -2.6162219047546387, 'kl/p_epsilon_steps': 0.96875, 'kl/n_epsilon_steps': 0.03125, 'epsilon_dpo/beta': 0.0029506187420338392, 'epsilon_dpo/loss_margin_mean': 371.6014099121094, 'epsilon_dpo/beta_margin_mean': 1.0957218408584595, 'epsilon_dpo/beta_margin_std': 0.6562060713768005, 'epsilon_dpo/beta_margin_grad_mean': -0.2692401707172394, 'epsilon_dpo/beta_margin_grad_std': 0.11489293724298477, 'kl/beta': 0.002978244796395302, 'kl/avg_steps': 0.9375, 'epoch': 0.65} 
65%|██████████████████████████████████████████████████████████████████████████▋ | 446/681 [30:12<10:00, 2.55s/it] 66%|██████████████████████████████████████████████████████████████████████████▊ | 447/681 [30:15<09:45, 2.50s/it] {'loss': 0.7578, 'grad_norm': 40.274986267089844, 'learning_rate': 1.608874379754465e-07, 'rewards/chosen': -0.471713662147522, 'rewards/rejected': -1.4239914417266846, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9522777795791626, 'logps/chosen': -199.97299194335938, 'logps/rejected': -592.6107788085938, 'logps/ref_chosen': -39.49730682373047, 'logps/ref_rejected': -105.97085571289062, 'logits/chosen': -2.502809762954712, 'logits/rejected': -2.5910682678222656, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0029278243891894817, 'epsilon_dpo/loss_margin_mean': 326.1642150878906, 'epsilon_dpo/beta_margin_mean': 0.9522778391838074, 'epsilon_dpo/beta_margin_std': 0.7474956512451172, 'epsilon_dpo/beta_margin_grad_mean': -0.2999436855316162, 'epsilon_dpo/beta_margin_grad_std': 0.13803143799304962, 'kl/beta': 0.00295058311894536, 'kl/avg_steps': 0.78125, 'epoch': 0.66} 66%|██████████████████████████████████████████████████████████████████████████▊ | 447/681 [30:15<09:45, 2.50s/it] 66%|██████████████████████████████████████████████████████████████████████████▉ | 448/681 [30:17<09:53, 2.55s/it] {'loss': 0.7027, 'grad_norm': 41.37987518310547, 'learning_rate': 1.5968958345321177e-07, 'rewards/chosen': -0.5373802781105042, 'rewards/rejected': -1.5529210567474365, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.0155407190322876, 'logps/chosen': -227.657958984375, 'logps/rejected': -644.6298217773438, 'logps/ref_chosen': -42.827239990234375, 'logps/ref_rejected': -109.36424255371094, 'logits/chosen': -2.5615386962890625, 'logits/rejected': -2.621204376220703, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.0029014681931585073, 'epsilon_dpo/loss_margin_mean': 
350.43487548828125, 'epsilon_dpo/beta_margin_mean': 1.0155407190322876, 'epsilon_dpo/beta_margin_std': 0.674639880657196, 'epsilon_dpo/beta_margin_grad_mean': -0.2846612334251404, 'epsilon_dpo/beta_margin_grad_std': 0.12346479296684265, 'kl/beta': 0.0029277103021740913, 'kl/avg_steps': 0.90625, 'epoch': 0.66} 66%|██████████████████████████████████████████████████████████████████████████▉ | 448/681 [30:17<09:53, 2.55s/it] 66%|███████████████████████████████████████████████████████████████████████████▏ | 449/681 [30:20<09:36, 2.49s/it] {'loss': 0.7298, 'grad_norm': 43.268863677978516, 'learning_rate': 1.584941086944423e-07, 'rewards/chosen': -0.4556068181991577, 'rewards/rejected': -1.4396498203277588, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.9840430021286011, 'logps/chosen': -194.70394897460938, 'logps/rejected': -596.5521240234375, 'logps/ref_chosen': -36.90496826171875, 'logps/ref_rejected': -95.95344543457031, 'logits/chosen': -2.5850870609283447, 'logits/rejected': -2.595974922180176, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.002878130180761218, 'epsilon_dpo/loss_margin_mean': 342.7997131347656, 'epsilon_dpo/beta_margin_mean': 0.9840430617332458, 'epsilon_dpo/beta_margin_std': 0.6940675973892212, 'epsilon_dpo/beta_margin_grad_mean': -0.28992605209350586, 'epsilon_dpo/beta_margin_grad_std': 0.13492745161056519, 'kl/beta': 0.0029014162719249725, 'kl/avg_steps': 0.8125, 'epoch': 0.66} 66%|███████████████████████████████████████████████████████████████████████████▏ | 449/681 [30:20<09:36, 2.49s/it] 66%|███████████████████████████████████████████████████████████████████████████▎ | 450/681 [30:22<09:34, 2.49s/it] {'loss': 0.6252, 'grad_norm': 37.674678802490234, 'learning_rate': 1.573010452010098e-07, 'rewards/chosen': -0.40372031927108765, 'rewards/rejected': -1.5394420623779297, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.1357218027114868, 'logps/chosen': -175.88995361328125, 'logps/rejected': 
-648.0965576171875, 'logps/ref_chosen': -34.65415573120117, 'logps/ref_rejected': -108.24179077148438, 'logits/chosen': -2.546546459197998, 'logits/rejected': -2.698519229888916, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.002853134647011757, 'epsilon_dpo/loss_margin_mean': 398.61895751953125, 'epsilon_dpo/beta_margin_mean': 1.1357218027114868, 'epsilon_dpo/beta_margin_std': 0.607581377029419, 'epsilon_dpo/beta_margin_grad_mean': -0.2585708200931549, 'epsilon_dpo/beta_margin_grad_std': 0.11334740370512009, 'kl/beta': 0.00287803215906024, 'kl/avg_steps': 0.875, 'epoch': 0.66} 66%|███████████████████████████████████████████████████████████████████████████▎ | 450/681 [30:22<09:34, 2.49s/it] 66%|███████████████████████████████████████████████████████████████████████████▍ | 451/681 [30:25<09:33, 2.49s/it] {'loss': 0.8563, 'grad_norm': 46.581146240234375, 'learning_rate': 1.5611042441124687e-07, 'rewards/chosen': -0.5459603071212769, 'rewards/rejected': -1.3558356761932373, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8098753690719604, 'logps/chosen': -234.43658447265625, 'logps/rejected': -558.2542724609375, 'logps/ref_chosen': -42.703250885009766, 'logps/ref_rejected': -79.43376159667969, 'logits/chosen': -2.5236382484436035, 'logits/rejected': -2.479279041290283, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.002834628103300929, 'epsilon_dpo/loss_margin_mean': 287.087158203125, 'epsilon_dpo/beta_margin_mean': 0.8098753690719604, 'epsilon_dpo/beta_margin_std': 0.763114869594574, 'epsilon_dpo/beta_margin_grad_mean': -0.32642942667007446, 'epsilon_dpo/beta_margin_grad_std': 0.15271244943141937, 'kl/beta': 0.0028530678246170282, 'kl/avg_steps': 0.65625, 'epoch': 0.66} 66%|███████████████████████████████████████████████████████████████████████████▍ | 451/681 [30:25<09:33, 2.49s/it] 66%|███████████████████████████████████████████████████████████████████████████▋ | 452/681 [30:28<09:56, 
2.60s/it] {'loss': 0.7379, 'grad_norm': 40.06062316894531, 'learning_rate': 1.549222776991186e-07, 'rewards/chosen': -0.3679235577583313, 'rewards/rejected': -1.315352439880371, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.9474288821220398, 'logps/chosen': -166.42276000976562, 'logps/rejected': -570.5884399414062, 'logps/ref_chosen': -35.80718231201172, 'logps/ref_rejected': -102.27734375, 'logits/chosen': -2.5114731788635254, 'logits/rejected': -2.680518627166748, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0028108321130275726, 'epsilon_dpo/loss_margin_mean': 337.69549560546875, 'epsilon_dpo/beta_margin_mean': 0.9474288821220398, 'epsilon_dpo/beta_margin_std': 0.651831328868866, 'epsilon_dpo/beta_margin_grad_mean': -0.2964177429676056, 'epsilon_dpo/beta_margin_grad_std': 0.12497884780168533, 'kl/beta': 0.002834466751664877, 'kl/avg_steps': 0.84375, 'epoch': 0.66} 66%|███████████████████████████████████████████████████████████████████████████▋ | 452/681 [30:28<09:56, 2.60s/it] 67%|███████████████████████████████████████████████████████████████████████████▊ | 453/681 [30:30<09:34, 2.52s/it] {'loss': 0.7343, 'grad_norm': 39.58503723144531, 'learning_rate': 1.5373663637339584e-07, 'rewards/chosen': -0.4305250644683838, 'rewards/rejected': -1.3857460021972656, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.9552209973335266, 'logps/chosen': -195.30682373046875, 'logps/rejected': -585.105224609375, 'logps/ref_chosen': -41.37712860107422, 'logps/ref_rejected': -87.86880493164062, 'logits/chosen': -2.605818748474121, 'logits/rejected': -2.60537052154541, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.002787314122542739, 'epsilon_dpo/loss_margin_mean': 343.3067321777344, 'epsilon_dpo/beta_margin_mean': 0.9552209973335266, 'epsilon_dpo/beta_margin_std': 0.6552383899688721, 'epsilon_dpo/beta_margin_grad_mean': -0.2953357696533203, 'epsilon_dpo/beta_margin_grad_std': 
0.12554973363876343, 'kl/beta': 0.002810751087963581, 'kl/avg_steps': 0.84375, 'epoch': 0.67} 67%|███████████████████████████████████████████████████████████████████████████▊ | 453/681 [30:30<09:34, 2.52s/it] 67%|████████████████████████████████████████████████████████████████████████████ | 454/681 [30:32<09:15, 2.45s/it] {'loss': 0.7676, 'grad_norm': 55.10441207885742, 'learning_rate': 1.5255353167683017e-07, 'rewards/chosen': -0.5324690341949463, 'rewards/rejected': -1.4844902753829956, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.9520212411880493, 'logps/chosen': -236.1055145263672, 'logps/rejected': -627.3410034179688, 'logps/ref_chosen': -44.58696746826172, 'logps/ref_rejected': -90.57184600830078, 'logits/chosen': -2.5729262828826904, 'logits/rejected': -2.6592321395874023, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0027683484368026257, 'epsilon_dpo/loss_margin_mean': 345.2506103515625, 'epsilon_dpo/beta_margin_mean': 0.9520212411880493, 'epsilon_dpo/beta_margin_std': 0.7603475451469421, 'epsilon_dpo/beta_margin_grad_mean': -0.29989561438560486, 'epsilon_dpo/beta_margin_grad_std': 0.14955060184001923, 'kl/beta': 0.0027872337959706783, 'kl/avg_steps': 0.6875, 'epoch': 0.67} 67%|████████████████████████████████████████████████████████████████████████████ | 454/681 [30:32<09:15, 2.45s/it] 67%|████████████████████████████████████████████████████████████████████████████▏ | 455/681 [30:35<09:20, 2.48s/it] {'loss': 0.6379, 'grad_norm': 33.066959381103516, 'learning_rate': 1.5137299478533064e-07, 'rewards/chosen': -0.37842369079589844, 'rewards/rejected': -1.5500357151031494, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.171612024307251, 'logps/chosen': -160.47991943359375, 'logps/rejected': -686.5131225585938, 'logps/ref_chosen': -22.870223999023438, 'logps/ref_rejected': -121.32386779785156, 'logits/chosen': -2.561234474182129, 'logits/rejected': -2.784088611602783, 'kl/p_epsilon_steps': 0.9375, 
'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0027442548889666796, 'epsilon_dpo/loss_margin_mean': 427.5795593261719, 'epsilon_dpo/beta_margin_mean': 1.1716121435165405, 'epsilon_dpo/beta_margin_std': 0.7572227120399475, 'epsilon_dpo/beta_margin_grad_mean': -0.26028749346733093, 'epsilon_dpo/beta_margin_grad_std': 0.12892557680606842, 'kl/beta': 0.002768202219158411, 'kl/avg_steps': 0.875, 'epoch': 0.67} 67%|████████████████████████████████████████████████████████████████████████████▏ | 455/681 [30:35<09:20, 2.48s/it] 67%|████████████████████████████████████████████████████████████████████████████▎ | 456/681 [30:37<09:10, 2.45s/it] {'loss': 0.7064, 'grad_norm': 35.979026794433594, 'learning_rate': 1.5019505680714232e-07, 'rewards/chosen': -0.4502452611923218, 'rewards/rejected': -1.5035851001739502, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.053339958190918, 'logps/chosen': -205.63754272460938, 'logps/rejected': -664.3036499023438, 'logps/ref_chosen': -40.844276428222656, 'logps/ref_rejected': -111.70032501220703, 'logits/chosen': -2.559840202331543, 'logits/rejected': -2.7625198364257812, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.002723881509155035, 'epsilon_dpo/loss_margin_mean': 387.81005859375, 'epsilon_dpo/beta_margin_mean': 1.053339958190918, 'epsilon_dpo/beta_margin_std': 0.7655569314956665, 'epsilon_dpo/beta_margin_grad_mean': -0.2826097905635834, 'epsilon_dpo/beta_margin_grad_std': 0.13924254477024078, 'kl/beta': 0.002744190627709031, 'kl/avg_steps': 0.75, 'epoch': 0.67} 67%|████████████████████████████████████████████████████████████████████████████▎ | 456/681 [30:37<09:10, 2.45s/it] 67%|████████████████████████████████████████████████████████████████████████████▌ | 457/681 [30:40<09:15, 2.48s/it] {'loss': 0.7706, 'grad_norm': 48.26573944091797, 'learning_rate': 1.4901974878202627e-07, 'rewards/chosen': -0.47887659072875977, 'rewards/rejected': -1.3794221878051758, 'rewards/accuracies': 0.921875, 
'rewards/margins': 0.900545597076416, 'logps/chosen': -215.42941284179688, 'logps/rejected': -601.1631469726562, 'logps/ref_chosen': -38.554141998291016, 'logps/ref_rejected': -90.09440612792969, 'logits/chosen': -2.6022448539733887, 'logits/rejected': -2.619676351547241, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0027019022963941097, 'epsilon_dpo/loss_margin_mean': 334.1934509277344, 'epsilon_dpo/beta_margin_mean': 0.900545597076416, 'epsilon_dpo/beta_margin_std': 0.6524316668510437, 'epsilon_dpo/beta_margin_grad_mean': -0.3032781481742859, 'epsilon_dpo/beta_margin_grad_std': 0.13411802053451538, 'kl/beta': 0.002723762532696128, 'kl/avg_steps': 0.8125, 'epoch': 0.67} 67%|████████████████████████████████████████████████████████████████████████████▌ | 457/681 [30:40<09:15, 2.48s/it] 67%|████████████████████████████████████████████████████████████████████████████▋ | 458/681 [30:42<09:09, 2.46s/it] {'loss': 0.732, 'grad_norm': 49.126129150390625, 'learning_rate': 1.4784710168044212e-07, 'rewards/chosen': -0.4552924931049347, 'rewards/rejected': -1.4071264266967773, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.951833963394165, 'logps/chosen': -206.733642578125, 'logps/rejected': -629.8433227539062, 'logps/ref_chosen': -37.41191482543945, 'logps/ref_rejected': -104.581298828125, 'logits/chosen': -2.625985622406006, 'logits/rejected': -2.772733211517334, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.002680970588698983, 'epsilon_dpo/loss_margin_mean': 355.9402770996094, 'epsilon_dpo/beta_margin_mean': 0.951833963394165, 'epsilon_dpo/beta_margin_std': 0.6352096199989319, 'epsilon_dpo/beta_margin_grad_mean': -0.29450592398643494, 'epsilon_dpo/beta_margin_grad_std': 0.12352831661701202, 'kl/beta': 0.0027018103282898664, 'kl/avg_steps': 0.78125, 'epoch': 0.67} 67%|████████████████████████████████████████████████████████████████████████████▋ | 458/681 [30:42<09:09, 2.46s/it] 
67%|████████████████████████████████████████████████████████████████████████████▊ | 459/681 [30:45<09:07, 2.47s/it] {'loss': 0.8004, 'grad_norm': 36.275413513183594, 'learning_rate': 1.466771464027316e-07, 'rewards/chosen': -0.4260733127593994, 'rewards/rejected': -1.2596666812896729, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.8335932493209839, 'logps/chosen': -192.38290405273438, 'logps/rejected': -565.0028076171875, 'logps/ref_chosen': -32.51487350463867, 'logps/ref_rejected': -90.99087524414062, 'logits/chosen': -2.501951217651367, 'logits/rejected': -2.667633533477783, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.002659349935129285, 'epsilon_dpo/loss_margin_mean': 314.1438903808594, 'epsilon_dpo/beta_margin_mean': 0.8335932493209839, 'epsilon_dpo/beta_margin_std': 0.6189974546432495, 'epsilon_dpo/beta_margin_grad_mean': -0.31732696294784546, 'epsilon_dpo/beta_margin_grad_std': 0.12530642747879028, 'kl/beta': 0.0026808660477399826, 'kl/avg_steps': 0.8125, 'epoch': 0.67} 67%|████████████████████████████████████████████████████████████████████████████▊ | 459/681 [30:45<09:07, 2.47s/it] 68%|█████████████████████████████████████████████████████████████████████████████ | 460/681 [30:47<09:10, 2.49s/it] {'loss': 0.7211, 'grad_norm': 46.03992462158203, 'learning_rate': 1.4550991377830423e-07, 'rewards/chosen': -0.4630896747112274, 'rewards/rejected': -1.4168858528137207, 'rewards/accuracies': 0.96875, 'rewards/margins': 0.9537962079048157, 'logps/chosen': -206.57542419433594, 'logps/rejected': -648.7615966796875, 'logps/ref_chosen': -31.02279281616211, 'logps/ref_rejected': -110.9461669921875, 'logits/chosen': -2.543520212173462, 'logits/rejected': -2.7920000553131104, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.0026354235596954823, 'epsilon_dpo/loss_margin_mean': 362.2628479003906, 'epsilon_dpo/beta_margin_mean': 0.9537962079048157, 'epsilon_dpo/beta_margin_std': 
0.6000516414642334, 'epsilon_dpo/beta_margin_grad_mean': -0.2933025658130646, 'epsilon_dpo/beta_margin_grad_std': 0.11362046003341675, 'kl/beta': 0.0026592593640089035, 'kl/avg_steps': 0.90625, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████████████████████████████ | 460/681 [30:47<09:10, 2.49s/it] 68%|█████████████████████████████████████████████████████████████████████████████▏ | 461/681 [30:50<09:04, 2.48s/it] {'loss': 0.7926, 'grad_norm': 37.75797653198242, 'learning_rate': 1.4434543456482518e-07, 'rewards/chosen': -0.44818025827407837, 'rewards/rejected': -1.2731022834777832, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.8249219655990601, 'logps/chosen': -206.56666564941406, 'logps/rejected': -580.9017333984375, 'logps/ref_chosen': -35.32524108886719, 'logps/ref_rejected': -93.41868591308594, 'logits/chosen': -2.542848587036133, 'logits/rejected': -2.7294955253601074, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.002612577984109521, 'epsilon_dpo/loss_margin_mean': 316.2416687011719, 'epsilon_dpo/beta_margin_mean': 0.8249220252037048, 'epsilon_dpo/beta_margin_std': 0.5639069676399231, 'epsilon_dpo/beta_margin_grad_mean': -0.31691277027130127, 'epsilon_dpo/beta_margin_grad_std': 0.11491651087999344, 'kl/beta': 0.0026353762950748205, 'kl/avg_steps': 0.875, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████████████████████████████▏ | 461/681 [30:50<09:04, 2.48s/it] 68%|█████████████████████████████████████████████████████████████████████████████▎ | 462/681 [30:52<08:56, 2.45s/it] {'loss': 0.8226, 'grad_norm': 40.553749084472656, 'learning_rate': 1.4318373944740484e-07, 'rewards/chosen': -0.446227490901947, 'rewards/rejected': -1.2592161893844604, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8129886984825134, 'logps/chosen': -216.20462036132812, 'logps/rejected': -571.200439453125, 'logps/ref_chosen': -44.890872955322266, 'logps/ref_rejected': -85.42142486572266, 
'logits/chosen': -2.639768123626709, 'logits/rejected': -2.776127338409424, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0025939985644072294, 'epsilon_dpo/loss_margin_mean': 314.4653015136719, 'epsilon_dpo/beta_margin_mean': 0.8129886984825134, 'epsilon_dpo/beta_margin_std': 0.6604252457618713, 'epsilon_dpo/beta_margin_grad_mean': -0.3237418830394745, 'epsilon_dpo/beta_margin_grad_std': 0.13112208247184753, 'kl/beta': 0.00261251674965024, 'kl/avg_steps': 0.71875, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████████████████████████████▎ | 462/681 [30:52<08:56, 2.45s/it] 68%|█████████████████████████████████████████████████████████████████████████████▌ | 463/681 [30:55<09:11, 2.53s/it] {'loss': 0.712, 'grad_norm': 36.294189453125, 'learning_rate': 1.4202485903778976e-07, 'rewards/chosen': -0.32232174277305603, 'rewards/rejected': -1.3272720575332642, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.0049502849578857, 'logps/chosen': -152.20263671875, 'logps/rejected': -614.646240234375, 'logps/ref_chosen': -27.038963317871094, 'logps/ref_rejected': -98.03726196289062, 'logits/chosen': -2.6299610137939453, 'logits/rejected': -2.773591995239258, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.0025706232991069555, 'epsilon_dpo/loss_margin_mean': 391.4453125, 'epsilon_dpo/beta_margin_mean': 1.0049502849578857, 'epsilon_dpo/beta_margin_std': 0.6909827589988708, 'epsilon_dpo/beta_margin_grad_mean': -0.28741469979286194, 'epsilon_dpo/beta_margin_grad_std': 0.12530295550823212, 'kl/beta': 0.0025938733015209436, 'kl/avg_steps': 0.90625, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████████████████████████████▌ | 463/681 [30:55<09:11, 2.53s/it] 68%|█████████████████████████████████████████████████████████████████████████████▋ | 464/681 [30:57<09:20, 2.58s/it] {'loss': 0.6506, 'grad_norm': 48.614654541015625, 'learning_rate': 1.4086882387355658e-07, 
'rewards/chosen': -0.3993784487247467, 'rewards/rejected': -1.5100371837615967, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.1106586456298828, 'logps/chosen': -189.71588134765625, 'logps/rejected': -701.7393798828125, 'logps/ref_chosen': -33.55242919921875, 'logps/ref_rejected': -109.08905029296875, 'logits/chosen': -2.5940399169921875, 'logits/rejected': -2.805851697921753, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.00254994654096663, 'epsilon_dpo/loss_margin_mean': 436.4868469238281, 'epsilon_dpo/beta_margin_mean': 1.1106587648391724, 'epsilon_dpo/beta_margin_std': 0.6612549424171448, 'epsilon_dpo/beta_margin_grad_mean': -0.2655731737613678, 'epsilon_dpo/beta_margin_grad_std': 0.12443613260984421, 'kl/beta': 0.0025705774314701557, 'kl/avg_steps': 0.8125, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████████████████████████████▋ | 464/681 [30:57<09:20, 2.58s/it] 68%|█████████████████████████████████████████████████████████████████████████████▊ | 465/681 [31:00<09:09, 2.54s/it] {'loss': 0.7035, 'grad_norm': 35.46124267578125, 'learning_rate': 1.3971566441730714e-07, 'rewards/chosen': -0.36298391222953796, 'rewards/rejected': -1.3539800643920898, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.9909961223602295, 'logps/chosen': -178.53143310546875, 'logps/rejected': -652.1649780273438, 'logps/ref_chosen': -35.28005599975586, 'logps/ref_rejected': -116.3499755859375, 'logits/chosen': -2.6725172996520996, 'logits/rejected': -2.8897171020507812, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0025285982992500067, 'epsilon_dpo/loss_margin_mean': 392.5635986328125, 'epsilon_dpo/beta_margin_mean': 0.9909960627555847, 'epsilon_dpo/beta_margin_std': 0.621567964553833, 'epsilon_dpo/beta_margin_grad_mean': -0.28670158982276917, 'epsilon_dpo/beta_margin_grad_std': 0.11507753282785416, 'kl/beta': 0.0025498599279671907, 'kl/avg_steps': 0.84375, 'epoch': 0.68} 
68%|█████████████████████████████████████████████████████████████████████████████▊ | 465/681 [31:00<09:09, 2.54s/it] 68%|██████████████████████████████████████████████████████████████████████████████ | 466/681 [31:02<09:08, 2.55s/it] {'loss': 0.749, 'grad_norm': 32.29966354370117, 'learning_rate': 1.3856541105586545e-07, 'rewards/chosen': -0.3691789507865906, 'rewards/rejected': -1.2763590812683105, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.9071800112724304, 'logps/chosen': -175.187255859375, 'logps/rejected': -607.10595703125, 'logps/ref_chosen': -28.18646240234375, 'logps/ref_rejected': -97.64432525634766, 'logits/chosen': -2.5877268314361572, 'logits/rejected': -2.7581279277801514, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0025074416771531105, 'epsilon_dpo/loss_margin_mean': 362.46087646484375, 'epsilon_dpo/beta_margin_mean': 0.9071800112724304, 'epsilon_dpo/beta_margin_std': 0.6066918969154358, 'epsilon_dpo/beta_margin_grad_mean': -0.3023064434528351, 'epsilon_dpo/beta_margin_grad_std': 0.11443696171045303, 'kl/beta': 0.002528525423258543, 'kl/avg_steps': 0.84375, 'epoch': 0.68} 68%|██████████████████████████████████████████████████████████████████████████████ | 466/681 [31:02<09:08, 2.55s/it] 69%|██████████████████████████████████████████████████████████████████████████████▏ | 467/681 [31:05<09:00, 2.52s/it] {'loss': 0.7652, 'grad_norm': 47.59917449951172, 'learning_rate': 1.3741809409947729e-07, 'rewards/chosen': -0.44280189275741577, 'rewards/rejected': -1.4351122379302979, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9923102259635925, 'logps/chosen': -223.679443359375, 'logps/rejected': -687.1947021484375, 'logps/ref_chosen': -46.7025146484375, 'logps/ref_rejected': -110.00337982177734, 'logits/chosen': -2.691765308380127, 'logits/rejected': -2.9094629287719727, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0024880296550691128, 
'epsilon_dpo/loss_margin_mean': 400.21441650390625, 'epsilon_dpo/beta_margin_mean': 0.9923102259635925, 'epsilon_dpo/beta_margin_std': 0.829995334148407, 'epsilon_dpo/beta_margin_grad_mean': -0.2942372262477875, 'epsilon_dpo/beta_margin_grad_std': 0.15898768603801727, 'kl/beta': 0.002507369499653578, 'kl/avg_steps': 0.78125, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▏ | 467/681 [31:05<09:00, 2.52s/it] 69%|██████████████████████████████████████████████████████████████████████████████▎ | 468/681 [31:07<09:05, 2.56s/it] {'loss': 0.728, 'grad_norm': 40.286834716796875, 'learning_rate': 1.362737437810114e-07, 'rewards/chosen': -0.37121495604515076, 'rewards/rejected': -1.3838683366775513, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.0126533508300781, 'logps/chosen': -192.12745666503906, 'logps/rejected': -670.7412109375, 'logps/ref_chosen': -42.05735778808594, 'logps/ref_rejected': -109.48826599121094, 'logits/chosen': -2.67857027053833, 'logits/rejected': -2.87001895904541, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0024679647758603096, 'epsilon_dpo/loss_margin_mean': 411.1828918457031, 'epsilon_dpo/beta_margin_mean': 1.0126533508300781, 'epsilon_dpo/beta_margin_std': 0.7630032896995544, 'epsilon_dpo/beta_margin_grad_mean': -0.29025694727897644, 'epsilon_dpo/beta_margin_grad_std': 0.14004936814308167, 'kl/beta': 0.002487932564690709, 'kl/avg_steps': 0.8125, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▎ | 468/681 [31:08<09:05, 2.56s/it] 69%|██████████████████████████████████████████████████████████████████████████████▌ | 469/681 [31:10<08:47, 2.49s/it] {'loss': 0.7019, 'grad_norm': 48.018531799316406, 'learning_rate': 1.351323902551631e-07, 'rewards/chosen': -0.4174906611442566, 'rewards/rejected': -1.417689561843872, 'rewards/accuracies': 0.96875, 'rewards/margins': 1.0001988410949707, 'logps/chosen': 
-206.51370239257812, 'logps/rejected': -690.9351806640625, 'logps/ref_chosen': -36.20582580566406, 'logps/ref_rejected': -111.1355972290039, 'logits/chosen': -2.6168570518493652, 'logits/rejected': -2.8470990657806396, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.002445760415866971, 'epsilon_dpo/loss_margin_mean': 409.49169921875, 'epsilon_dpo/beta_margin_mean': 1.0001988410949707, 'epsilon_dpo/beta_margin_std': 0.6260125637054443, 'epsilon_dpo/beta_margin_grad_mean': -0.2844110131263733, 'epsilon_dpo/beta_margin_grad_std': 0.1195027157664299, 'kl/beta': 0.002467880956828594, 'kl/avg_steps': 0.90625, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▌ | 469/681 [31:10<08:47, 2.49s/it] 69%|██████████████████████████████████████████████████████████████████████████████▋ | 470/681 [31:12<08:30, 2.42s/it] {'loss': 0.7329, 'grad_norm': 35.75682830810547, 'learning_rate': 1.339940635976592e-07, 'rewards/chosen': -0.30687910318374634, 'rewards/rejected': -1.2565128803253174, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.9496338367462158, 'logps/chosen': -157.40719604492188, 'logps/rejected': -605.90576171875, 'logps/ref_chosen': -31.09160804748535, 'logps/ref_rejected': -87.30916595458984, 'logits/chosen': -2.6013760566711426, 'logits/rejected': -2.87990665435791, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.002425323473289609, 'epsilon_dpo/loss_margin_mean': 392.281005859375, 'epsilon_dpo/beta_margin_mean': 0.9496338963508606, 'epsilon_dpo/beta_margin_std': 0.6434259414672852, 'epsilon_dpo/beta_margin_grad_mean': -0.2954522371292114, 'epsilon_dpo/beta_margin_grad_std': 0.12080518156290054, 'kl/beta': 0.002445716643705964, 'kl/avg_steps': 0.84375, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▋ | 470/681 [31:12<08:30, 2.42s/it] 
69%|██████████████████████████████████████████████████████████████████████████████▊ | 471/681 [31:14<08:19, 2.38s/it] {'loss': 0.745, 'grad_norm': 35.150997161865234, 'learning_rate': 1.3285879380446563e-07, 'rewards/chosen': -0.4026249945163727, 'rewards/rejected': -1.3193340301513672, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.9167090654373169, 'logps/chosen': -211.00067138671875, 'logps/rejected': -638.4447631835938, 'logps/ref_chosen': -44.132484436035156, 'logps/ref_rejected': -89.70744323730469, 'logits/chosen': -2.7026596069335938, 'logits/rejected': -2.8277225494384766, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.002405789215117693, 'epsilon_dpo/loss_margin_mean': 381.869140625, 'epsilon_dpo/beta_margin_mean': 0.9167090654373169, 'epsilon_dpo/beta_margin_std': 0.606320858001709, 'epsilon_dpo/beta_margin_grad_mean': -0.30035099387168884, 'epsilon_dpo/beta_margin_grad_std': 0.11763791739940643, 'kl/beta': 0.002425253624096513, 'kl/avg_steps': 0.8125, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▊ | 471/681 [31:14<08:19, 2.38s/it] 69%|███████████████████████████████████████████████████████████████████████████████ | 472/681 [31:17<08:34, 2.46s/it] {'loss': 0.778, 'grad_norm': 36.8669548034668, 'learning_rate': 1.317266107909975e-07, 'rewards/chosen': -0.4928061366081238, 'rewards/rejected': -1.4490196704864502, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.9562135934829712, 'logps/chosen': -265.7175598144531, 'logps/rejected': -731.2271728515625, 'logps/ref_chosen': -59.834197998046875, 'logps/ref_rejected': -123.57960510253906, 'logits/chosen': -2.770143985748291, 'logits/rejected': -2.981727361679077, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.002386399544775486, 'epsilon_dpo/loss_margin_mean': 401.76422119140625, 'epsilon_dpo/beta_margin_mean': 0.9562135934829712, 'epsilon_dpo/beta_margin_std': 
0.8416454792022705, 'epsilon_dpo/beta_margin_grad_mean': -0.3045946955680847, 'epsilon_dpo/beta_margin_grad_std': 0.14598874747753143, 'kl/beta': 0.002405707258731127, 'kl/avg_steps': 0.8125, 'epoch': 0.69} 69%|███████████████████████████████████████████████████████████████████████████████ | 472/681 [31:17<08:34, 2.46s/it] 69%|███████████████████████████████████████████████████████████████████████████████▏ | 473/681 [31:20<08:42, 2.51s/it] {'loss': 0.8578, 'grad_norm': 50.768165588378906, 'learning_rate': 1.3059754439133002e-07, 'rewards/chosen': -0.4595049023628235, 'rewards/rejected': -1.2219587564468384, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.7624538540840149, 'logps/chosen': -238.464599609375, 'logps/rejected': -602.4376220703125, 'logps/ref_chosen': -44.87860870361328, 'logps/ref_rejected': -85.82889556884766, 'logits/chosen': -2.749878406524658, 'logits/rejected': -2.8519415855407715, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0023671663366258144, 'epsilon_dpo/loss_margin_mean': 323.02276611328125, 'epsilon_dpo/beta_margin_mean': 0.7624538540840149, 'epsilon_dpo/beta_margin_std': 0.6814138889312744, 'epsilon_dpo/beta_margin_grad_mean': -0.3334435224533081, 'epsilon_dpo/beta_margin_grad_std': 0.130352184176445, 'kl/beta': 0.0023863185197114944, 'kl/avg_steps': 0.8125, 'epoch': 0.69} 69%|███████████████████████████████████████████████████████████████████████████████▏ | 473/681 [31:20<08:42, 2.51s/it] 70%|███████████████████████████████████████████████████████████████████████████████▎ | 474/681 [31:22<08:45, 2.54s/it] {'loss': 0.7665, 'grad_norm': 44.010650634765625, 'learning_rate': 1.2947162435741277e-07, 'rewards/chosen': -0.38724249601364136, 'rewards/rejected': -1.2767717838287354, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.889529287815094, 'logps/chosen': -195.01571655273438, 'logps/rejected': -641.80322265625, 'logps/ref_chosen': -30.269367218017578, 'logps/ref_rejected': -97.37470245361328, 
'logits/chosen': -2.6113648414611816, 'logits/rejected': -2.8044967651367188, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0023473482578992844, 'epsilon_dpo/loss_margin_mean': 379.68218994140625, 'epsilon_dpo/beta_margin_mean': 0.889529287815094, 'epsilon_dpo/beta_margin_std': 0.6321231126785278, 'epsilon_dpo/beta_margin_grad_mean': -0.3070417642593384, 'epsilon_dpo/beta_margin_grad_std': 0.12120737135410309, 'kl/beta': 0.00236708577722311, 'kl/avg_steps': 0.84375, 'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▎ | 474/681 [31:22<08:45, 2.54s/it] 70%|███████████████████████████████████████████████████████████████████████████████▌ | 475/681 [31:25<08:43, 2.54s/it] {'loss': 0.8274, 'grad_norm': 35.23362350463867, 'learning_rate': 1.2834888035828596e-07, 'rewards/chosen': -0.4007193148136139, 'rewards/rejected': -1.2023236751556396, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8016044497489929, 'logps/chosen': -206.45596313476562, 'logps/rejected': -611.5294189453125, 'logps/ref_chosen': -34.96168518066406, 'logps/ref_rejected': -94.91036987304688, 'logits/chosen': -2.689065933227539, 'logits/rejected': -2.8997483253479004, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.002328441943973303, 'epsilon_dpo/loss_margin_mean': 345.1247863769531, 'epsilon_dpo/beta_margin_mean': 0.8016044497489929, 'epsilon_dpo/beta_margin_std': 0.6676663160324097, 'epsilon_dpo/beta_margin_grad_mean': -0.32659581303596497, 'epsilon_dpo/beta_margin_grad_std': 0.1249208152294159, 'kl/beta': 0.0023472807370126247, 'kl/avg_steps': 0.8125, 'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▌ | 475/681 [31:25<08:43, 2.54s/it] 70%|███████████████████████████████████████████████████████████████████████████████▋ | 476/681 [31:27<08:24, 2.46s/it] {'loss': 0.8208, 'grad_norm': 40.693206787109375, 'learning_rate': 
1.2722934197929802e-07, 'rewards/chosen': -0.37969422340393066, 'rewards/rejected': -1.183998942375183, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8043047189712524, 'logps/chosen': -192.21990966796875, 'logps/rejected': -591.2510986328125, 'logps/ref_chosen': -28.34685516357422, 'logps/ref_rejected': -78.29444885253906, 'logits/chosen': -2.668698310852051, 'logits/rejected': -2.884255886077881, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.002310403622686863, 'epsilon_dpo/loss_margin_mean': 349.0835876464844, 'epsilon_dpo/beta_margin_mean': 0.8043047189712524, 'epsilon_dpo/beta_margin_std': 0.6316364407539368, 'epsilon_dpo/beta_margin_grad_mean': -0.32414335012435913, 'epsilon_dpo/beta_margin_grad_std': 0.12600372731685638, 'kl/beta': 0.0023283627815544605, 'kl/avg_steps': 0.78125, 'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▋ | 476/681 [31:27<08:24, 2.46s/it] 70%|███████████████████████████████████████████████████████████████████████████████▊ | 477/681 [31:30<08:28, 2.49s/it] {'loss': 0.802, 'grad_norm': 37.30575942993164, 'learning_rate': 1.2611303872132631e-07, 'rewards/chosen': -0.40344566106796265, 'rewards/rejected': -1.219561219215393, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.8161154985427856, 'logps/chosen': -217.6685028076172, 'logps/rejected': -614.880126953125, 'logps/ref_chosen': -42.40911865234375, 'logps/ref_rejected': -82.68942260742188, 'logits/chosen': -2.7471470832824707, 'logits/rejected': -2.842437267303467, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0022924933582544327, 'epsilon_dpo/loss_margin_mean': 356.9313049316406, 'epsilon_dpo/beta_margin_mean': 0.8161155581474304, 'epsilon_dpo/beta_margin_std': 0.5828248858451843, 'epsilon_dpo/beta_margin_grad_mean': -0.3193974196910858, 'epsilon_dpo/beta_margin_grad_std': 0.11756820976734161, 'kl/beta': 0.002310313517227769, 'kl/avg_steps': 0.78125, 
'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▊ | 477/681 [31:30<08:28, 2.49s/it] 70%|████████████████████████████████████████████████████████████████████████████████ | 478/681 [31:32<08:10, 2.42s/it] {'loss': 0.7954, 'grad_norm': 39.92375564575195, 'learning_rate': 1.2500000000000005e-07, 'rewards/chosen': -0.407898485660553, 'rewards/rejected': -1.2313954830169678, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8234970569610596, 'logps/chosen': -207.5630340576172, 'logps/rejected': -632.2252197265625, 'logps/ref_chosen': -28.737815856933594, 'logps/ref_rejected': -90.47331237792969, 'logits/chosen': -2.6334609985351562, 'logits/rejected': -2.921363592147827, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0022747220937162638, 'epsilon_dpo/loss_margin_mean': 362.92669677734375, 'epsilon_dpo/beta_margin_mean': 0.8234969973564148, 'epsilon_dpo/beta_margin_std': 0.577464759349823, 'epsilon_dpo/beta_margin_grad_mean': -0.3177875578403473, 'epsilon_dpo/beta_margin_grad_std': 0.11528382450342178, 'kl/beta': 0.0022924039512872696, 'kl/avg_steps': 0.78125, 'epoch': 0.7} 70%|████████████████████████████████████████████████████████████████████████████████ | 478/681 [31:32<08:10, 2.42s/it] 70%|████████████████████████████████████████████████████████████████████████████████▏ | 479/681 [31:34<08:13, 2.44s/it] {'loss': 0.7896, 'grad_norm': 42.03719711303711, 'learning_rate': 1.2389025514492456e-07, 'rewards/chosen': -0.37124890089035034, 'rewards/rejected': -1.2771828174591064, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.9059340357780457, 'logps/chosen': -194.02374267578125, 'logps/rejected': -670.686767578125, 'logps/ref_chosen': -29.93898582458496, 'logps/ref_rejected': -104.23573303222656, 'logits/chosen': -2.710920810699463, 'logits/rejected': -3.0226893424987793, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.002257088664919138, 
'epsilon_dpo/loss_margin_mean': 402.3663330078125, 'epsilon_dpo/beta_margin_mean': 0.9059340357780457, 'epsilon_dpo/beta_margin_std': 0.7668508291244507, 'epsilon_dpo/beta_margin_grad_mean': -0.3112712800502777, 'epsilon_dpo/beta_margin_grad_std': 0.1403777301311493, 'kl/beta': 0.0022746333852410316, 'kl/avg_steps': 0.78125, 'epoch': 0.7} 70%|████████████████████████████████████████████████████████████████████████████████▏ | 479/681 [31:34<08:13, 2.44s/it] 70%|████████████████████████████████████████████████████████████████████████████████▎ | 480/681 [31:37<08:06, 2.42s/it] {'loss': 0.8298, 'grad_norm': 41.93212127685547, 'learning_rate': 1.227838333989088e-07, 'rewards/chosen': -0.4799705743789673, 'rewards/rejected': -1.2608397006988525, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.7808691263198853, 'logps/chosen': -257.56707763671875, 'logps/rejected': -650.7171020507812, 'logps/ref_chosen': -43.97426223754883, 'logps/ref_rejected': -87.41323852539062, 'logits/chosen': -2.7833218574523926, 'logits/rejected': -2.883632183074951, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0022395916748791933, 'epsilon_dpo/loss_margin_mean': 349.7110595703125, 'epsilon_dpo/beta_margin_mean': 0.7808691263198853, 'epsilon_dpo/beta_margin_std': 0.6041759848594666, 'epsilon_dpo/beta_margin_grad_mean': -0.3270338773727417, 'epsilon_dpo/beta_margin_grad_std': 0.12246443331241608, 'kl/beta': 0.002257000654935837, 'kl/avg_steps': 0.78125, 'epoch': 0.7} 70%|████████████████████████████████████████████████████████████████████████████████▎ | 480/681 [31:37<08:06, 2.42s/it] 71%|████████████████████████████████████████████████████████████████████████████████▌ | 481/681 [31:39<08:10, 2.45s/it] {'loss': 0.8128, 'grad_norm': 38.608699798583984, 'learning_rate': 1.2168076391719489e-07, 'rewards/chosen': -0.4300834834575653, 'rewards/rejected': -1.267479658126831, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8373961448669434, 'logps/chosen': 
-229.94204711914062, 'logps/rejected': -669.5849609375, 'logps/ref_chosen': -36.98882293701172, 'logps/ref_rejected': -98.65377807617188, 'logits/chosen': -2.710186004638672, 'logits/rejected': -3.0134506225585938, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0022222306579351425, 'epsilon_dpo/loss_margin_mean': 377.9779968261719, 'epsilon_dpo/beta_margin_mean': 0.8373962044715881, 'epsilon_dpo/beta_margin_std': 0.6917349100112915, 'epsilon_dpo/beta_margin_grad_mean': -0.31972822546958923, 'epsilon_dpo/beta_margin_grad_std': 0.13162846863269806, 'kl/beta': 0.0022395045962184668, 'kl/avg_steps': 0.78125, 'epoch': 0.71} 71%|████████████████████████████████████████████████████████████████████████████████▌ | 481/681 [31:39<08:10, 2.45s/it] 71%|████████████████████████████████████████████████████████████████████████████████▋ | 482/681 [31:42<08:17, 2.50s/it] {'loss': 0.875, 'grad_norm': 43.73499298095703, 'learning_rate': 1.2058107576668938e-07, 'rewards/chosen': -0.4530524015426636, 'rewards/rejected': -1.1681525707244873, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.7151001691818237, 'logps/chosen': -252.25537109375, 'logps/rejected': -622.5506591796875, 'logps/ref_chosen': -47.419219970703125, 'logps/ref_rejected': -92.47096252441406, 'logits/chosen': -2.777127504348755, 'logits/rejected': -3.0118846893310547, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0022050042171031237, 'epsilon_dpo/loss_margin_mean': 325.2435607910156, 'epsilon_dpo/beta_margin_mean': 0.7151001691818237, 'epsilon_dpo/beta_margin_std': 0.6155868172645569, 'epsilon_dpo/beta_margin_grad_mean': -0.34148189425468445, 'epsilon_dpo/beta_margin_grad_std': 0.12353808432817459, 'kl/beta': 0.0022221440449357033, 'kl/avg_steps': 0.78125, 'epoch': 0.71} 71%|████████████████████████████████████████████████████████████████████████████████▋ | 482/681 [31:42<08:17, 2.50s/it] 
71%|████████████████████████████████████████████████████████████████████████████████▊ | 483/681 [31:44<08:14, 2.50s/it] {'loss': 0.7801, 'grad_norm': 37.85783386230469, 'learning_rate': 1.194847979251979e-07, 'rewards/chosen': -0.39537718892097473, 'rewards/rejected': -1.2726057767868042, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.8772286176681519, 'logps/chosen': -219.78411865234375, 'logps/rejected': -683.1080322265625, 'logps/ref_chosen': -39.672393798828125, 'logps/ref_rejected': -100.94681549072266, 'logits/chosen': -2.7550508975982666, 'logits/rejected': -3.005962371826172, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0021872217766940594, 'epsilon_dpo/loss_margin_mean': 402.0494689941406, 'epsilon_dpo/beta_margin_mean': 0.8772286176681519, 'epsilon_dpo/beta_margin_std': 0.6511039733886719, 'epsilon_dpo/beta_margin_grad_mean': -0.3099426329135895, 'epsilon_dpo/beta_margin_grad_std': 0.1277041733264923, 'kl/beta': 0.0022049180697649717, 'kl/avg_steps': 0.8125, 'epoch': 0.71} 71%|████████████████████████████████████████████████████████████████████████████████▊ | 483/681 [31:44<08:14, 2.50s/it] 71%|█████████████████████████████████████████████████████████████████████████████████ | 484/681 [31:47<08:11, 2.50s/it] {'loss': 0.9168, 'grad_norm': 40.76497268676758, 'learning_rate': 1.1839195928066101e-07, 'rewards/chosen': -0.4243118464946747, 'rewards/rejected': -1.095069169998169, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6707573533058167, 'logps/chosen': -237.5784149169922, 'logps/rejected': -594.3012084960938, 'logps/ref_chosen': -43.43277359008789, 'logps/ref_rejected': -89.96736145019531, 'logits/chosen': -2.699538230895996, 'logits/rejected': -2.9171078205108643, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.002174378838390112, 'epsilon_dpo/loss_margin_mean': 310.1882019042969, 'epsilon_dpo/beta_margin_mean': 0.6707573533058167, 'epsilon_dpo/beta_margin_std': 
0.6563224792480469, 'epsilon_dpo/beta_margin_grad_mean': -0.35192134976387024, 'epsilon_dpo/beta_margin_grad_std': 0.1350928395986557, 'kl/beta': 0.002187147503718734, 'kl/avg_steps': 0.59375, 'epoch': 0.71} 71%|█████████████████████████████████████████████████████████████████████████████████ | 484/681 [31:47<08:11, 2.50s/it] 71%|█████████████████████████████████████████████████████████████████████████████████▏ | 485/681 [31:49<07:49, 2.40s/it] {'loss': 0.7557, 'grad_norm': 32.71714782714844, 'learning_rate': 1.1730258863039347e-07, 'rewards/chosen': -0.28553080558776855, 'rewards/rejected': -1.2164714336395264, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.9309406876564026, 'logps/chosen': -169.16915893554688, 'logps/rejected': -675.8591918945312, 'logps/ref_chosen': -37.34454345703125, 'logps/ref_rejected': -111.42447662353516, 'logits/chosen': -2.6770548820495605, 'logits/rejected': -2.999387264251709, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0021574674174189568, 'epsilon_dpo/loss_margin_mean': 432.610107421875, 'epsilon_dpo/beta_margin_mean': 0.9309406876564026, 'epsilon_dpo/beta_margin_std': 0.6903952360153198, 'epsilon_dpo/beta_margin_grad_mean': -0.3015494644641876, 'epsilon_dpo/beta_margin_grad_std': 0.12973572313785553, 'kl/beta': 0.0021742379758507013, 'kl/avg_steps': 0.78125, 'epoch': 0.71} 71%|█████████████████████████████████████████████████████████████████████████████████▏ | 485/681 [31:49<07:49, 2.40s/it] 71%|█████████████████████████████████████████████████████████████████████████████████▎ | 486/681 [31:51<07:35, 2.34s/it] {'loss': 0.8285, 'grad_norm': 35.33386993408203, 'learning_rate': 1.1621671468032493e-07, 'rewards/chosen': -0.34016796946525574, 'rewards/rejected': -1.1584980487823486, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8183300495147705, 'logps/chosen': -193.83453369140625, 'logps/rejected': -640.993896484375, 'logps/ref_chosen': -35.52677536010742, 'logps/ref_rejected': 
-99.17495727539062, 'logits/chosen': -2.755924701690674, 'logits/rejected': -3.058840274810791, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.002140068681910634, 'epsilon_dpo/loss_margin_mean': 383.51116943359375, 'epsilon_dpo/beta_margin_mean': 0.8183300495147705, 'epsilon_dpo/beta_margin_std': 0.6934942007064819, 'epsilon_dpo/beta_margin_grad_mean': -0.3228636085987091, 'epsilon_dpo/beta_margin_grad_std': 0.1375182420015335, 'kl/beta': 0.0021573833655565977, 'kl/avg_steps': 0.8125, 'epoch': 0.71} 71%|█████████████████████████████████████████████████████████████████████████████████▎ | 486/681 [31:51<07:35, 2.34s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▌ | 487/681 [31:54<07:55, 2.45s/it] {'loss': 0.763, 'grad_norm': 43.43318557739258, 'learning_rate': 1.1513436604424378e-07, 'rewards/chosen': -0.3262515366077423, 'rewards/rejected': -1.1740999221801758, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.8478483557701111, 'logps/chosen': -184.3863525390625, 'logps/rejected': -652.44091796875, 'logps/ref_chosen': -31.08715057373047, 'logps/ref_rejected': -98.84352111816406, 'logits/chosen': -2.668581247329712, 'logits/rejected': -3.001206636428833, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0021214832086116076, 'epsilon_dpo/loss_margin_mean': 400.2981872558594, 'epsilon_dpo/beta_margin_mean': 0.8478483557701111, 'epsilon_dpo/beta_margin_std': 0.48607146739959717, 'epsilon_dpo/beta_margin_grad_mean': -0.30856984853744507, 'epsilon_dpo/beta_margin_grad_std': 0.10273387283086777, 'kl/beta': 0.002139996038749814, 'kl/avg_steps': 0.875, 'epoch': 0.72} 72%|█████████████████████████████████████████████████████████████████████████████████▌ | 487/681 [31:54<07:55, 2.45s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▋ | 488/681 [31:57<08:03, 2.51s/it] {'loss': 0.8576, 'grad_norm': 35.68146514892578, 
'learning_rate': 1.1405557124304335e-07, 'rewards/chosen': -0.3142842650413513, 'rewards/rejected': -1.047201156616211, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.7329168319702148, 'logps/chosen': -184.39353942871094, 'logps/rejected': -587.299072265625, 'logps/ref_chosen': -35.27953338623047, 'logps/ref_rejected': -89.09225463867188, 'logits/chosen': -2.7139856815338135, 'logits/rejected': -2.9660186767578125, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0021050700452178717, 'epsilon_dpo/loss_margin_mean': 349.0928039550781, 'epsilon_dpo/beta_margin_mean': 0.7329168319702148, 'epsilon_dpo/beta_margin_std': 0.5872458815574646, 'epsilon_dpo/beta_margin_grad_mean': -0.33707907795906067, 'epsilon_dpo/beta_margin_grad_std': 0.12150773406028748, 'kl/beta': 0.0021214333828538656, 'kl/avg_steps': 0.78125, 'epoch': 0.72} 72%|█████████████████████████████████████████████████████████████████████████████████▋ | 488/681 [31:57<08:03, 2.51s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▊ | 489/681 [31:59<08:14, 2.58s/it] {'loss': 0.8644, 'grad_norm': 40.08867263793945, 'learning_rate': 1.1298035870396985e-07, 'rewards/chosen': -0.3389902710914612, 'rewards/rejected': -1.0370516777038574, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.6980613470077515, 'logps/chosen': -199.35702514648438, 'logps/rejected': -584.06689453125, 'logps/ref_chosen': -37.423851013183594, 'logps/ref_rejected': -87.10142517089844, 'logits/chosen': -2.7437334060668945, 'logits/rejected': -2.992414712905884, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0020880938973277807, 'epsilon_dpo/loss_margin_mean': 335.0322570800781, 'epsilon_dpo/beta_margin_mean': 0.6980612874031067, 'epsilon_dpo/beta_margin_std': 0.5159537196159363, 'epsilon_dpo/beta_margin_grad_mean': -0.34188124537467957, 'epsilon_dpo/beta_margin_grad_std': 0.10775003582239151, 'kl/beta': 
0.0021049880888313055, 'kl/avg_steps': 0.8125, 'epoch': 0.72} 72%|█████████████████████████████████████████████████████████████████████████████████▊ | 489/681 [31:59<08:14, 2.58s/it] 72%|██████████████████████████████████████████████████████████████████████████████████ | 490/681 [32:02<08:16, 2.60s/it] {'loss': 0.9131, 'grad_norm': 42.104305267333984, 'learning_rate': 1.1190875675987355e-07, 'rewards/chosen': -0.42905905842781067, 'rewards/rejected': -1.1452784538269043, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7162193059921265, 'logps/chosen': -247.21926879882812, 'logps/rejected': -668.4090576171875, 'logps/ref_chosen': -41.46424102783203, 'logps/ref_rejected': -115.67326354980469, 'logits/chosen': -2.7477211952209473, 'logits/rejected': -3.100935935974121, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.002076485427096486, 'epsilon_dpo/loss_margin_mean': 346.98077392578125, 'epsilon_dpo/beta_margin_mean': 0.7162193655967712, 'epsilon_dpo/beta_margin_std': 0.7605259418487549, 'epsilon_dpo/beta_margin_grad_mean': -0.3476724922657013, 'epsilon_dpo/beta_margin_grad_std': 0.1495998054742813, 'kl/beta': 0.0020880228839814663, 'kl/avg_steps': 0.5625, 'epoch': 0.72} 72%|██████████████████████████████████████████████████████████████████████████████████ | 490/681 [32:02<08:16, 2.60s/it] 72%|██████████████████████████████████████████████████████████████████████████████████▏ | 491/681 [32:04<08:08, 2.57s/it] {'loss': 0.9096, 'grad_norm': 40.64850997924805, 'learning_rate': 1.1084079364846241e-07, 'rewards/chosen': -0.38356825709342957, 'rewards/rejected': -1.0272889137268066, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.6437206268310547, 'logps/chosen': -226.95785522460938, 'logps/rejected': -578.698974609375, 'logps/ref_chosen': -41.33907699584961, 'logps/ref_rejected': -79.69932556152344, 'logits/chosen': -2.7214245796203613, 'logits/rejected': -3.0119452476501465, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 
0.09375, 'epsilon_dpo/beta': 0.002060000551864505, 'epsilon_dpo/loss_margin_mean': 313.3808288574219, 'epsilon_dpo/beta_margin_mean': 0.6437206268310547, 'epsilon_dpo/beta_margin_std': 0.547960102558136, 'epsilon_dpo/beta_margin_grad_mean': -0.3543255031108856, 'epsilon_dpo/beta_margin_grad_std': 0.11612068116664886, 'kl/beta': 0.002076343633234501, 'kl/avg_steps': 0.796875, 'epoch': 0.72} 72%|██████████████████████████████████████████████████████████████████████████████████▏ | 491/681 [32:04<08:08, 2.57s/it] 72%|██████████████████████████████████████████████████████████████████████████████████▎ | 492/681 [32:07<07:47, 2.47s/it] {'loss': 0.929, 'grad_norm': 34.33412551879883, 'learning_rate': 1.097764975115576e-07, 'rewards/chosen': -0.3345920443534851, 'rewards/rejected': -0.9495918154716492, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.6149997711181641, 'logps/chosen': -195.3032989501953, 'logps/rejected': -544.8341064453125, 'logps/ref_chosen': -31.90703582763672, 'logps/ref_rejected': -79.67924499511719, 'logits/chosen': -2.7548234462738037, 'logits/rejected': -2.9680824279785156, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0020440397784113884, 'epsilon_dpo/loss_margin_mean': 301.7585754394531, 'epsilon_dpo/beta_margin_mean': 0.6149997711181641, 'epsilon_dpo/beta_margin_std': 0.5465402603149414, 'epsilon_dpo/beta_margin_grad_mean': -0.360959529876709, 'epsilon_dpo/beta_margin_grad_std': 0.11504260450601578, 'kl/beta': 0.0020599286071956158, 'kl/avg_steps': 0.78125, 'epoch': 0.72} 72%|██████████████████████████████████████████████████████████████████████████████████▎ | 492/681 [32:07<07:47, 2.47s/it] 72%|██████████████████████████████████████████████████████████████████████████████████▌ | 493/681 [32:09<07:52, 2.51s/it] {'loss': 0.9167, 'grad_norm': 38.02676773071289, 'learning_rate': 1.0871589639435203e-07, 'rewards/chosen': -0.3943876028060913, 'rewards/rejected': -1.0185136795043945, 'rewards/accuracies': 
0.890625, 'rewards/margins': 0.6241260766983032, 'logps/chosen': -245.80068969726562, 'logps/rejected': -593.231201171875, 'logps/ref_chosen': -52.45185089111328, 'logps/ref_rejected': -91.2623291015625, 'logits/chosen': -2.806032657623291, 'logits/rejected': -3.0264089107513428, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0020301109179854393, 'epsilon_dpo/loss_margin_mean': 308.62005615234375, 'epsilon_dpo/beta_margin_mean': 0.6241260170936584, 'epsilon_dpo/beta_margin_std': 0.5125545859336853, 'epsilon_dpo/beta_margin_grad_mean': -0.3570352792739868, 'epsilon_dpo/beta_margin_grad_std': 0.11298612505197525, 'kl/beta': 0.0020439601503312588, 'kl/avg_steps': 0.6875, 'epoch': 0.72} 72%|██████████████████████████████████████████████████████████████████████████████████▌ | 493/681 [32:09<07:52, 2.51s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▋ | 494/681 [32:12<07:44, 2.48s/it] {'loss': 0.8385, 'grad_norm': 42.489864349365234, 'learning_rate': 1.0765901824467166e-07, 'rewards/chosen': -0.28744399547576904, 'rewards/rejected': -1.0603857040405273, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.7729417085647583, 'logps/chosen': -170.007568359375, 'logps/rejected': -619.1396484375, 'logps/ref_chosen': -27.903043746948242, 'logps/ref_rejected': -92.12089538574219, 'logits/chosen': -2.750584125518799, 'logits/rejected': -3.075985908508301, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0020137112587690353, 'epsilon_dpo/loss_margin_mean': 384.9142150878906, 'epsilon_dpo/beta_margin_mean': 0.7729417085647583, 'epsilon_dpo/beta_margin_std': 0.6120246052742004, 'epsilon_dpo/beta_margin_grad_mean': -0.3272671103477478, 'epsilon_dpo/beta_margin_grad_std': 0.12577760219573975, 'kl/beta': 0.0020300038158893585, 'kl/avg_steps': 0.8125, 'epoch': 0.73} 73%|██████████████████████████████████████████████████████████████████████████████████▋ | 494/681 
[32:12<07:44, 2.48s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▊ | 495/681 [32:14<07:47, 2.51s/it] {'loss': 0.8, 'grad_norm': 40.98225021362305, 'learning_rate': 1.0660589091223854e-07, 'rewards/chosen': -0.3492244780063629, 'rewards/rejected': -1.1413087844848633, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.7920843362808228, 'logps/chosen': -212.49075317382812, 'logps/rejected': -669.8682861328125, 'logps/ref_chosen': -37.603515625, 'logps/ref_rejected': -97.60113525390625, 'logits/chosen': -2.7651305198669434, 'logits/rejected': -3.138665199279785, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.0019955940078943968, 'epsilon_dpo/loss_margin_mean': 397.37994384765625, 'epsilon_dpo/beta_margin_mean': 0.7920843362808228, 'epsilon_dpo/beta_margin_std': 0.502342939376831, 'epsilon_dpo/beta_margin_grad_mean': -0.3212648630142212, 'epsilon_dpo/beta_margin_grad_std': 0.1038970947265625, 'kl/beta': 0.0020136430393904448, 'kl/avg_steps': 0.90625, 'epoch': 0.73} 73%|██████████████████████████████████████████████████████████████████████████████████▊ | 495/681 [32:14<07:47, 2.51s/it] 73%|███████████████████████████████████████████████████████████████████████████████████ | 496/681 [32:17<07:48, 2.53s/it] {'loss': 0.8456, 'grad_norm': 36.258426666259766, 'learning_rate': 1.0555654214793722e-07, 'rewards/chosen': -0.3781554102897644, 'rewards/rejected': -1.0874574184417725, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.7093019485473633, 'logps/chosen': -235.90985107421875, 'logps/rejected': -641.947998046875, 'logps/ref_chosen': -45.088035583496094, 'logps/ref_rejected': -92.02516174316406, 'logits/chosen': -2.80366587638855, 'logits/rejected': -3.1190853118896484, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.001978294923901558, 'epsilon_dpo/loss_margin_mean': 359.1010437011719, 'epsilon_dpo/beta_margin_mean': 0.7093020081520081, 
'epsilon_dpo/beta_margin_std': 0.45665252208709717, 'epsilon_dpo/beta_margin_grad_mean': -0.3371340036392212, 'epsilon_dpo/beta_margin_grad_std': 0.09828732907772064, 'kl/beta': 0.001995558151975274, 'kl/avg_steps': 0.875, 'epoch': 0.73} 73%|███████████████████████████████████████████████████████████████████████████████████ | 496/681 [32:17<07:48, 2.53s/it] 73%|███████████████████████████████████████████████████████████████████████████████████▏ | 497/681 [32:19<07:40, 2.50s/it] {'loss': 0.8518, 'grad_norm': 31.69774627685547, 'learning_rate': 1.0451099960308374e-07, 'rewards/chosen': -0.2986891269683838, 'rewards/rejected': -1.003806471824646, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.7051173448562622, 'logps/chosen': -182.076416015625, 'logps/rejected': -593.888671875, 'logps/ref_chosen': -30.02985382080078, 'logps/ref_rejected': -81.73121643066406, 'logits/chosen': -2.7518420219421387, 'logits/rejected': -3.1248888969421387, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.001961135072633624, 'epsilon_dpo/loss_margin_mean': 360.11083984375, 'epsilon_dpo/beta_margin_mean': 0.7051173448562622, 'epsilon_dpo/beta_margin_std': 0.4776591360569, 'epsilon_dpo/beta_margin_grad_mean': -0.3387848436832428, 'epsilon_dpo/beta_margin_grad_std': 0.10064252465963364, 'kl/beta': 0.001978248590603471, 'kl/avg_steps': 0.875, 'epoch': 0.73} 73%|███████████████████████████████████████████████████████████████████████████████████▏ | 497/681 [32:19<07:40, 2.50s/it] 73%|███████████████████████████████████████████████████████████████████████████████████▎ | 498/681 [32:22<07:42, 2.53s/it] {'loss': 0.9234, 'grad_norm': 43.087093353271484, 'learning_rate': 1.0346929082869641e-07, 'rewards/chosen': -0.40171539783477783, 'rewards/rejected': -1.0523879528045654, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.6506726741790771, 'logps/chosen': -252.92184448242188, 'logps/rejected': -630.257080078125, 'logps/ref_chosen': -47.59989929199219, 
'logps/ref_rejected': -89.41059875488281, 'logits/chosen': -2.85072660446167, 'logits/rejected': -3.1245601177215576, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0019471884006634355, 'epsilon_dpo/loss_margin_mean': 335.5245056152344, 'epsilon_dpo/beta_margin_mean': 0.6506726145744324, 'epsilon_dpo/beta_margin_std': 0.6256114840507507, 'epsilon_dpo/beta_margin_grad_mean': -0.35530975461006165, 'epsilon_dpo/beta_margin_grad_std': 0.13022439181804657, 'kl/beta': 0.0019610889721661806, 'kl/avg_steps': 0.71875, 'epoch': 0.73} 73%|███████████████████████████████████████████████████████████████████████████████████▎ | 498/681 [32:22<07:42, 2.53s/it] 73%|███████████████████████████████████████████████████████████████████████████████████▌ | 499/681 [32:25<07:44, 2.55s/it] {'loss': 0.849, 'grad_norm': 35.18621826171875, 'learning_rate': 1.0243144327477013e-07, 'rewards/chosen': -0.3641180992126465, 'rewards/rejected': -1.1295077800750732, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.765389621257782, 'logps/chosen': -222.212158203125, 'logps/rejected': -693.384033203125, 'logps/ref_chosen': -34.13922882080078, 'logps/ref_rejected': -108.009521484375, 'logits/chosen': -2.809879779815674, 'logits/rejected': -3.214648485183716, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0019314672099426389, 'epsilon_dpo/loss_margin_mean': 397.3015441894531, 'epsilon_dpo/beta_margin_mean': 0.765389621257782, 'epsilon_dpo/beta_margin_std': 0.637798011302948, 'epsilon_dpo/beta_margin_grad_mean': -0.3319936990737915, 'epsilon_dpo/beta_margin_grad_std': 0.13098543882369995, 'kl/beta': 0.0019470942206680775, 'kl/avg_steps': 0.8125, 'epoch': 0.73} 73%|███████████████████████████████████████████████████████████████████████████████████▌ | 499/681 [32:25<07:44, 2.55s/it] 73%|███████████████████████████████████████████████████████████████████████████████████▋ | 500/681 [32:27<07:42, 2.55s/it] {'loss': 0.8394, 
'grad_norm': 37.49474334716797, 'learning_rate': 1.0139748428955333e-07, 'rewards/chosen': -0.37080904841423035, 'rewards/rejected': -1.1393522024154663, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.7685431241989136, 'logps/chosen': -230.16627502441406, 'logps/rejected': -695.9005737304688, 'logps/ref_chosen': -36.92897033691406, 'logps/ref_rejected': -100.48208618164062, 'logits/chosen': -2.810401201248169, 'logits/rejected': -3.235410451889038, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.0019146933918818831, 'epsilon_dpo/loss_margin_mean': 402.1811828613281, 'epsilon_dpo/beta_margin_mean': 0.7685431241989136, 'epsilon_dpo/beta_margin_std': 0.6095584630966187, 'epsilon_dpo/beta_margin_grad_mean': -0.3297097384929657, 'epsilon_dpo/beta_margin_grad_std': 0.12436074763536453, 'kl/beta': 0.0019314016681164503, 'kl/avg_steps': 0.875, 'epoch': 0.73} 73%|███████████████████████████████████████████████████████████████████████████████████▋ | 500/681 [32:27<07:42, 2.55s/it][INFO|trainer.py:4307] 2026-04-18 10:03:49,530 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 10:03:49,531 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 10:03:49,531 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 10:08:39,549 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 10:08:39,549 >> Batch size = 8 0%| | 0/73 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-600 [INFO|configuration_utils.py:419] 2026-04-18 10:09:33,959 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-600/config.json [INFO|configuration_utils.py:911] 2026-04-18 10:09:33,976 >> Configuration saved in 
/scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-600/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-18 10:10:31,283 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-600/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-18 10:10:31,322 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-600/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-18 10:10:31,334 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-600/special_tokens_map.json [INFO|trainer.py:4083] 2026-04-18 10:13:53,767 >> Deleting older checkpoint [/scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-200] due to args.save_total_limit 88%|██████████████████████████████████████████████████████████████████████████████████████████████████▊ | 601/681 [42:36<2:09:47, 97.35s/it] {'loss': 0.852, 'grad_norm': 48.359046936035156, 'learning_rate': 2.1301532877994742e-08, 'rewards/chosen': -0.35252028703689575, 'rewards/rejected': -1.2315044403076172, 'rewards/accuracies': 0.875, 'rewards/margins': 0.8789842128753662, 'logps/chosen': -406.67852783203125, 'logps/rejected': -1410.8968505859375, 'logps/ref_chosen': -34.16810607910156, 'logps/ref_rejected': -99.91683959960938, 'logits/chosen': -3.238035202026367, 'logits/rejected': -4.882007122039795, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0009403903386555612, 
'epsilon_dpo/loss_margin_mean': 938.4696044921875, 'epsilon_dpo/beta_margin_mean': 0.8789842128753662, 'epsilon_dpo/beta_margin_std': 0.901688814163208, 'epsilon_dpo/beta_margin_grad_mean': -0.32094281911849976, 'epsilon_dpo/beta_margin_grad_std': 0.16856704652309418, 'kl/beta': 0.0009474018006585538, 'kl/avg_steps': 0.75, 'epoch': 0.88} 88%|██████████████████████████████████████████████████████████████████████████████████████████████████▊ | 601/681 [42:36<2:09:47, 97.35s/it] 88%|███████████████████████████████████████████████████████████████████████████████████████████████████ | 602/681 [42:38<1:30:45, 68.93s/it] {'loss': 0.8003, 'grad_norm': 48.63134765625, 'learning_rate': 2.0786184285784298e-08, 'rewards/chosen': -0.25088733434677124, 'rewards/rejected': -1.12050199508667, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.8696146011352539, 'logps/chosen': -303.0639343261719, 'logps/rejected': -1296.177978515625, 'logps/ref_chosen': -34.405120849609375, 'logps/ref_rejected': -93.47988891601562, 'logits/chosen': -3.2103443145751953, 'logits/rejected': -4.763635635375977, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0009328021551482379, 'epsilon_dpo/loss_margin_mean': 934.039306640625, 'epsilon_dpo/beta_margin_mean': 0.8696146011352539, 'epsilon_dpo/beta_margin_std': 0.728697657585144, 'epsilon_dpo/beta_margin_grad_mean': -0.3170680105686188, 'epsilon_dpo/beta_margin_grad_std': 0.13237425684928894, 'kl/beta': 0.0009403491858392954, 'kl/avg_steps': 0.8125, 'epoch': 0.88} 88%|███████████████████████████████████████████████████████████████████████████████████████████████████ | 602/681 [42:38<1:30:45, 68.93s/it] 89%|███████████████████████████████████████████████████████████████████████████████████████████████████▏ | 603/681 [42:41<1:03:44, 49.03s/it] {'loss': 0.9099, 'grad_norm': 51.643531799316406, 'learning_rate': 2.0276875690788204e-08, 'rewards/chosen': -0.38833025097846985, 'rewards/rejected': -1.1105290651321411, 
'rewards/accuracies': 0.890625, 'rewards/margins': 0.7221988439559937, 'logps/chosen': -470.8554992675781, 'logps/rejected': -1308.265380859375, 'logps/ref_chosen': -53.07399368286133, 'logps/ref_rejected': -107.52302551269531, 'logits/chosen': -3.2944273948669434, 'logits/rejected': -4.612957000732422, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0009261587401852012, 'epsilon_dpo/loss_margin_mean': 782.9607543945312, 'epsilon_dpo/beta_margin_mean': 0.7221988439559937, 'epsilon_dpo/beta_margin_std': 0.7671189904212952, 'epsilon_dpo/beta_margin_grad_mean': -0.3474172055721283, 'epsilon_dpo/beta_margin_grad_std': 0.14739488065242767, 'kl/beta': 0.0009327704319730401, 'kl/avg_steps': 0.71875, 'epoch': 0.89} 89%|███████████████████████████████████████████████████████████████████████████████████████████████████▏ | 603/681 [42:41<1:03:44, 49.03s/it] 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████ | 604/681 [42:44<45:03, 35.11s/it] {'loss': 0.8561, 'grad_norm': 44.274085998535156, 'learning_rate': 1.977362051376158e-08, 'rewards/chosen': -0.3594130277633667, 'rewards/rejected': -1.2111127376556396, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8516995906829834, 'logps/chosen': -422.5206298828125, 'logps/rejected': -1419.331298828125, 'logps/ref_chosen': -32.21878433227539, 'logps/ref_rejected': -99.47515106201172, 'logits/chosen': -3.4287848472595215, 'logits/rejected': -4.8621602058410645, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0009189706179313362, 'epsilon_dpo/loss_margin_mean': 929.5543823242188, 'epsilon_dpo/beta_margin_mean': 0.8516996502876282, 'epsilon_dpo/beta_margin_std': 0.9336607456207275, 'epsilon_dpo/beta_margin_grad_mean': -0.3316219747066498, 'epsilon_dpo/beta_margin_grad_std': 0.14688213169574738, 'kl/beta': 0.0009261139784939587, 'kl/avg_steps': 0.78125, 'epoch': 0.89} 
89%|█████████████████████████████████████████████████████████████████████████████████████████████████████ | 604/681 [42:44<45:03, 35.11s/it] 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 605/681 [42:46<32:06, 25.35s/it] {'loss': 0.882, 'grad_norm': 41.927146911621094, 'learning_rate': 1.9276432015946446e-08, 'rewards/chosen': -0.3291019797325134, 'rewards/rejected': -1.110598087310791, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7814960479736328, 'logps/chosen': -402.01544189453125, 'logps/rejected': -1326.990234375, 'logps/ref_chosen': -42.914276123046875, 'logps/ref_rejected': -108.40269470214844, 'logits/chosen': -3.3077027797698975, 'logits/rejected': -4.713131904602051, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0009129956015385687, 'epsilon_dpo/loss_margin_mean': 859.486328125, 'epsilon_dpo/beta_margin_mean': 0.7814961075782776, 'epsilon_dpo/beta_margin_std': 0.8144083619117737, 'epsilon_dpo/beta_margin_grad_mean': -0.33933547139167786, 'epsilon_dpo/beta_margin_grad_std': 0.15050913393497467, 'kl/beta': 0.0009189348202198744, 'kl/avg_steps': 0.65625, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 605/681 [42:46<32:06, 25.35s/it] 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 606/681 [42:49<23:07, 18.50s/it] {'loss': 0.9904, 'grad_norm': 57.995399475097656, 'learning_rate': 1.8785323298722093e-08, 'rewards/chosen': -0.4276297986507416, 'rewards/rejected': -1.045759916305542, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.618130087852478, 'logps/chosen': -506.50384521484375, 'logps/rejected': -1257.863037109375, 'logps/ref_chosen': -37.19722366333008, 'logps/ref_rejected': -102.87519836425781, 'logits/chosen': -3.486124277114868, 'logits/rejected': -4.584151268005371, 'kl/p_epsilon_steps': 0.78125, 
'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0009078990551643074, 'epsilon_dpo/loss_margin_mean': 685.6812133789062, 'epsilon_dpo/beta_margin_mean': 0.618130087852478, 'epsilon_dpo/beta_margin_std': 0.7797470688819885, 'epsilon_dpo/beta_margin_grad_mean': -0.3660072982311249, 'epsilon_dpo/beta_margin_grad_std': 0.15605977177619934, 'kl/beta': 0.000912943622097373, 'kl/avg_steps': 0.5625, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 606/681 [42:49<23:07, 18.50s/it] 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 607/681 [42:51<17:00, 13.79s/it] {'loss': 0.9843, 'grad_norm': 55.2100830078125, 'learning_rate': 1.8300307303259904e-08, 'rewards/chosen': -0.4154084026813507, 'rewards/rejected': -1.046553611755371, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.6311452984809875, 'logps/chosen': -499.9947509765625, 'logps/rejected': -1246.0013427734375, 'logps/ref_chosen': -43.06529235839844, 'logps/ref_rejected': -84.84536743164062, 'logits/chosen': -3.4703445434570312, 'logits/rejected': -4.856854438781738, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.000903104490134865, 'epsilon_dpo/loss_margin_mean': 704.2265014648438, 'epsilon_dpo/beta_margin_mean': 0.6311452984809875, 'epsilon_dpo/beta_margin_std': 0.7819616198539734, 'epsilon_dpo/beta_margin_grad_mean': -0.36472412943840027, 'epsilon_dpo/beta_margin_grad_std': 0.16157901287078857, 'kl/beta': 0.0009078370640054345, 'kl/avg_steps': 0.53125, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 607/681 [42:52<17:00, 13.79s/it] 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 608/681 [42:54<12:39, 10.40s/it] {'loss': 0.843, 'grad_norm': 60.484779357910156, 'learning_rate': 1.7821396810182437e-08, 
'rewards/chosen': -0.2668021619319916, 'rewards/rejected': -1.0676581859588623, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8008559942245483, 'logps/chosen': -324.84686279296875, 'logps/rejected': -1294.824462890625, 'logps/ref_chosen': -27.870777130126953, 'logps/ref_rejected': -101.65553283691406, 'logits/chosen': -3.1860623359680176, 'logits/rejected': -4.672609806060791, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0008960742270573974, 'epsilon_dpo/loss_margin_mean': 896.19287109375, 'epsilon_dpo/beta_margin_mean': 0.8008559942245483, 'epsilon_dpo/beta_margin_std': 0.7176367044448853, 'epsilon_dpo/beta_margin_grad_mean': -0.3294691741466522, 'epsilon_dpo/beta_margin_grad_std': 0.13619713485240936, 'kl/beta': 0.0009030396468006074, 'kl/avg_steps': 0.78125, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 608/681 [42:54<12:39, 10.40s/it] 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 609/681 [42:57<09:40, 8.06s/it] {'loss': 0.829, 'grad_norm': 52.808658599853516, 'learning_rate': 1.7348604439226617e-08, 'rewards/chosen': -0.3561561107635498, 'rewards/rejected': -1.166949987411499, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8107938766479492, 'logps/chosen': -432.412841796875, 'logps/rejected': -1410.12060546875, 'logps/ref_chosen': -33.51665496826172, 'logps/ref_rejected': -96.93180084228516, 'logits/chosen': -3.266610860824585, 'logits/rejected': -4.786360740661621, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0008896880317479372, 'epsilon_dpo/loss_margin_mean': 914.2926025390625, 'epsilon_dpo/beta_margin_mean': 0.8107938766479492, 'epsilon_dpo/beta_margin_std': 0.689571738243103, 'epsilon_dpo/beta_margin_grad_mean': -0.3255755305290222, 'epsilon_dpo/beta_margin_grad_std': 0.13147076964378357, 'kl/beta': 0.0008960393606685102, 
'kl/avg_steps': 0.71875, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 609/681 [42:57<09:40, 8.06s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████ | 610/681 [42:59<07:37, 6.44s/it] {'loss': 0.9283, 'grad_norm': 50.597049713134766, 'learning_rate': 1.6881942648911074e-08, 'rewards/chosen': -0.34177446365356445, 'rewards/rejected': -1.1019885540008545, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7602142095565796, 'logps/chosen': -423.406982421875, 'logps/rejected': -1337.302490234375, 'logps/ref_chosen': -39.733856201171875, 'logps/ref_rejected': -88.57766723632812, 'logits/chosen': -3.319277286529541, 'logits/rejected': -4.948149681091309, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0008841731469146907, 'epsilon_dpo/loss_margin_mean': 865.0516967773438, 'epsilon_dpo/beta_margin_mean': 0.7602141499519348, 'epsilon_dpo/beta_margin_std': 0.892071008682251, 'epsilon_dpo/beta_margin_grad_mean': -0.34196192026138306, 'epsilon_dpo/beta_margin_grad_std': 0.17104457318782806, 'kl/beta': 0.0008896450162865222, 'kl/avg_steps': 0.625, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████ | 610/681 [42:59<07:37, 6.44s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 611/681 [43:02<06:11, 5.31s/it] {'loss': 0.9188, 'grad_norm': 54.84195327758789, 'learning_rate': 1.6421423736208e-08, 'rewards/chosen': -0.3749154508113861, 'rewards/rejected': -1.1266663074493408, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.7517509460449219, 'logps/chosen': -460.02740478515625, 'logps/rejected': -1376.174072265625, 'logps/ref_chosen': -34.78019332885742, 'logps/ref_rejected': -90.61834716796875, 'logits/chosen': -3.4497299194335938, 'logits/rejected': -4.820193290710449, 
'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0008777128532528877, 'epsilon_dpo/loss_margin_mean': 860.3085327148438, 'epsilon_dpo/beta_margin_mean': 0.7517509460449219, 'epsilon_dpo/beta_margin_std': 0.8516819477081299, 'epsilon_dpo/beta_margin_grad_mean': -0.3441890478134155, 'epsilon_dpo/beta_margin_grad_std': 0.1642007678747177, 'kl/beta': 0.0008841192466206849, 'kl/avg_steps': 0.734375, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 611/681 [43:02<06:11, 5.31s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 612/681 [43:04<05:07, 4.45s/it] {'loss': 0.8826, 'grad_norm': 41.552490234375, 'learning_rate': 1.5967059836219042e-08, 'rewards/chosen': -0.3019838333129883, 'rewards/rejected': -1.0831260681152344, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7811422348022461, 'logps/chosen': -380.4996337890625, 'logps/rejected': -1338.08837890625, 'logps/ref_chosen': -35.333831787109375, 'logps/ref_rejected': -93.14432525634766, 'logits/chosen': -3.163311004638672, 'logits/rejected': -4.886482238769531, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0008717270102351904, 'epsilon_dpo/loss_margin_mean': 899.7782592773438, 'epsilon_dpo/beta_margin_mean': 0.7811422348022461, 'epsilon_dpo/beta_margin_std': 0.8147025108337402, 'epsilon_dpo/beta_margin_grad_mean': -0.33847570419311523, 'epsilon_dpo/beta_margin_grad_std': 0.15062828361988068, 'kl/beta': 0.0008776738541200757, 'kl/avg_steps': 0.6875, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 612/681 [43:04<05:07, 4.45s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 613/681 [43:07<04:24, 3.90s/it] {'loss': 0.7942, 'grad_norm': 61.14183044433594, 'learning_rate': 
1.551886292185553e-08, 'rewards/chosen': -0.33134663105010986, 'rewards/rejected': -1.1872930526733398, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8559463620185852, 'logps/chosen': -418.6805114746094, 'logps/rejected': -1487.4290771484375, 'logps/ref_chosen': -36.464019775390625, 'logps/ref_rejected': -113.0091781616211, 'logits/chosen': -3.458996057510376, 'logits/rejected': -4.9003472328186035, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0008652299293316901, 'epsilon_dpo/loss_margin_mean': 992.203369140625, 'epsilon_dpo/beta_margin_mean': 0.8559463620185852, 'epsilon_dpo/beta_margin_std': 0.6566179990768433, 'epsilon_dpo/beta_margin_grad_mean': -0.3142598271369934, 'epsilon_dpo/beta_margin_grad_std': 0.12845391035079956, 'kl/beta': 0.0008716810261830688, 'kl/avg_steps': 0.75, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 613/681 [43:07<04:24, 3.90s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 614/681 [43:10<03:54, 3.51s/it] {'loss': 0.8482, 'grad_norm': 48.429771423339844, 'learning_rate': 1.507684480352292e-08, 'rewards/chosen': -0.3362823724746704, 'rewards/rejected': -1.1700525283813477, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8337700366973877, 'logps/chosen': -424.6036376953125, 'logps/rejected': -1475.20361328125, 'logps/ref_chosen': -34.81976318359375, 'logps/ref_rejected': -111.12577819824219, 'logits/chosen': -3.3637969493865967, 'logits/rejected': -4.854522705078125, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0008593298261985183, 'epsilon_dpo/loss_margin_mean': 974.2940063476562, 'epsilon_dpo/beta_margin_mean': 0.8337700366973877, 'epsilon_dpo/beta_margin_std': 0.7923649549484253, 'epsilon_dpo/beta_margin_grad_mean': -0.3246811330318451, 'epsilon_dpo/beta_margin_grad_std': 0.15554648637771606, 'kl/beta': 
0.0008651920943520963, 'kl/avg_steps': 0.6875, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 614/681 [43:10<03:54, 3.51s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 615/681 [43:12<03:34, 3.25s/it] {'loss': 0.8908, 'grad_norm': 52.556758880615234, 'learning_rate': 1.4641017128809801e-08, 'rewards/chosen': -0.3474053144454956, 'rewards/rejected': -1.126778483390808, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7793731689453125, 'logps/chosen': -447.3005676269531, 'logps/rejected': -1424.577392578125, 'logps/ref_chosen': -41.42036819458008, 'logps/ref_rejected': -101.49702453613281, 'logits/chosen': -3.465533971786499, 'logits/rejected': -4.829683303833008, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0008531937492080033, 'epsilon_dpo/loss_margin_mean': 917.2001342773438, 'epsilon_dpo/beta_margin_mean': 0.7793731689453125, 'epsilon_dpo/beta_margin_std': 0.8427423238754272, 'epsilon_dpo/beta_margin_grad_mean': -0.3395027816295624, 'epsilon_dpo/beta_margin_grad_std': 0.15511348843574524, 'kl/beta': 0.0008592845406383276, 'kl/avg_steps': 0.71875, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 615/681 [43:12<03:34, 3.25s/it] 90%|███████████████████████████████████████████████████████████████████████████████████████████████████████ | 616/681 [43:15<03:18, 3.05s/it] {'loss': 1.0265, 'grad_norm': 59.032936096191406, 'learning_rate': 1.4211391382180637e-08, 'rewards/chosen': -0.3976947069168091, 'rewards/rejected': -0.9945331811904907, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.5968384146690369, 'logps/chosen': -511.3292236328125, 'logps/rejected': -1254.6151123046875, 'logps/ref_chosen': -45.615753173828125, 'logps/ref_rejected': -80.37959289550781, 'logits/chosen': -3.490535259246826, 
'logits/rejected': -4.813992500305176, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.000848971598315984, 'epsilon_dpo/loss_margin_mean': 708.5220336914062, 'epsilon_dpo/beta_margin_mean': 0.5968384146690369, 'epsilon_dpo/beta_margin_std': 0.8491699695587158, 'epsilon_dpo/beta_margin_grad_mean': -0.3760168254375458, 'epsilon_dpo/beta_margin_grad_std': 0.1653953492641449, 'kl/beta': 0.0008531524799764156, 'kl/avg_steps': 0.5, 'epoch': 0.9} 90%|███████████████████████████████████████████████████████████████████████████████████████████████████████ | 616/681 [43:15<03:18, 3.05s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 617/681 [43:17<03:07, 2.93s/it] {'loss': 1.088, 'grad_norm': 69.23668670654297, 'learning_rate': 1.378797888467345e-08, 'rewards/chosen': -0.4543150067329407, 'rewards/rejected': -0.927322268486023, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.4730072617530823, 'logps/chosen': -583.8492431640625, 'logps/rejected': -1167.814208984375, 'logps/ref_chosen': -50.210060119628906, 'logps/ref_rejected': -69.55174255371094, 'logits/chosen': -3.5469822883605957, 'logits/rejected': -4.843328475952148, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'epsilon_dpo/beta': 0.0008452784968540072, 'epsilon_dpo/loss_margin_mean': 564.623291015625, 'epsilon_dpo/beta_margin_mean': 0.4730072617530823, 'epsilon_dpo/beta_margin_std': 0.7559821605682373, 'epsilon_dpo/beta_margin_grad_mean': -0.3991197347640991, 'epsilon_dpo/beta_margin_grad_std': 0.14801205694675446, 'kl/beta': 0.0008489079191349447, 'kl/avg_steps': 0.4375, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 617/681 [43:18<03:07, 2.93s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 618/681 [43:20<03:02, 2.89s/it] {'loss': 0.9866, 'grad_norm': 
48.00564193725586, 'learning_rate': 1.3370790793601371e-08, 'rewards/chosen': -0.41405919194221497, 'rewards/rejected': -1.0804648399353027, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6664056181907654, 'logps/chosen': -533.0503540039062, 'logps/rejected': -1387.2470703125, 'logps/ref_chosen': -43.185306549072266, 'logps/ref_rejected': -98.49762725830078, 'logits/chosen': -3.4216156005859375, 'logits/rejected': -4.760526657104492, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0008396140765398741, 'epsilon_dpo/loss_margin_mean': 798.8843383789062, 'epsilon_dpo/beta_margin_mean': 0.6664056181907654, 'epsilon_dpo/beta_margin_std': 0.8637328147888184, 'epsilon_dpo/beta_margin_grad_mean': -0.35606545209884644, 'epsilon_dpo/beta_margin_grad_std': 0.17117975652217865, 'kl/beta': 0.0008452101610600948, 'kl/avg_steps': 0.671875, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 618/681 [43:20<03:02, 2.89s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 619/681 [43:23<02:54, 2.82s/it] {'loss': 0.891, 'grad_norm': 55.89384460449219, 'learning_rate': 1.2959838102258535e-08, 'rewards/chosen': -0.37517303228378296, 'rewards/rejected': -1.1234097480773926, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.7482366561889648, 'logps/chosen': -482.2840576171875, 'logps/rejected': -1450.6636962890625, 'logps/ref_chosen': -33.69963836669922, 'logps/ref_rejected': -101.2354736328125, 'logits/chosen': -3.394374370574951, 'logits/rejected': -4.941725254058838, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0008336182800121605, 'epsilon_dpo/loss_margin_mean': 900.8438110351562, 'epsilon_dpo/beta_margin_mean': 0.7482366561889648, 'epsilon_dpo/beta_margin_std': 0.7654976844787598, 'epsilon_dpo/beta_margin_grad_mean': -0.34246736764907837, 
'epsilon_dpo/beta_margin_grad_std': 0.14548242092132568, 'kl/beta': 0.0008395693148486316, 'kl/avg_steps': 0.71875, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 619/681 [43:23<02:54, 2.82s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 620/681 [43:25<02:44, 2.70s/it] {'loss': 1.0273, 'grad_norm': 51.175804138183594, 'learning_rate': 1.2555131639630567e-08, 'rewards/chosen': -0.4475584030151367, 'rewards/rejected': -1.0251778364181519, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.5776193737983704, 'logps/chosen': -579.7201538085938, 'logps/rejected': -1323.3944091796875, 'logps/ref_chosen': -42.774513244628906, 'logps/ref_rejected': -84.47439575195312, 'logits/chosen': -3.558187961578369, 'logits/rejected': -4.871739387512207, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0008292324491776526, 'epsilon_dpo/loss_margin_mean': 701.974365234375, 'epsilon_dpo/beta_margin_mean': 0.5776193737983704, 'epsilon_dpo/beta_margin_std': 0.8028265833854675, 'epsilon_dpo/beta_margin_grad_mean': -0.37484031915664673, 'epsilon_dpo/beta_margin_grad_std': 0.15843062102794647, 'kl/beta': 0.0008335779421031475, 'kl/avg_steps': 0.53125, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 620/681 [43:25<02:44, 2.70s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 621/681 [43:28<02:39, 2.66s/it] {'loss': 0.8891, 'grad_norm': 52.058837890625, 'learning_rate': 1.2156682070109086e-08, 'rewards/chosen': -0.3498554825782776, 'rewards/rejected': -1.1445496082305908, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.794694185256958, 'logps/chosen': -459.840576171875, 'logps/rejected': -1485.9166259765625, 'logps/ref_chosen': -37.82067108154297, 
'logps/ref_rejected': -94.49537658691406, 'logits/chosen': -3.53486967086792, 'logits/rejected': -5.024101257324219, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0008243321790359914, 'epsilon_dpo/loss_margin_mean': 969.4013061523438, 'epsilon_dpo/beta_margin_mean': 0.794694185256958, 'epsilon_dpo/beta_margin_std': 0.8520438075065613, 'epsilon_dpo/beta_margin_grad_mean': -0.3368692100048065, 'epsilon_dpo/beta_margin_grad_std': 0.16272993385791779, 'kl/beta': 0.000829172960948199, 'kl/avg_steps': 0.59375, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 621/681 [43:28<02:39, 2.66s/it] 91%|████████████████████████████████████████████████████████████████████████████████████████████████████████ | 622/681 [43:31<02:37, 2.67s/it] {'loss': 0.9872, 'grad_norm': 49.092411041259766, 'learning_rate': 1.1764499893210878e-08, 'rewards/chosen': -0.4212498664855957, 'rewards/rejected': -1.0025246143341064, 'rewards/accuracies': 0.875, 'rewards/margins': 0.5812746286392212, 'logps/chosen': -553.1505126953125, 'logps/rejected': -1318.96630859375, 'logps/ref_chosen': -39.961334228515625, 'logps/ref_rejected': -92.28267669677734, 'logits/chosen': -3.374379873275757, 'logits/rejected': -4.7842888832092285, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0008184361504390836, 'epsilon_dpo/loss_margin_mean': 713.4944458007812, 'epsilon_dpo/beta_margin_mean': 0.581274688243866, 'epsilon_dpo/beta_margin_std': 0.6987016797065735, 'epsilon_dpo/beta_margin_grad_mean': -0.3743629455566406, 'epsilon_dpo/beta_margin_grad_std': 0.13544589281082153, 'kl/beta': 0.0008242788026109338, 'kl/avg_steps': 0.71875, 'epoch': 0.91} 91%|████████████████████████████████████████████████████████████████████████████████████████████████████████ | 622/681 [43:31<02:37, 2.67s/it] 
91%|████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 623/681 [43:33<02:29, 2.59s/it] {'loss': 1.0243, 'grad_norm': 50.80109405517578, 'learning_rate': 1.1378595443300998e-08, 'rewards/chosen': -0.432176798582077, 'rewards/rejected': -1.0211231708526611, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.5889463424682617, 'logps/chosen': -577.177001953125, 'logps/rejected': -1348.3131103515625, 'logps/ref_chosen': -49.0926513671875, 'logps/ref_rejected': -91.09358215332031, 'logits/chosen': -3.542423963546753, 'logits/rejected': -4.91584587097168, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.000813874474260956, 'epsilon_dpo/loss_margin_mean': 729.1351928710938, 'epsilon_dpo/beta_margin_mean': 0.5889464020729065, 'epsilon_dpo/beta_margin_std': 0.8087703585624695, 'epsilon_dpo/beta_margin_grad_mean': -0.3725954592227936, 'epsilon_dpo/beta_margin_grad_std': 0.1656135767698288, 'kl/beta': 0.0008183965692296624, 'kl/avg_steps': 0.5625, 'epoch': 0.91} 91%|████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 623/681 [43:33<02:29, 2.59s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 624/681 [43:36<02:31, 2.66s/it] {'loss': 0.9153, 'grad_norm': 48.523738861083984, 'learning_rate': 1.0998978889320582e-08, 'rewards/chosen': -0.3814919590950012, 'rewards/rejected': -1.1524710655212402, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7709789872169495, 'logps/chosen': -515.30517578125, 'logps/rejected': -1532.546142578125, 'logps/ref_chosen': -46.57392501831055, 'logps/ref_rejected': -105.08536529541016, 'logits/chosen': -3.3430838584899902, 'logits/rejected': -4.918627738952637, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.000808813376352191, 'epsilon_dpo/loss_margin_mean': 958.7295532226562, 
'epsilon_dpo/beta_margin_mean': 0.7709789872169495, 'epsilon_dpo/beta_margin_std': 0.8832123875617981, 'epsilon_dpo/beta_margin_grad_mean': -0.34141990542411804, 'epsilon_dpo/beta_margin_grad_std': 0.16724152863025665, 'kl/beta': 0.0008138188859447837, 'kl/avg_steps': 0.625, 'epoch': 0.92} 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 624/681 [43:36<02:31, 2.66s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 625/681 [43:38<02:26, 2.61s/it] {'loss': 0.902, 'grad_norm': 49.30683135986328, 'learning_rate': 1.0625660234518913e-08, 'rewards/chosen': -0.38019776344299316, 'rewards/rejected': -1.0790600776672363, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.6988623142242432, 'logps/chosen': -515.078857421875, 'logps/rejected': -1437.4361572265625, 'logps/ref_chosen': -43.60509490966797, 'logps/ref_rejected': -92.33833312988281, 'logits/chosen': -3.506286144256592, 'logits/rejected': -4.933239936828613, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0008035369100980461, 'epsilon_dpo/loss_margin_mean': 873.6240844726562, 'epsilon_dpo/beta_margin_mean': 0.6988623738288879, 'epsilon_dpo/beta_margin_std': 0.6729094386100769, 'epsilon_dpo/beta_margin_grad_mean': -0.34749388694763184, 'epsilon_dpo/beta_margin_grad_std': 0.13651293516159058, 'kl/beta': 0.0008087640744633973, 'kl/avg_steps': 0.65625, 'epoch': 0.92} 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 625/681 [43:38<02:26, 2.61s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 626/681 [43:41<02:24, 2.63s/it] {'loss': 0.9691, 'grad_norm': 50.068603515625, 'learning_rate': 1.0258649316189721e-08, 'rewards/chosen': -0.43541914224624634, 'rewards/rejected': -1.1092710494995117, 'rewards/accuracies': 0.8125, 
'rewards/margins': 0.6738518476486206, 'logps/chosen': -593.0721435546875, 'logps/rejected': -1494.8515625, 'logps/ref_chosen': -50.95122528076172, 'logps/ref_rejected': -103.29271697998047, 'logits/chosen': -3.526883363723755, 'logits/rejected': -4.905704021453857, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0007988003198988736, 'epsilon_dpo/loss_margin_mean': 849.4379272460938, 'epsilon_dpo/beta_margin_mean': 0.6738518476486206, 'epsilon_dpo/beta_margin_std': 0.8334734439849854, 'epsilon_dpo/beta_margin_grad_mean': -0.3574766218662262, 'epsilon_dpo/beta_margin_grad_std': 0.16741888225078583, 'kl/beta': 0.0008034911588765681, 'kl/avg_steps': 0.59375, 'epoch': 0.92} 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 626/681 [43:41<02:24, 2.63s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 627/681 [43:44<02:21, 2.63s/it] {'loss': 0.8381, 'grad_norm': 55.83228302001953, 'learning_rate': 9.897955805412e-09, 'rewards/chosen': -0.24587570130825043, 'rewards/rejected': -1.0799684524536133, 'rewards/accuracies': 0.875, 'rewards/margins': 0.8340927362442017, 'logps/chosen': -337.388427734375, 'logps/rejected': -1478.910400390625, 'logps/ref_chosen': -28.80577850341797, 'logps/ref_rejected': -114.96311950683594, 'logits/chosen': -3.3846006393432617, 'logits/rejected': -4.88566780090332, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.00079333659959957, 'epsilon_dpo/loss_margin_mean': 1055.364501953125, 'epsilon_dpo/beta_margin_mean': 0.8340927958488464, 'epsilon_dpo/beta_margin_std': 0.7881202101707458, 'epsilon_dpo/beta_margin_grad_mean': -0.32665178179740906, 'epsilon_dpo/beta_margin_grad_std': 0.14166025817394257, 'kl/beta': 0.0007987486314959824, 'kl/avg_steps': 0.6875, 'epoch': 0.92} 
92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 627/681 [43:44<02:21, 2.63s/it] 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 628/681 [43:46<02:19, 2.63s/it] {'loss': 0.9161, 'grad_norm': 49.38209915161133, 'learning_rate': 9.543589206795238e-09, 'rewards/chosen': -0.42482954263687134, 'rewards/rejected': -1.1567974090576172, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7319678068161011, 'logps/chosen': -581.771484375, 'logps/rejected': -1578.183837890625, 'logps/ref_chosen': -45.28186798095703, 'logps/ref_rejected': -108.524169921875, 'logits/chosen': -3.572722911834717, 'logits/rejected': -4.90597677230835, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0007881674682721496, 'epsilon_dpo/loss_margin_mean': 933.1701049804688, 'epsilon_dpo/beta_margin_mean': 0.7319678068161011, 'epsilon_dpo/beta_margin_std': 0.8163183331489563, 'epsilon_dpo/beta_margin_grad_mean': -0.34786510467529297, 'epsilon_dpo/beta_margin_grad_std': 0.15347820520401, 'kl/beta': 0.0007932946900837123, 'kl/avg_steps': 0.65625, 'epoch': 0.92} 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 628/681 [43:46<02:19, 2.63s/it] 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 629/681 [43:49<02:18, 2.66s/it] {'loss': 0.9092, 'grad_norm': 46.65474319458008, 'learning_rate': 9.19555885822887e-09, 'rewards/chosen': -0.3272828459739685, 'rewards/rejected': -1.0314680337905884, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7041851282119751, 'logps/chosen': -457.57720947265625, 'logps/rejected': -1415.8564453125, 'logps/ref_chosen': -41.636070251464844, 'logps/ref_rejected': -96.60995483398438, 'logits/chosen': -3.379412889480591, 'logits/rejected': -4.922481536865234, 'kl/p_epsilon_steps': 0.765625, 
'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.00078401411883533, 'epsilon_dpo/loss_margin_mean': 903.3053588867188, 'epsilon_dpo/beta_margin_mean': 0.7041851878166199, 'epsilon_dpo/beta_margin_std': 0.7174273133277893, 'epsilon_dpo/beta_margin_grad_mean': -0.34917938709259033, 'epsilon_dpo/beta_margin_grad_std': 0.14244189858436584, 'kl/beta': 0.0007881226483732462, 'kl/avg_steps': 0.53125, 'epoch': 0.92} 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 629/681 [43:49<02:18, 2.66s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 630/681 [43:52<02:15, 2.65s/it] {'loss': 0.9169, 'grad_norm': 40.582828521728516, 'learning_rate': 8.85387393063622e-09, 'rewards/chosen': -0.26877981424331665, 'rewards/rejected': -0.9388737678527832, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.6700939536094666, 'logps/chosen': -374.94354248046875, 'logps/rejected': -1298.19873046875, 'logps/ref_chosen': -31.366878509521484, 'logps/ref_rejected': -90.5899658203125, 'logits/chosen': -3.2850284576416016, 'logits/rejected': -4.819733619689941, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0007786460919305682, 'epsilon_dpo/loss_margin_mean': 864.0321655273438, 'epsilon_dpo/beta_margin_mean': 0.6700939536094666, 'epsilon_dpo/beta_margin_std': 0.6649956703186035, 'epsilon_dpo/beta_margin_grad_mean': -0.3540729284286499, 'epsilon_dpo/beta_margin_grad_std': 0.13038687407970428, 'kl/beta': 0.0007839578902348876, 'kl/avg_steps': 0.6875, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 630/681 [43:52<02:15, 2.65s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 631/681 [43:54<02:11, 2.63s/it] {'loss': 0.9527, 'grad_norm': 46.966888427734375, 'learning_rate': 
8.518543427732949e-09, 'rewards/chosen': -0.34111732244491577, 'rewards/rejected': -0.9839361906051636, 'rewards/accuracies': 0.875, 'rewards/margins': 0.642818808555603, 'logps/chosen': -483.23895263671875, 'logps/rejected': -1360.65087890625, 'logps/ref_chosen': -44.379119873046875, 'logps/ref_rejected': -86.64693450927734, 'logits/chosen': -3.4160025119781494, 'logits/rejected': -4.869636535644531, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0007730860379524529, 'epsilon_dpo/loss_margin_mean': 835.1441040039062, 'epsilon_dpo/beta_margin_mean': 0.6428188681602478, 'epsilon_dpo/beta_margin_std': 0.7184677124023438, 'epsilon_dpo/beta_margin_grad_mean': -0.3612503111362457, 'epsilon_dpo/beta_margin_grad_std': 0.14340829849243164, 'kl/beta': 0.0007786049391143024, 'kl/avg_steps': 0.71875, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 631/681 [43:54<02:11, 2.63s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 632/681 [43:57<02:04, 2.54s/it] {'loss': 0.9639, 'grad_norm': 54.08295440673828, 'learning_rate': 8.189576185789637e-09, 'rewards/chosen': -0.35203373432159424, 'rewards/rejected': -0.9687473773956299, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6167136430740356, 'logps/chosen': -500.601318359375, 'logps/rejected': -1354.220947265625, 'logps/ref_chosen': -43.92643737792969, 'logps/ref_rejected': -90.67631530761719, 'logits/chosen': -3.6578102111816406, 'logits/rejected': -5.017969131469727, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0007680523558519781, 'epsilon_dpo/loss_margin_mean': 806.8697509765625, 'epsilon_dpo/beta_margin_mean': 0.6167137026786804, 'epsilon_dpo/beta_margin_std': 0.6938005685806274, 'epsilon_dpo/beta_margin_grad_mean': -0.3662717640399933, 'epsilon_dpo/beta_margin_grad_std': 0.13854503631591797, 
'kl/beta': 0.0007730486686341465, 'kl/avg_steps': 0.65625, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 632/681 [43:57<02:04, 2.54s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 633/681 [43:59<02:00, 2.52s/it] {'loss': 0.9689, 'grad_norm': 49.735836029052734, 'learning_rate': 7.866980873399015e-09, 'rewards/chosen': -0.3694680333137512, 'rewards/rejected': -0.9572535157203674, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.5877854824066162, 'logps/chosen': -524.2335205078125, 'logps/rejected': -1356.517822265625, 'logps/ref_chosen': -42.23455047607422, 'logps/ref_rejected': -100.14579772949219, 'logits/chosen': -3.5298781394958496, 'logits/rejected': -4.785070419311523, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0007630449254065752, 'epsilon_dpo/loss_margin_mean': 774.373046875, 'epsilon_dpo/beta_margin_mean': 0.5877854824066162, 'epsilon_dpo/beta_margin_std': 0.6214975118637085, 'epsilon_dpo/beta_margin_grad_mean': -0.36804521083831787, 'epsilon_dpo/beta_margin_grad_std': 0.1329408884048462, 'kl/beta': 0.000768008641898632, 'kl/avg_steps': 0.65625, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 633/681 [43:59<02:00, 2.52s/it] 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 634/681 [44:02<01:59, 2.55s/it] {'loss': 0.933, 'grad_norm': 43.45764923095703, 'learning_rate': 7.550765991247654e-09, 'rewards/chosen': -0.35321560502052307, 'rewards/rejected': -1.0450574159622192, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.6918417811393738, 'logps/chosen': -502.7445068359375, 'logps/rejected': -1493.4161376953125, 'logps/ref_chosen': -39.36439895629883, 'logps/ref_rejected': -113.15769958496094, 'logits/chosen': 
-3.4708662033081055, 'logits/rejected': -4.9770402908325195, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0007587854051962495, 'epsilon_dpo/loss_margin_mean': 916.8782958984375, 'epsilon_dpo/beta_margin_mean': 0.6918417811393738, 'epsilon_dpo/beta_margin_std': 0.782641589641571, 'epsilon_dpo/beta_margin_grad_mean': -0.3558174967765808, 'epsilon_dpo/beta_margin_grad_std': 0.14750294387340546, 'kl/beta': 0.0007630014442838728, 'kl/avg_steps': 0.5625, 'epoch': 0.93} 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 634/681 [44:02<01:59, 2.55s/it] 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 635/681 [44:04<01:58, 2.58s/it] {'loss': 1.0411, 'grad_norm': 51.91278839111328, 'learning_rate': 7.240939871891699e-09, 'rewards/chosen': -0.38917070627212524, 'rewards/rejected': -0.8954064846038818, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.5062357783317566, 'logps/chosen': -562.4381103515625, 'logps/rejected': -1277.62646484375, 'logps/ref_chosen': -49.88642120361328, 'logps/ref_rejected': -89.69390869140625, 'logits/chosen': -3.523716926574707, 'logits/rejected': -4.783617973327637, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0007538297213613987, 'epsilon_dpo/loss_margin_mean': 675.380859375, 'epsilon_dpo/beta_margin_mean': 0.5062357783317566, 'epsilon_dpo/beta_margin_std': 0.6855813264846802, 'epsilon_dpo/beta_margin_grad_mean': -0.39024749398231506, 'epsilon_dpo/beta_margin_grad_std': 0.13376963138580322, 'kl/beta': 0.0007587335421703756, 'kl/avg_steps': 0.65625, 'epoch': 0.93} 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 635/681 [44:04<01:58, 2.58s/it] 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 636/681 [44:07<01:59, 
2.64s/it] {'loss': 1.043, 'grad_norm': 52.333683013916016, 'learning_rate': 6.937510679537628e-09, 'rewards/chosen': -0.39245936274528503, 'rewards/rejected': -0.8954870104789734, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.5030276775360107, 'logps/chosen': -566.8447875976562, 'logps/rejected': -1282.334716796875, 'logps/ref_chosen': -46.58656692504883, 'logps/ref_rejected': -86.21536254882812, 'logits/chosen': -3.715489387512207, 'logits/rejected': -4.909304618835449, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0007493861485272646, 'epsilon_dpo/loss_margin_mean': 675.861083984375, 'epsilon_dpo/beta_margin_mean': 0.5030276775360107, 'epsilon_dpo/beta_margin_std': 0.6626613140106201, 'epsilon_dpo/beta_margin_grad_mean': -0.3881148397922516, 'epsilon_dpo/beta_margin_grad_std': 0.1402616947889328, 'kl/beta': 0.0007537868223153055, 'kl/avg_steps': 0.59375, 'epoch': 0.93} 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 636/681 [44:07<01:59, 2.64s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 637/681 [44:10<01:55, 2.64s/it] {'loss': 0.9332, 'grad_norm': 41.60480880737305, 'learning_rate': 6.640486409826785e-09, 'rewards/chosen': -0.33350521326065063, 'rewards/rejected': -1.0211637020111084, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.6876585483551025, 'logps/chosen': -483.88134765625, 'logps/rejected': -1478.459228515625, 'logps/ref_chosen': -37.54460144042969, 'logps/ref_rejected': -103.94780731201172, 'logits/chosen': -3.4943830966949463, 'logits/rejected': -4.821347236633301, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0007449628901667893, 'epsilon_dpo/loss_margin_mean': 928.1746215820312, 'epsilon_dpo/beta_margin_mean': 0.6876585483551025, 'epsilon_dpo/beta_margin_std': 0.7672451138496399, 'epsilon_dpo/beta_margin_grad_mean': 
-0.3551463186740875, 'epsilon_dpo/beta_margin_grad_std': 0.14750678837299347, 'kl/beta': 0.0007493376033380628, 'kl/avg_steps': 0.59375, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 637/681 [44:10<01:55, 2.64s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 638/681 [44:12<01:51, 2.59s/it] {'loss': 0.9433, 'grad_norm': 38.22664260864258, 'learning_rate': 6.349874889624962e-09, 'rewards/chosen': -0.30027681589126587, 'rewards/rejected': -0.9476705193519592, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.6473937034606934, 'logps/chosen': -439.9283447265625, 'logps/rejected': -1368.220947265625, 'logps/ref_chosen': -35.51661682128906, 'logps/ref_rejected': -85.09121704101562, 'logits/chosen': -3.2752623558044434, 'logits/rejected': -5.041168212890625, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0007394017884507775, 'epsilon_dpo/loss_margin_mean': 878.7179565429688, 'epsilon_dpo/beta_margin_mean': 0.6473937034606934, 'epsilon_dpo/beta_margin_std': 0.7178838849067688, 'epsilon_dpo/beta_margin_grad_mean': -0.3611353039741516, 'epsilon_dpo/beta_margin_grad_std': 0.13404038548469543, 'kl/beta': 0.0007449146942235529, 'kl/avg_steps': 0.75, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 638/681 [44:12<01:51, 2.59s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 639/681 [44:15<01:48, 2.60s/it] {'loss': 1.0788, 'grad_norm': 46.616573333740234, 'learning_rate': 6.065683776815933e-09, 'rewards/chosen': -0.3750152587890625, 'rewards/rejected': -0.8208526372909546, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.4458373486995697, 'logps/chosen': -551.689208984375, 'logps/rejected': -1200.0047607421875, 'logps/ref_chosen': 
-44.109619140625, 'logps/ref_rejected': -81.57601928710938, 'logits/chosen': -3.443708896636963, 'logits/rejected': -4.675836563110352, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0007355150883086026, 'epsilon_dpo/loss_margin_mean': 610.8491821289062, 'epsilon_dpo/beta_margin_mean': 0.4458373486995697, 'epsilon_dpo/beta_margin_std': 0.6368516087532043, 'epsilon_dpo/beta_margin_grad_mean': -0.400001585483551, 'epsilon_dpo/beta_margin_grad_std': 0.13325192034244537, 'kl/beta': 0.0007393694249913096, 'kl/avg_steps': 0.53125, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 639/681 [44:15<01:48, 2.60s/it] 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 640/681 [44:17<01:45, 2.58s/it] {'loss': 0.8862, 'grad_norm': 47.58893966674805, 'learning_rate': 5.7879205600998296e-09, 'rewards/chosen': -0.2974758744239807, 'rewards/rejected': -1.0234477519989014, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.7259718179702759, 'logps/chosen': -445.7485656738281, 'logps/rejected': -1517.7373046875, 'logps/ref_chosen': -40.14595413208008, 'logps/ref_rejected': -114.68016815185547, 'logits/chosen': -3.3290483951568604, 'logits/rejected': -4.9194793701171875, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0007302492158487439, 'epsilon_dpo/loss_margin_mean': 997.45458984375, 'epsilon_dpo/beta_margin_mean': 0.7259718775749207, 'epsilon_dpo/beta_margin_std': 0.7215431928634644, 'epsilon_dpo/beta_margin_grad_mean': -0.34541165828704834, 'epsilon_dpo/beta_margin_grad_std': 0.12778370082378387, 'kl/beta': 0.0007354622939601541, 'kl/avg_steps': 0.71875, 'epoch': 0.94} 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 640/681 [44:17<01:45, 2.58s/it] 
94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 641/681 [44:20<01:43, 2.59s/it] {'loss': 0.9185, 'grad_norm': 43.9137077331543, 'learning_rate': 5.516592558795746e-09, 'rewards/chosen': -0.3105979263782501, 'rewards/rejected': -1.0035767555236816, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6929788589477539, 'logps/chosen': -465.90545654296875, 'logps/rejected': -1480.5869140625, 'logps/ref_chosen': -39.17839050292969, 'logps/ref_rejected': -94.15284729003906, 'logits/chosen': -3.348452568054199, 'logits/rejected': -5.015802383422852, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0007254944066517055, 'epsilon_dpo/loss_margin_mean': 959.70703125, 'epsilon_dpo/beta_margin_mean': 0.6929788589477539, 'epsilon_dpo/beta_margin_std': 0.7391694188117981, 'epsilon_dpo/beta_margin_grad_mean': -0.3516106605529785, 'epsilon_dpo/beta_margin_grad_std': 0.14073027670383453, 'kl/beta': 0.0007302138837985694, 'kl/avg_steps': 0.65625, 'epoch': 0.94} 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 641/681 [44:20<01:43, 2.59s/it] 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 642/681 [44:23<01:42, 2.62s/it] {'loss': 0.8997, 'grad_norm': 42.03846740722656, 'learning_rate': 5.251706922648868e-09, 'rewards/chosen': -0.3134210705757141, 'rewards/rejected': -1.0650629997253418, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.7516418695449829, 'logps/chosen': -480.06494140625, 'logps/rejected': -1596.8970947265625, 'logps/ref_chosen': -46.66090393066406, 'logps/ref_rejected': -115.78807067871094, 'logits/chosen': -3.4749832153320312, 'logits/rejected': -5.020747184753418, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0007200841791927814, 'epsilon_dpo/loss_margin_mean': 1047.7049560546875, 
'epsilon_dpo/beta_margin_mean': 0.7516418695449829, 'epsilon_dpo/beta_margin_std': 0.7943530082702637, 'epsilon_dpo/beta_margin_grad_mean': -0.34033486247062683, 'epsilon_dpo/beta_margin_grad_std': 0.1520089954137802, 'kl/beta': 0.0007254530792124569, 'kl/avg_steps': 0.75, 'epoch': 0.94} 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 642/681 [44:23<01:42, 2.62s/it] 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 643/681 [44:25<01:37, 2.57s/it] {'loss': 0.9222, 'grad_norm': 39.312984466552734, 'learning_rate': 4.993270631642038e-09, 'rewards/chosen': -0.29885566234588623, 'rewards/rejected': -0.9357801079750061, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.6369244456291199, 'logps/chosen': -453.1264343261719, 'logps/rejected': -1403.9027099609375, 'logps/ref_chosen': -35.49954605102539, 'logps/ref_rejected': -92.612060546875, 'logits/chosen': -3.4138784408569336, 'logits/rejected': -4.991301536560059, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0007149488083086908, 'epsilon_dpo/loss_margin_mean': 893.6637573242188, 'epsilon_dpo/beta_margin_mean': 0.6369244456291199, 'epsilon_dpo/beta_margin_std': 0.594021201133728, 'epsilon_dpo/beta_margin_grad_mean': -0.3584926128387451, 'epsilon_dpo/beta_margin_grad_std': 0.11874385923147202, 'kl/beta': 0.0007200526888482273, 'kl/avg_steps': 0.71875, 'epoch': 0.94} 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 643/681 [44:25<01:37, 2.57s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 644/681 [44:28<01:37, 2.63s/it] {'loss': 0.9704, 'grad_norm': 46.13814926147461, 'learning_rate': 4.741290495811873e-09, 'rewards/chosen': -0.28092193603515625, 'rewards/rejected': -0.8519556522369385, 'rewards/accuracies': 
0.859375, 'rewards/margins': 0.5710337162017822, 'logps/chosen': -432.34515380859375, 'logps/rejected': -1295.53857421875, 'logps/ref_chosen': -37.63022232055664, 'logps/ref_rejected': -93.44629669189453, 'logits/chosen': -3.398604154586792, 'logits/rejected': -4.7896952629089355, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0007100701914168894, 'epsilon_dpo/loss_margin_mean': 807.3773193359375, 'epsilon_dpo/beta_margin_mean': 0.5710337162017822, 'epsilon_dpo/beta_margin_std': 0.5858953595161438, 'epsilon_dpo/beta_margin_grad_mean': -0.37228909134864807, 'epsilon_dpo/beta_margin_grad_std': 0.1232423335313797, 'kl/beta': 0.0007149142329581082, 'kl/avg_steps': 0.6875, 'epoch': 0.95} 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 644/681 [44:28<01:37, 2.63s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 645/681 [44:31<01:35, 2.65s/it] {'loss': 1.0346, 'grad_norm': 42.622047424316406, 'learning_rate': 4.495773155069299e-09, 'rewards/chosen': -0.35279667377471924, 'rewards/rejected': -0.8792734146118164, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.5264767408370972, 'logps/chosen': -535.257568359375, 'logps/rejected': -1353.5076904296875, 'logps/ref_chosen': -37.85113525390625, 'logps/ref_rejected': -105.40227508544922, 'logits/chosen': -3.5165293216705322, 'logits/rejected': -4.699129104614258, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0007061094511300325, 'epsilon_dpo/loss_margin_mean': 750.698974609375, 'epsilon_dpo/beta_margin_mean': 0.5264767408370972, 'epsilon_dpo/beta_margin_std': 0.6991389393806458, 'epsilon_dpo/beta_margin_grad_mean': -0.38581857085227966, 'epsilon_dpo/beta_margin_grad_std': 0.14517748355865479, 'kl/beta': 0.0007100327638909221, 'kl/avg_steps': 0.5625, 'epoch': 0.95} 
95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 645/681 [44:31<01:35, 2.65s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 646/681 [44:33<01:32, 2.63s/it] {'loss': 1.0461, 'grad_norm': 47.977237701416016, 'learning_rate': 4.256725079024553e-09, 'rewards/chosen': -0.33754080533981323, 'rewards/rejected': -0.8128384947776794, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.4752976894378662, 'logps/chosen': -520.402099609375, 'logps/rejected': -1243.4267578125, 'logps/ref_chosen': -41.30128860473633, 'logps/ref_rejected': -82.82234954833984, 'logits/chosen': -3.479154109954834, 'logits/rejected': -4.7768402099609375, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0007019391632638872, 'epsilon_dpo/loss_margin_mean': 681.503662109375, 'epsilon_dpo/beta_margin_mean': 0.4752976894378662, 'epsilon_dpo/beta_margin_std': 0.5920587778091431, 'epsilon_dpo/beta_margin_grad_mean': -0.39241456985473633, 'epsilon_dpo/beta_margin_grad_std': 0.129283145070076, 'kl/beta': 0.0007060611969791353, 'kl/avg_steps': 0.59375, 'epoch': 0.95} 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 646/681 [44:33<01:32, 2.63s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 647/681 [44:36<01:30, 2.67s/it] {'loss': 0.9654, 'grad_norm': 48.550132751464844, 'learning_rate': 4.024152566816791e-09, 'rewards/chosen': -0.2764018774032593, 'rewards/rejected': -0.8342074155807495, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.5578055381774902, 'logps/chosen': -431.2979736328125, 'logps/rejected': -1297.52197265625, 'logps/ref_chosen': -35.967567443847656, 'logps/ref_rejected': -98.74945068359375, 'logits/chosen': -3.453265905380249, 'logits/rejected': -4.754009246826172, 
'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0006964798085391521, 'epsilon_dpo/loss_margin_mean': 803.4420776367188, 'epsilon_dpo/beta_margin_mean': 0.5578054785728455, 'epsilon_dpo/beta_margin_std': 0.5175791382789612, 'epsilon_dpo/beta_margin_grad_mean': -0.3713095486164093, 'epsilon_dpo/beta_margin_grad_std': 0.11274945735931396, 'kl/beta': 0.0007018937030807137, 'kl/avg_steps': 0.78125, 'epoch': 0.95} 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 647/681 [44:36<01:30, 2.67s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 648/681 [44:38<01:27, 2.64s/it] {'loss': 0.9453, 'grad_norm': 52.17890930175781, 'learning_rate': 3.798061746947995e-09, 'rewards/chosen': -0.2765774726867676, 'rewards/rejected': -0.927411675453186, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.6508342027664185, 'logps/chosen': -432.31951904296875, 'logps/rejected': -1449.33203125, 'logps/ref_chosen': -33.676727294921875, 'logps/ref_rejected': -105.11663818359375, 'logits/chosen': -3.3199658393859863, 'logits/rejected': -4.965579986572266, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0006910806987434626, 'epsilon_dpo/loss_margin_mean': 945.5726318359375, 'epsilon_dpo/beta_margin_mean': 0.6508342027664185, 'epsilon_dpo/beta_margin_std': 0.699810802936554, 'epsilon_dpo/beta_margin_grad_mean': -0.35618698596954346, 'epsilon_dpo/beta_margin_grad_std': 0.14514584839344025, 'kl/beta': 0.0006964526255615056, 'kl/avg_steps': 0.78125, 'epoch': 0.95} 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 648/681 [44:38<01:27, 2.64s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 649/681 [44:41<01:25, 2.67s/it] {'loss': 1.1091, 'grad_norm': 
40.87678146362305, 'learning_rate': 3.5784585771215235e-09, 'rewards/chosen': -0.3326740860939026, 'rewards/rejected': -0.7718292474746704, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.43915510177612305, 'logps/chosen': -526.4951171875, 'logps/rejected': -1212.298095703125, 'logps/ref_chosen': -45.06011199951172, 'logps/ref_rejected': -86.40021514892578, 'logits/chosen': -3.504033088684082, 'logits/rejected': -4.718470573425293, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0006873422535136342, 'epsilon_dpo/loss_margin_mean': 644.4627685546875, 'epsilon_dpo/beta_margin_mean': 0.43915510177612305, 'epsilon_dpo/beta_margin_std': 0.7278276085853577, 'epsilon_dpo/beta_margin_grad_mean': -0.40578776597976685, 'epsilon_dpo/beta_margin_grad_std': 0.1477607786655426, 'kl/beta': 0.0006910538068041205, 'kl/avg_steps': 0.546875, 'epoch': 0.95} 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 649/681 [44:41<01:25, 2.67s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 650/681 [44:44<01:20, 2.61s/it] {'loss': 0.957, 'grad_norm': 45.862091064453125, 'learning_rate': 3.3653488440851253e-09, 'rewards/chosen': -0.36932051181793213, 'rewards/rejected': -1.0521395206451416, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.682819128036499, 'logps/chosen': -577.81494140625, 'logps/rejected': -1647.8070068359375, 'logps/ref_chosen': -39.21210861206055, 'logps/ref_rejected': -103.7435302734375, 'logits/chosen': -3.553635835647583, 'logits/rejected': -5.113847732543945, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0006830678903497756, 'epsilon_dpo/loss_margin_mean': 1005.4606323242188, 'epsilon_dpo/beta_margin_mean': 0.6828190684318542, 'epsilon_dpo/beta_margin_std': 0.8265655040740967, 'epsilon_dpo/beta_margin_grad_mean': -0.3580322861671448, 
'epsilon_dpo/beta_margin_grad_std': 0.16048245131969452, 'kl/beta': 0.0006872951635159552, 'kl/avg_steps': 0.625, 'epoch': 0.95} 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 650/681 [44:44<01:20, 2.61s/it] 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 651/681 [44:46<01:17, 2.58s/it] {'loss': 0.9624, 'grad_norm': 40.79838943481445, 'learning_rate': 3.158738163478475e-09, 'rewards/chosen': -0.3056422173976898, 'rewards/rejected': -0.9292333126068115, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6235911250114441, 'logps/chosen': -481.7735595703125, 'logps/rejected': -1477.223876953125, 'logps/ref_chosen': -34.16796875, 'logps/ref_rejected': -105.96416473388672, 'logits/chosen': -3.4843456745147705, 'logits/rejected': -4.997735500335693, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0006790386396460235, 'epsilon_dpo/loss_margin_mean': 923.6541137695312, 'epsilon_dpo/beta_margin_mean': 0.6235911250114441, 'epsilon_dpo/beta_margin_std': 0.7001570463180542, 'epsilon_dpo/beta_margin_grad_mean': -0.3646124601364136, 'epsilon_dpo/beta_margin_grad_std': 0.14354108273983002, 'kl/beta': 0.0006830262136645615, 'kl/avg_steps': 0.59375, 'epoch': 0.96} 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 651/681 [44:46<01:17, 2.58s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 652/681 [44:49<01:16, 2.64s/it] {'loss': 0.9666, 'grad_norm': 47.86579895019531, 'learning_rate': 2.9586319796851555e-09, 'rewards/chosen': -0.32234591245651245, 'rewards/rejected': -0.9603723883628845, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6380264759063721, 'logps/chosen': -517.8300170898438, 'logps/rejected': -1544.671142578125, 'logps/ref_chosen': -43.1708984375, 
'logps/ref_rejected': -119.19691467285156, 'logits/chosen': -3.3652729988098145, 'logits/rejected': -4.884889602661133, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0006750306929461658, 'epsilon_dpo/loss_margin_mean': 950.8151245117188, 'epsilon_dpo/beta_margin_mean': 0.6380264759063721, 'epsilon_dpo/beta_margin_std': 0.7482752203941345, 'epsilon_dpo/beta_margin_grad_mean': -0.36323583126068115, 'epsilon_dpo/beta_margin_grad_std': 0.1515437662601471, 'kl/beta': 0.0006789946928620338, 'kl/avg_steps': 0.59375, 'epoch': 0.96} 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 652/681 [44:49<01:16, 2.64s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 653/681 [44:52<01:13, 2.63s/it] {'loss': 0.8559, 'grad_norm': 48.55562973022461, 'learning_rate': 2.7650355656892166e-09, 'rewards/chosen': -0.28941768407821655, 'rewards/rejected': -1.0600758790969849, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.7706582546234131, 'logps/chosen': -467.5439453125, 'logps/rejected': -1696.89404296875, 'logps/ref_chosen': -36.02939987182617, 'logps/ref_rejected': -111.6745376586914, 'logits/chosen': -3.2305068969726562, 'logits/rejected': -5.08480167388916, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0006693587056361139, 'epsilon_dpo/loss_margin_mean': 1153.705078125, 'epsilon_dpo/beta_margin_mean': 0.7706581950187683, 'epsilon_dpo/beta_margin_std': 0.685066819190979, 'epsilon_dpo/beta_margin_grad_mean': -0.3340756595134735, 'epsilon_dpo/beta_margin_grad_std': 0.13510021567344666, 'kl/beta': 0.0006749869789928198, 'kl/avg_steps': 0.84375, 'epoch': 0.96} 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 653/681 [44:52<01:13, 2.63s/it] 
96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 654/681 [44:54<01:11, 2.63s/it] {'loss': 0.9534, 'grad_norm': 41.65330505371094, 'learning_rate': 2.577954022936174e-09, 'rewards/chosen': -0.2903170585632324, 'rewards/rejected': -0.8954617977142334, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.605144739151001, 'logps/chosen': -473.72698974609375, 'logps/rejected': -1454.8271484375, 'logps/ref_chosen': -38.32045364379883, 'logps/ref_rejected': -105.32196807861328, 'logits/chosen': -3.4308786392211914, 'logits/rejected': -4.949398040771484, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0006645949906669557, 'epsilon_dpo/loss_margin_mean': 914.0985717773438, 'epsilon_dpo/beta_margin_mean': 0.605144739151001, 'epsilon_dpo/beta_margin_std': 0.6342909932136536, 'epsilon_dpo/beta_margin_grad_mean': -0.36657387018203735, 'epsilon_dpo/beta_margin_grad_std': 0.1234474629163742, 'kl/beta': 0.0006693393806926906, 'kl/avg_steps': 0.71875, 'epoch': 0.96} 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 654/681 [44:54<01:11, 2.63s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 655/681 [44:57<01:07, 2.61s/it] {'loss': 0.9735, 'grad_norm': 43.04616165161133, 'learning_rate': 2.397392281198729e-09, 'rewards/chosen': -0.23886506259441376, 'rewards/rejected': -0.8641390800476074, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6252740025520325, 'logps/chosen': -389.5122985839844, 'logps/rejected': -1416.551025390625, 'logps/ref_chosen': -29.801528930664062, 'logps/ref_rejected': -104.75204467773438, 'logits/chosen': -3.374847173690796, 'logits/rejected': -4.775622844696045, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.000660475343465805, 'epsilon_dpo/loss_margin_mean': 952.0883178710938, 
'epsilon_dpo/beta_margin_mean': 0.6252740025520325, 'epsilon_dpo/beta_margin_std': 0.7545697093009949, 'epsilon_dpo/beta_margin_grad_mean': -0.3668808341026306, 'epsilon_dpo/beta_margin_grad_std': 0.14767590165138245, 'kl/beta': 0.0006645628600381315, 'kl/avg_steps': 0.625, 'epoch': 0.96} 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 655/681 [44:57<01:07, 2.61s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 656/681 [44:59<01:05, 2.63s/it] {'loss': 0.8158, 'grad_norm': 51.54441833496094, 'learning_rate': 2.223355098446622e-09, 'rewards/chosen': -0.2671954035758972, 'rewards/rejected': -1.0838143825531006, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.8166189193725586, 'logps/chosen': -448.05487060546875, 'logps/rejected': -1777.021484375, 'logps/ref_chosen': -40.732295989990234, 'logps/ref_rejected': -120.36123657226562, 'logits/chosen': -3.3955602645874023, 'logits/rejected': -5.103156089782715, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0006549282115884125, 'epsilon_dpo/loss_margin_mean': 1249.337646484375, 'epsilon_dpo/beta_margin_mean': 0.8166189193725586, 'epsilon_dpo/beta_margin_std': 0.6401075720787048, 'epsilon_dpo/beta_margin_grad_mean': -0.32215529680252075, 'epsilon_dpo/beta_margin_grad_std': 0.1285041719675064, 'kl/beta': 0.0006604351219721138, 'kl/avg_steps': 0.84375, 'epoch': 0.96} 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 656/681 [44:59<01:05, 2.63s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 657/681 [45:02<01:01, 2.57s/it] {'loss': 0.9133, 'grad_norm': 43.996578216552734, 'learning_rate': 2.055847060721566e-09, 'rewards/chosen': -0.29575812816619873, 'rewards/rejected': -0.9470652341842651, 
'rewards/accuracies': 0.90625, 'rewards/margins': 0.6513071060180664, 'logps/chosen': -486.0547790527344, 'logps/rejected': -1563.21728515625, 'logps/ref_chosen': -32.56511688232422, 'logps/ref_rejected': -104.74242401123047, 'logits/chosen': -3.505584716796875, 'logits/rejected': -5.049525260925293, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0006502671749331057, 'epsilon_dpo/loss_margin_mean': 1004.9851684570312, 'epsilon_dpo/beta_margin_mean': 0.6513071060180664, 'epsilon_dpo/beta_margin_std': 0.5884512662887573, 'epsilon_dpo/beta_margin_grad_mean': -0.3548159599304199, 'epsilon_dpo/beta_margin_grad_std': 0.12233418226242065, 'kl/beta': 0.0006549093523062766, 'kl/avg_steps': 0.71875, 'epoch': 0.96} 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 657/681 [45:02<01:01, 2.57s/it] 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 658/681 [45:05<00:59, 2.61s/it] {'loss': 0.9409, 'grad_norm': 35.85395431518555, 'learning_rate': 1.8948725820160662e-09, 'rewards/chosen': -0.2720116972923279, 'rewards/rejected': -0.8697974681854248, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.5977858304977417, 'logps/chosen': -452.764404296875, 'logps/rejected': -1451.1217041015625, 'logps/ref_chosen': -32.16458511352539, 'logps/ref_rejected': -100.98091125488281, 'logits/chosen': -3.3597793579101562, 'logits/rejected': -5.021791458129883, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0006452203379012644, 'epsilon_dpo/loss_margin_mean': 929.5409545898438, 'epsilon_dpo/beta_margin_mean': 0.5977858304977417, 'epsilon_dpo/beta_margin_std': 0.5393864512443542, 'epsilon_dpo/beta_margin_grad_mean': -0.36330175399780273, 'epsilon_dpo/beta_margin_grad_std': 0.11596217751502991, 'kl/beta': 0.0006502358010038733, 'kl/avg_steps': 0.78125, 'epoch': 0.97} 
97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 658/681 [45:05<00:59, 2.61s/it] 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 659/681 [45:07<00:57, 2.63s/it] {'loss': 0.9852, 'grad_norm': 40.669891357421875, 'learning_rate': 1.7404359041573723e-09, 'rewards/chosen': -0.2872786521911621, 'rewards/rejected': -0.8520359992980957, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.5647573471069336, 'logps/chosen': -490.76068115234375, 'logps/rejected': -1424.7255859375, 'logps/ref_chosen': -44.455406188964844, 'logps/ref_rejected': -93.29725646972656, 'logits/chosen': -3.3426105976104736, 'logits/rejected': -4.952641487121582, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0006414285162463784, 'epsilon_dpo/loss_margin_mean': 885.123046875, 'epsilon_dpo/beta_margin_mean': 0.5647572875022888, 'epsilon_dpo/beta_margin_std': 0.6336069703102112, 'epsilon_dpo/beta_margin_grad_mean': -0.3757038414478302, 'epsilon_dpo/beta_margin_grad_std': 0.12834706902503967, 'kl/beta': 0.0006451951921917498, 'kl/avg_steps': 0.59375, 'epoch': 0.97} 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 659/681 [45:07<00:57, 2.63s/it] 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 660/681 [45:10<00:55, 2.65s/it] {'loss': 0.9552, 'grad_norm': 37.189697265625, 'learning_rate': 1.592541096695571e-09, 'rewards/chosen': -0.26376718282699585, 'rewards/rejected': -0.8527300953865051, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.5889629125595093, 'logps/chosen': -454.3048095703125, 'logps/rejected': -1423.1195068359375, 'logps/ref_chosen': -41.76560974121094, 'logps/ref_rejected': -82.32925415039062, 'logits/chosen': -3.402561664581299, 'logits/rejected': -5.116570472717285, 
'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0006372415809892118, 'epsilon_dpo/loss_margin_mean': 928.2510375976562, 'epsilon_dpo/beta_margin_mean': 0.5889629125595093, 'epsilon_dpo/beta_margin_std': 0.5796295404434204, 'epsilon_dpo/beta_margin_grad_mean': -0.36822497844696045, 'epsilon_dpo/beta_margin_grad_std': 0.1204964891076088, 'kl/beta': 0.0006413869559764862, 'kl/avg_steps': 0.65625, 'epoch': 0.97} 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 660/681 [45:10<00:55, 2.65s/it] 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 661/681 [45:12<00:51, 2.55s/it] {'loss': 0.9829, 'grad_norm': 32.038639068603516, 'learning_rate': 1.4511920567963908e-09, 'rewards/chosen': -0.29568618535995483, 'rewards/rejected': -0.8556575775146484, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.5599713921546936, 'logps/chosen': -504.87115478515625, 'logps/rejected': -1446.969482421875, 'logps/ref_chosen': -38.271453857421875, 'logps/ref_rejected': -92.10589599609375, 'logits/chosen': -3.5372705459594727, 'logits/rejected': -4.98753547668457, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0006320912507362664, 'epsilon_dpo/loss_margin_mean': 888.2638549804688, 'epsilon_dpo/beta_margin_mean': 0.5599713921546936, 'epsilon_dpo/beta_margin_std': 0.6463156938552856, 'epsilon_dpo/beta_margin_grad_mean': -0.3780907392501831, 'epsilon_dpo/beta_margin_grad_std': 0.11522994190454483, 'kl/beta': 0.0006372053176164627, 'kl/avg_steps': 0.8125, 'epoch': 0.97} 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 661/681 [45:12<00:51, 2.55s/it] 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 662/681 [45:15<00:48, 2.54s/it] {'loss': 0.9818, 
'grad_norm': 37.34529113769531, 'learning_rate': 1.3163925091384532e-09, 'rewards/chosen': -0.31171175837516785, 'rewards/rejected': -0.9033301472663879, 'rewards/accuracies': 0.875, 'rewards/margins': 0.5916184186935425, 'logps/chosen': -533.3956298828125, 'logps/rejected': -1537.224365234375, 'logps/ref_chosen': -39.42928695678711, 'logps/ref_rejected': -96.40357208251953, 'logits/chosen': -3.4751718044281006, 'logits/rejected': -5.172747611999512, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0006279845256358385, 'epsilon_dpo/loss_margin_mean': 946.8544311523438, 'epsilon_dpo/beta_margin_mean': 0.5916183590888977, 'epsilon_dpo/beta_margin_std': 0.681064784526825, 'epsilon_dpo/beta_margin_grad_mean': -0.3694445788860321, 'epsilon_dpo/beta_margin_grad_std': 0.1416751593351364, 'kl/beta': 0.0006320697139017284, 'kl/avg_steps': 0.65625, 'epoch': 0.97} 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 662/681 [45:15<00:48, 2.54s/it] 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 663/681 [45:17<00:45, 2.55s/it] {'loss': 0.9729, 'grad_norm': 46.52898025512695, 'learning_rate': 1.1881460058152382e-09, 'rewards/chosen': -0.2980585992336273, 'rewards/rejected': -0.9053161144256592, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6072574853897095, 'logps/chosen': -518.152099609375, 'logps/rejected': -1573.593017578125, 'logps/ref_chosen': -44.08625411987305, 'logps/ref_rejected': -121.31452178955078, 'logits/chosen': -3.4985265731811523, 'logits/rejected': -4.843735694885254, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0006246752454899251, 'epsilon_dpo/loss_margin_mean': 978.2127075195312, 'epsilon_dpo/beta_margin_mean': 0.6072575449943542, 'epsilon_dpo/beta_margin_std': 0.6948506832122803, 'epsilon_dpo/beta_margin_grad_mean': -0.3684435486793518, 
'epsilon_dpo/beta_margin_grad_std': 0.14285063743591309, 'kl/beta': 0.0006279487861320376, 'kl/avg_steps': 0.53125, 'epoch': 0.97} 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 663/681 [45:17<00:45, 2.55s/it] 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 664/681 [45:20<00:44, 2.60s/it] {'loss': 0.8558, 'grad_norm': 43.377994537353516, 'learning_rate': 1.066455926241383e-09, 'rewards/chosen': -0.19074614346027374, 'rewards/rejected': -0.943513035774231, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.7527669072151184, 'logps/chosen': -338.7730712890625, 'logps/rejected': -1640.69140625, 'logps/ref_chosen': -31.542118072509766, 'logps/ref_rejected': -116.06498718261719, 'logits/chosen': -3.1317358016967773, 'logits/rejected': -5.002200126647949, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0006194221205078065, 'epsilon_dpo/loss_margin_mean': 1217.3953857421875, 'epsilon_dpo/beta_margin_mean': 0.7527669072151184, 'epsilon_dpo/beta_margin_std': 0.655570924282074, 'epsilon_dpo/beta_margin_grad_mean': -0.33726438879966736, 'epsilon_dpo/beta_margin_grad_std': 0.1218588799238205, 'kl/beta': 0.0006246304837986827, 'kl/avg_steps': 0.84375, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 664/681 [45:20<00:44, 2.60s/it] 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 665/681 [45:23<00:41, 2.58s/it] {'loss': 0.9663, 'grad_norm': 44.8764762878418, 'learning_rate': 9.513254770636137e-10, 'rewards/chosen': -0.2572126090526581, 'rewards/rejected': -0.8239928483963013, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.5667802095413208, 'logps/chosen': -454.4164123535156, 'logps/rejected': -1432.787353515625, 'logps/ref_chosen': 
-36.933109283447266, 'logps/ref_rejected': -89.9020767211914, 'logits/chosen': -3.3389735221862793, 'logits/rejected': -5.07022762298584, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0006142394850030541, 'epsilon_dpo/loss_margin_mean': 925.40185546875, 'epsilon_dpo/beta_margin_mean': 0.5667802691459656, 'epsilon_dpo/beta_margin_std': 0.5540540218353271, 'epsilon_dpo/beta_margin_grad_mean': -0.37084051966667175, 'epsilon_dpo/beta_margin_grad_std': 0.11781810969114304, 'kl/beta': 0.0006194042507559061, 'kl/avg_steps': 0.84375, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 665/681 [45:23<00:41, 2.58s/it] 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 666/681 [45:25<00:38, 2.60s/it] {'loss': 1.0509, 'grad_norm': 41.77029037475586, 'learning_rate': 8.427576920763956e-10, 'rewards/chosen': -0.3538281321525574, 'rewards/rejected': -0.8195853233337402, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.46575719118118286, 'logps/chosen': -623.1746826171875, 'logps/rejected': -1445.141845703125, 'logps/ref_chosen': -47.59907913208008, 'logps/ref_rejected': -102.33778381347656, 'logits/chosen': -3.6731343269348145, 'logits/rejected': -4.908390522003174, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0006110197864472866, 'epsilon_dpo/loss_margin_mean': 767.228515625, 'epsilon_dpo/beta_margin_mean': 0.46575719118118286, 'epsilon_dpo/beta_margin_std': 0.5870808362960815, 'epsilon_dpo/beta_margin_grad_mean': -0.39552050828933716, 'epsilon_dpo/beta_margin_grad_std': 0.1250423640012741, 'kl/beta': 0.0006142217316664755, 'kl/avg_steps': 0.53125, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 666/681 [45:25<00:38, 2.60s/it] 
98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 667/681 [45:28<00:36, 2.58s/it] {'loss': 1.0431, 'grad_norm': 38.44181823730469, 'learning_rate': 7.407554321417764e-10, 'rewards/chosen': -0.320189893245697, 'rewards/rejected': -0.8006361722946167, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.4804462790489197, 'logps/chosen': -567.8262939453125, 'logps/rejected': -1414.545654296875, 'logps/ref_chosen': -43.69598388671875, 'logps/ref_rejected': -93.95926666259766, 'logits/chosen': -3.463103771209717, 'logits/rejected': -4.9513959884643555, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.000606836169026792, 'epsilon_dpo/loss_margin_mean': 796.4560546875, 'epsilon_dpo/beta_margin_mean': 0.4804462492465973, 'epsilon_dpo/beta_margin_std': 0.5962188839912415, 'epsilon_dpo/beta_margin_grad_mean': -0.38951799273490906, 'epsilon_dpo/beta_margin_grad_std': 0.12825334072113037, 'kl/beta': 0.0006109759560786188, 'kl/avg_steps': 0.6875, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 667/681 [45:28<00:36, 2.58s/it] 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 668/681 [45:31<00:34, 2.65s/it] {'loss': 0.9999, 'grad_norm': 37.77009963989258, 'learning_rate': 6.453213851142225e-10, 'rewards/chosen': -0.31520360708236694, 'rewards/rejected': -0.8256018161773682, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.5103981494903564, 'logps/chosen': -569.862548828125, 'logps/rejected': -1479.616943359375, 'logps/ref_chosen': -48.83540344238281, 'logps/ref_rejected': -108.24037170410156, 'logits/chosen': -3.5545034408569336, 'logits/rejected': -4.857883453369141, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0006032615783624351, 'epsilon_dpo/loss_margin_mean': 850.3494262695312, 
'epsilon_dpo/beta_margin_mean': 0.5103982090950012, 'epsilon_dpo/beta_margin_std': 0.517078161239624, 'epsilon_dpo/beta_margin_grad_mean': -0.3832368850708008, 'epsilon_dpo/beta_margin_grad_std': 0.11170457303524017, 'kl/beta': 0.0006068041548132896, 'kl/avg_steps': 0.59375, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 668/681 [45:31<00:34, 2.65s/it] 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 669/681 [45:33<00:32, 2.67s/it] {'loss': 1.0343, 'grad_norm': 37.88663101196289, 'learning_rate': 5.564580657695939e-10, 'rewards/chosen': -0.2562495172023773, 'rewards/rejected': -0.7303292751312256, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.4740796983242035, 'logps/chosen': -459.27081298828125, 'logps/rejected': -1305.064208984375, 'logps/ref_chosen': -33.70474624633789, 'logps/ref_rejected': -84.20474243164062, 'logits/chosen': -3.4180922508239746, 'logits/rejected': -4.825163841247559, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0005995123065076768, 'epsilon_dpo/loss_margin_mean': 795.2932739257812, 'epsilon_dpo/beta_margin_mean': 0.4740796983242035, 'epsilon_dpo/beta_margin_std': 0.5450995564460754, 'epsilon_dpo/beta_margin_grad_mean': -0.3920701742172241, 'epsilon_dpo/beta_margin_grad_std': 0.11787072569131851, 'kl/beta': 0.0006032225210219622, 'kl/avg_steps': 0.625, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 669/681 [45:33<00:32, 2.67s/it] 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 670/681 [45:36<00:29, 2.67s/it] {'loss': 1.0026, 'grad_norm': 38.17559051513672, 'learning_rate': 4.741678157389739e-10, 'rewards/chosen': -0.27749842405319214, 'rewards/rejected': -0.8247913122177124, 
'rewards/accuracies': 0.890625, 'rewards/margins': 0.547292947769165, 'logps/chosen': -505.29241943359375, 'logps/rejected': -1491.165283203125, 'logps/ref_chosen': -41.06857681274414, 'logps/ref_rejected': -103.42701721191406, 'logits/chosen': -3.493618965148926, 'logits/rejected': -5.067645072937012, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0005952265928499401, 'epsilon_dpo/loss_margin_mean': 923.514404296875, 'epsilon_dpo/beta_margin_mean': 0.5472928881645203, 'epsilon_dpo/beta_margin_std': 0.6459078788757324, 'epsilon_dpo/beta_margin_grad_mean': -0.3791544735431671, 'epsilon_dpo/beta_margin_grad_std': 0.1323607712984085, 'kl/beta': 0.0005994758103042841, 'kl/avg_steps': 0.71875, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 670/681 [45:36<00:29, 2.67s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 671/681 [45:38<00:25, 2.60s/it] {'loss': 1.0497, 'grad_norm': 41.20299530029297, 'learning_rate': 3.9845280344705245e-10, 'rewards/chosen': -0.3173062801361084, 'rewards/rejected': -0.8268150091171265, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.5095087289810181, 'logps/chosen': -567.7605590820312, 'logps/rejected': -1490.3914794921875, 'logps/ref_chosen': -35.17292785644531, 'logps/ref_rejected': -90.90328216552734, 'logits/chosen': -3.6080431938171387, 'logits/rejected': -5.071390151977539, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.000591909047216177, 'epsilon_dpo/loss_margin_mean': 866.9005737304688, 'epsilon_dpo/beta_margin_mean': 0.5095086693763733, 'epsilon_dpo/beta_margin_std': 0.7063105702400208, 'epsilon_dpo/beta_margin_grad_mean': -0.38887372612953186, 'epsilon_dpo/beta_margin_grad_std': 0.14615273475646973, 'kl/beta': 0.0005951978382654488, 'kl/avg_steps': 0.5625, 'epoch': 0.99} 
99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 671/681 [45:38<00:25, 2.60s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 672/681 [45:41<00:23, 2.63s/it] {'loss': 1.0379, 'grad_norm': 42.78655242919922, 'learning_rate': 3.293150240547549e-10, 'rewards/chosen': -0.32212620973587036, 'rewards/rejected': -0.8134155869483948, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.491289347410202, 'logps/chosen': -580.21044921875, 'logps/rejected': -1485.48779296875, 'logps/ref_chosen': -35.802879333496094, 'logps/ref_rejected': -100.60040283203125, 'logits/chosen': -3.515069007873535, 'logits/rejected': -5.043622970581055, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.000588598137255758, 'epsilon_dpo/loss_margin_mean': 840.4798583984375, 'epsilon_dpo/beta_margin_mean': 0.491289347410202, 'epsilon_dpo/beta_margin_std': 0.6089746356010437, 'epsilon_dpo/beta_margin_grad_mean': -0.3892599046230316, 'epsilon_dpo/beta_margin_grad_std': 0.13194513320922852, 'kl/beta': 0.0005918685346841812, 'kl/avg_steps': 0.5625, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 672/681 [45:41<00:23, 2.63s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 673/681 [45:44<00:20, 2.59s/it] {'loss': 0.963, 'grad_norm': 38.58977508544922, 'learning_rate': 2.6675629940689504e-10, 'rewards/chosen': -0.2182503640651703, 'rewards/rejected': -0.7731301784515381, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.5548798441886902, 'logps/chosen': -399.2283935546875, 'logps/rejected': -1416.7021484375, 'logps/ref_chosen': -26.926271438598633, 'logps/ref_rejected': -91.83390808105469, 'logits/chosen': -3.419285297393799, 'logits/rejected': -5.03121280670166, 
'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0005843861144967377, 'epsilon_dpo/loss_margin_mean': 952.566162109375, 'epsilon_dpo/beta_margin_mean': 0.5548798441886902, 'epsilon_dpo/beta_margin_std': 0.5018904209136963, 'epsilon_dpo/beta_margin_grad_mean': -0.3730253279209137, 'epsilon_dpo/beta_margin_grad_std': 0.10759243369102478, 'kl/beta': 0.0005885579157620668, 'kl/avg_steps': 0.71875, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 673/681 [45:44<00:20, 2.59s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 674/681 [45:46<00:17, 2.54s/it] {'loss': 1.0513, 'grad_norm': 32.84586715698242, 'learning_rate': 2.1077827798404725e-10, 'rewards/chosen': -0.25154006481170654, 'rewards/rejected': -0.7310316562652588, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.479491651058197, 'logps/chosen': -462.48150634765625, 'logps/rejected': -1334.8271484375, 'logps/ref_chosen': -32.161155700683594, 'logps/ref_rejected': -73.33118438720703, 'logits/chosen': -3.5797486305236816, 'logits/rejected': -5.058620452880859, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0005809462745673954, 'epsilon_dpo/loss_margin_mean': 831.1757202148438, 'epsilon_dpo/beta_margin_mean': 0.47949162125587463, 'epsilon_dpo/beta_margin_std': 0.6231409311294556, 'epsilon_dpo/beta_margin_grad_mean': -0.392082154750824, 'epsilon_dpo/beta_margin_grad_std': 0.13613100349903107, 'kl/beta': 0.0005843578255735338, 'kl/avg_steps': 0.59375, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 674/681 [45:46<00:17, 2.54s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 675/681 [45:48<00:14, 2.48s/it] {'loss': 0.9896, 
'grad_norm': 30.256996154785156, 'learning_rate': 1.6138243485910863e-10, 'rewards/chosen': -0.20505669713020325, 'rewards/rejected': -0.7371607422828674, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.5321040749549866, 'logps/chosen': -380.3410339355469, 'logps/rejected': -1360.566162109375, 'logps/ref_chosen': -26.25046157836914, 'logps/ref_rejected': -80.57308959960938, 'logits/chosen': -3.4077110290527344, 'logits/rejected': -5.058513641357422, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0005764279630966485, 'epsilon_dpo/loss_margin_mean': 925.9025268554688, 'epsilon_dpo/beta_margin_mean': 0.5321040749549866, 'epsilon_dpo/beta_margin_std': 0.5455756187438965, 'epsilon_dpo/beta_margin_grad_mean': -0.37939831614494324, 'epsilon_dpo/beta_margin_grad_std': 0.11590909957885742, 'kl/beta': 0.0005809086724184453, 'kl/avg_steps': 0.78125, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 675/681 [45:48<00:14, 2.48s/it] 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 676/681 [45:51<00:12, 2.49s/it] {'loss': 1.0612, 'grad_norm': 36.964908599853516, 'learning_rate': 1.1857007165852472e-10, 'rewards/chosen': -0.332570344209671, 'rewards/rejected': -0.7871416211128235, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.4545712471008301, 'logps/chosen': -621.6793823242188, 'logps/rejected': -1472.0751953125, 'logps/ref_chosen': -44.59165954589844, 'logps/ref_rejected': -96.4842300415039, 'logits/chosen': -3.637737512588501, 'logits/rejected': -4.942368507385254, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0005732206045649946, 'epsilon_dpo/loss_margin_mean': 798.5032958984375, 'epsilon_dpo/beta_margin_mean': 0.45457127690315247, 'epsilon_dpo/beta_margin_std': 0.583877682685852, 'epsilon_dpo/beta_margin_grad_mean': 
-0.39650610089302063, 'epsilon_dpo/beta_margin_grad_std': 0.13006651401519775, 'kl/beta': 0.0005764055531471968, 'kl/avg_steps': 0.5625, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 676/681 [45:51<00:12, 2.49s/it] 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎| 677/681 [45:53<00:10, 2.52s/it] {'loss': 0.9904, 'grad_norm': 35.77093505859375, 'learning_rate': 8.23423165278725e-11, 'rewards/chosen': -0.23343393206596375, 'rewards/rejected': -0.7628414630889893, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.5294075012207031, 'logps/chosen': -438.6134948730469, 'logps/rejected': -1427.509521484375, 'logps/ref_chosen': -29.166034698486328, 'logps/ref_rejected': -84.22445678710938, 'logits/chosen': -3.3925907611846924, 'logits/rejected': -5.077644348144531, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0005685811047442257, 'epsilon_dpo/loss_margin_mean': 933.8375244140625, 'epsilon_dpo/beta_margin_mean': 0.5294075012207031, 'epsilon_dpo/beta_margin_std': 0.5406895279884338, 'epsilon_dpo/beta_margin_grad_mean': -0.3787325918674469, 'epsilon_dpo/beta_margin_grad_std': 0.11398439854383469, 'kl/beta': 0.0005731813726015389, 'kl/avg_steps': 0.8125, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎| 677/681 [45:53<00:10, 2.52s/it] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍| 678/681 [45:56<00:07, 2.54s/it] {'loss': 1.0318, 'grad_norm': 34.48164749145508, 'learning_rate': 5.270012410216185e-11, 'rewards/chosen': -0.24528416991233826, 'rewards/rejected': -0.7404882907867432, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.4952041506767273, 'logps/chosen': -463.51409912109375, 'logps/rejected': -1399.691162109375, 
'logps/ref_chosen': -31.562625885009766, 'logps/ref_rejected': -86.20599365234375, 'logits/chosen': -3.5082788467407227, 'logits/rejected': -5.051633358001709, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0005647094221785665, 'epsilon_dpo/loss_margin_mean': 881.53369140625, 'epsilon_dpo/beta_margin_mean': 0.4952041506767273, 'epsilon_dpo/beta_margin_std': 0.5945535898208618, 'epsilon_dpo/beta_margin_grad_mean': -0.3867654800415039, 'epsilon_dpo/beta_margin_grad_std': 0.12935031950473785, 'kl/beta': 0.0005685618380084634, 'kl/avg_steps': 0.6875, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍| 678/681 [45:56<00:07, 2.54s/it] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 679/681 [45:58<00:04, 2.48s/it] {'loss': 1.0791, 'grad_norm': 37.97880935668945, 'learning_rate': 2.9644275480772416e-11, 'rewards/chosen': -0.2744186222553253, 'rewards/rejected': -0.691596269607544, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.41717761754989624, 'logps/chosen': -521.1033935546875, 'logps/rejected': -1316.021484375, 'logps/ref_chosen': -35.110084533691406, 'logps/ref_rejected': -82.25491333007812, 'logits/chosen': -3.43449068069458, 'logits/rejected': -4.8752970695495605, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.0005620888550765812, 'epsilon_dpo/loss_margin_mean': 747.7733154296875, 'epsilon_dpo/beta_margin_mean': 0.41717761754989624, 'epsilon_dpo/beta_margin_std': 0.5460913777351379, 'epsilon_dpo/beta_margin_grad_mean': -0.4052530825138092, 'epsilon_dpo/beta_margin_grad_std': 0.11750102043151855, 'kl/beta': 0.0005646796198561788, 'kl/avg_steps': 0.46875, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 679/681 [45:58<00:04, 2.48s/it] 
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 680/681 [46:01<00:02, 2.54s/it] {'loss': 0.9288, 'grad_norm': 45.923912048339844, 'learning_rate': 1.31753782067201e-11, 'rewards/chosen': -0.24402347207069397, 'rewards/rejected': -0.8650712966918945, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.621047854423523, 'logps/chosen': -484.4205017089844, 'logps/rejected': -1670.13427734375, 'logps/ref_chosen': -49.117393493652344, 'logps/ref_rejected': -118.43990325927734, 'logits/chosen': -3.399845600128174, 'logits/rejected': -4.949437618255615, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0005580611759796739, 'epsilon_dpo/loss_margin_mean': 1116.3912353515625, 'epsilon_dpo/beta_margin_mean': 0.621047854423523, 'epsilon_dpo/beta_margin_std': 0.5679141879081726, 'epsilon_dpo/beta_margin_grad_mean': -0.360996276140213, 'epsilon_dpo/beta_margin_grad_std': 0.11674246937036514, 'kl/beta': 0.0005620450829155743, 'kl/avg_steps': 0.71875, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 680/681 [46:01<00:02, 2.54s/it] 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [46:04<00:00, 2.54s/it] {'loss': 0.9757, 'grad_norm': 39.61741638183594, 'learning_rate': 3.2938662507808745e-12, 'rewards/chosen': -0.2478674352169037, 'rewards/rejected': -0.7954316735267639, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.5475642085075378, 'logps/chosen': -487.08197021484375, 'logps/rejected': -1532.971923828125, 'logps/ref_chosen': -40.42382049560547, 'logps/ref_rejected': -94.08821105957031, 'logits/chosen': -3.4865760803222656, 'logits/rejected': -5.149275302886963, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0005535554955713451, 'epsilon_dpo/loss_margin_mean': 
992.2255859375, 'epsilon_dpo/beta_margin_mean': 0.5475642085075378, 'epsilon_dpo/beta_margin_std': 0.535968005657196, 'epsilon_dpo/beta_margin_grad_mean': -0.37510794401168823, 'epsilon_dpo/beta_margin_grad_std': 0.11336696147918701, 'kl/beta': 0.00055803416762501, 'kl/avg_steps': 0.8125, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [46:04<00:00, 2.54s/it]
[INFO|trainer.py:3984] 2026-04-18 10:17:43,490 >> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-681
[INFO|configuration_utils.py:419] 2026-04-18 10:17:43,600 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-681/config.json
[INFO|configuration_utils.py:911] 2026-04-18 10:17:43,698 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-681/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-18 10:18:38,983 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-681/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-18 10:18:39,006 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-681/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-18 10:18:39,011 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-681/special_tokens_map.json
[INFO|trainer.py:4083] 2026-04-18 10:22:03,798 >> Deleting older checkpoint [/scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/checkpoint-400] due to args.save_total_limit
[INFO|trainer.py:2681] 2026-04-18 10:22:07,478 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 3053.5655, 'train_samples_per_second': 14.278, 'train_steps_per_second': 0.223, 'train_loss': 0.6833117403385398, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [50:45<00:00, 2.54s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [50:45<00:00, 4.47s/it]
***** train metrics *****
  epoch                    = 1.0
  total_flos               = 0GF
  train_loss               = 0.6833
  train_runtime            = 0:50:53.56
  train_samples            = 43598
  train_samples_per_second = 14.278
  train_steps_per_second   = 0.223
2026-04-18 10:22:07 - INFO - __main__ - *** Training complete ***
2026-04-18 10:22:07 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-18 10:22:23,810 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/config.json
[INFO|configuration_utils.py:911] 2026-04-18 10:22:23,866 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-18 10:23:14,522 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-18 10:23:14,538 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-18 10:23:14,542 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/special_tokens_map.json
2026-04-18 10:23:14 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332
[INFO|modelcard.py:450] 2026-04-18 10:23:14,816 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}}
[INFO|configuration_utils.py:419] 2026-04-18 10:23:14,830 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332/config.json
2026-04-18 10:23:14 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-18 10:23:14,831 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-18 10:23:14,831 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-18 10:23:14,832 >> Batch size = 8
0%| | 0/73 [00:00