2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='Qwen/Qwen3-8B-Base', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8')
2026-04-14 19:26:26 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=False, hf_cache_dir=None, truncation_side=None, auto_insert_empty_system_msg=True, preprocessing_log_samples=0, preprocessing_log_dir=None)
2026-04-14 19:26:26 - INFO - __main__ - Training/evaluation parameters SFTConfig(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
chars_per_token=<CHARS_PER_TOKEN>,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
dataset_batch_size=1000,
dataset_kwargs=None,
dataset_num_proc=None,
dataset_text_field=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_packing=None,
eval_steps=100,
eval_strategy=IntervalStrategy.STEPS,
eval_use_gather_object=False,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=qwen3-8b-base-sft-hh-helpful-8xh200,
hub_model_revision=main,
hub_private_repo=None,
hub_strategy=HubStrategy.END,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=outputs/qwen3-8b-base-sft-hh-helpful-8xh200/runs/Apr14_19-26-24_d4053,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_seq_length=512,
max_steps=-1,
metric_for_best_model=None,
model_init_kwargs=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_of_sequences=1024,
num_train_epochs=1,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=/scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981,
overwrite_output_dir=True,
packing=False,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=16,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=SaveStrategy.STEPS,
save_total_limit=2,
seed=42,
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tp_size=0,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1 distributed training: True, 16-bits training: False
2026-04-14 19:26:26 - INFO - datasets.utils.file_utils - hf://datasets/Anthropic/hh-rlhf@09be8c5bbc57cb3887f3a9732ad6aa7ec602a1fa/README.md not found in cache or force_download set to True, downloading to /scratch/qu.yang1/hf/datasets/downloads/4f81b314c38f15a41852e8177b3798694a3b34661cba63570f6f4cb2bfd194b9.incomplete
Downloading readme: 0%| | 0.00/5.77k [00:00<?, ?B/s]
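For orientation, the parameter dump above boils down to a handful of effective hyperparameters. A minimal sketch of reproducing them with TRL's SFTConfig, assuming the TRL version in this environment (the output path is a placeholder; everything else mirrors the logged values):

from trl import SFTConfig

# Mirrors the logged SFTConfig values; not the actual launch script.
config = SFTConfig(
    output_dir="outputs/qwen3-8b-base-sft-hh-helpful-8xh200",  # placeholder
    bf16=True,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=16,   # x 8 GPUs x 1 accumulation step = 128 global
    per_device_eval_batch_size=16,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    eval_strategy="steps",
    eval_steps=100,
    save_steps=200,
    save_total_limit=2,
    logging_steps=10,
    logging_first_step=True,
    max_seq_length=512,
    report_to=["wandb"],
    seed=42,
)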
[INFO|tokenization_utils_base.py:2060] 2026-04-14 19:26:38,916 >> loading file vocab.json from cache at /scratch/qu.yang1/hf/hub/models--Qwen--Qwen3-8B-Base/snapshots/49e3418fbbbca6ecbdf9608b4d22e5a407081db4/vocab.json
[INFO|tokenization_utils_base.py:2060] 2026-04-14 19:26:38,916 >> loading file merges.txt from cache at /scratch/qu.yang1/hf/hub/models--Qwen--Qwen3-8B-Base/snapshots/49e3418fbbbca6ecbdf9608b4d22e5a407081db4/merges.txt
[INFO|tokenization_utils_base.py:2060] 2026-04-14 19:26:38,916 >> loading file tokenizer.json from cache at /scratch/qu.yang1/hf/hub/models--Qwen--Qwen3-8B-Base/snapshots/49e3418fbbbca6ecbdf9608b4d22e5a407081db4/tokenizer.json
[INFO|tokenization_utils_base.py:2060] 2026-04-14 19:26:38,916 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2060] 2026-04-14 19:26:38,916 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2060] 2026-04-14 19:26:38,916 >> loading file tokenizer_config.json from cache at /scratch/qu.yang1/hf/hub/models--Qwen--Qwen3-8B-Base/snapshots/49e3418fbbbca6ecbdf9608b4d22e5a407081db4/tokenizer_config.json
[INFO|tokenization_utils_base.py:2060] 2026-04-14 19:26:38,916 >> loading file chat_template.jinja from cache at None
[INFO|tokenization_utils_base.py:2323] 2026-04-14 19:26:39,206 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2026-04-14 19:26:39 - INFO - __main__ - *** Load pretrained model ***
Applying chat template (num_proc=12): 0%| | 0/43598 [00:00<?, ? examples/s]
2026-04-14 19:27:37 - INFO - __main__ - Sample … of the processed training set:
<|im_start|>system
<|im_end|>
<|im_start|>user
What is the best way to clean my refrigerator?<|im_end|>
<|im_start|>assistant
Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|im_end|>
<|im_start|>user
What are your ideas?<|im_end|>
<|im_start|>assistant
Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|im_end|>
2026-04-14 19:27:37 - INFO - __main__ - Sample 7296 of the processed training set:
<|im_start|>system
<|im_end|>
<|im_start|>user
How can I get blood out of cotton?<|im_end|>
<|im_start|>assistant
Blood? What do you mean?<|im_end|>
2026-04-14 19:27:37 - INFO - __main__ - Sample 1639 of the processed training set:
<|im_start|>system
<|im_end|>
<|im_start|>user
I'm trying to get my friend to open up<|im_end|>
<|im_start|>assistant
Sounds great! What have you tried so far?<|im_end|>
<|im_start|>user
I've tried to give my friend a safe space to open up about their feelings but they still aren't comfortable<|im_end|>
<|im_start|>assistant
It’s really important that people feel safe sharing their feelings, but I totally hear that this isn’t easy for them. And in my experience, that has a lot to do with them feeling that you can accept them and what they’re saying, even if they’re feeling difficult or uncomfortable feelings. Do you think you’ve done that, and that they still haven’t opened up?<|im_end|>
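The samples above are hh-rlhf dialogues rendered into Qwen's ChatML format, with an empty system message inserted first (matching auto_insert_empty_system_msg=True). A sketch of that rendering step, assuming the tokenizer carries (or has been assigned) a ChatML chat template; the hand-built messages list below stands in for the script's actual hh-rlhf parsing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")

# Hand-built example; the real pipeline converts hh-rlhf transcripts
# into this messages format before templating.
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "How can I get blood out of cotton?"},
    {"role": "assistant", "content": "Blood? What do you mean?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # <|im_start|>...<|im_end|> blocks, as in the logged samples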
Filter: 100%|██████████| 43598/43598 [00:55<00:00, 836.17 examples/s]
Filter: 100%|██████████| 2339/2339 [00:02<00:00, 1110.24 examples/s]
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, dataset_text_field, max_seq_length, packing. Will not be supported from version '1.0.0'. Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:158: UserWarning: You passed `model_init_kwargs` to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:185: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
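The FutureWarning above is about passing `max_seq_length`, `packing`, `dataset_text_field`, and `model_init_kwargs` directly to SFTTrainer. A sketch of the form the warning asks for, with those values moved into SFTConfig; the dataset and output path are stand-ins, and `processing_class` is the replacement for the deprecated `tokenizer` argument in recent releases (older TRL versions may only accept `tokenizer`):

from datasets import Dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")
train_ds = Dataset.from_dict({"text": ["<|im_start|>user\nhi<|im_end|>"]})  # stub dataset

args = SFTConfig(
    output_dir="outputs/sft-demo",                  # stand-in path
    max_seq_length=512,                             # instead of SFTTrainer(..., max_seq_length=512)
    dataset_text_field="text",                      # instead of a positional argument
    model_init_kwargs={"torch_dtype": "bfloat16"},  # instead of a positional argument
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-8B-Base",  # a model id string; the trainer instantiates it
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,
)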
[INFO|configuration_utils.py:693] 2026-04-14 19:27:54,084 >> loading configuration file config.json from cache at /scratch/qu.yang1/hf/hub/models--Qwen--Qwen3-8B-Base/snapshots/49e3418fbbbca6ecbdf9608b4d22e5a407081db4/config.json
[INFO|configuration_utils.py:765] 2026-04-14 19:27:54,085 >> Model config Qwen3Config {
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 12288,
  "max_position_embeddings": 32768,
  "max_window_layers": 36,
  "model_type": "qwen3",
  "num_attention_heads": 32,
  "num_hidden_layers": 36,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 151936
}
Fetching 5 files: 0%| | 0/5 [00:00<?, ?it/s]
>> loading weights file model.safetensors from cache at /scratch/qu.yang1/hf/hub/models--Qwen--Qwen3-8B-Base/snapshots/49e3418fbbbca6ecbdf9608b4d22e5a407081db4/model.safetensors.index.json
Fetching 5 files: 20%|██ | 1/5 [01:41<06:46, 101.75s/it]
Fetching 5 files: 100%|██████████| 5/5 [01:41<00:00, 20.33s/it]
[INFO|modeling_utils.py:2167] 2026-04-14 19:29:36,215 >> Instantiating Qwen3ForCausalLM model under default dtype torch.bfloat16.
[WARNING|logging.py:328] 2026-04-14 19:29:36,218 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
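The Flash Attention warning above is expected here: the checkpoint is first materialized on CPU and only then sharded onto the GPUs by FSDP. For reference, a minimal sketch of loading the same checkpoint standalone (the explicit device move is illustrative, not the trainer's internal logic):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model.to("cuda")  # move to GPU after init, as the warning suggests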
[INFO|configuration_utils.py:1142] 2026-04-14 19:29:36,219 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "use_cache": false
}
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:00<00:00, 6.84it/s]
>> All model checkpoint weights were used when initializing Qwen3ForCausalLM.
[INFO|modeling_utils.py:4934] 2026-04-14 19:29:37,034 >> All the weights of Qwen3ForCausalLM were initialized from the model checkpoint at Qwen/Qwen3-8B-Base. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen3ForCausalLM for predictions without further training.
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:195: UserWarning: You passed a `packing` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:283: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:321: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
[INFO|configuration_utils.py:1097] 2026-04-14 19:29:37,289 >> loading configuration file generation_config.json from cache at /scratch/qu.yang1/hf/hub/models--Qwen--Qwen3-8B-Base/snapshots/49e3418fbbbca6ecbdf9608b4d22e5a407081db4/generation_config.json
[INFO|configuration_utils.py:1142] 2026-04-14 19:29:37,290 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "max_new_tokens": 2048
}
2026-04-14 19:29:37 - INFO - datasets.builder - Using custom data configuration default-4e6a8a2c525b008a
2026-04-14 19:29:37 - INFO - datasets.info - Loading Dataset Infos from /home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/datasets/packaged_modules/generator
2026-04-14 19:29:37 - INFO - datasets.builder - Generating dataset generator (/scratch/qu.yang1/hf/datasets/generator/default-4e6a8a2c525b008a/0.0.0)
2026-04-14 19:29:37 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /scratch/qu.yang1/hf/datasets/generator/default-4e6a8a2c525b008a/0.0.0...
2026-04-14 19:29:37 - INFO - datasets.builder - Generating train split
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 17195 examples [00:16, 1015.26 examples/s]
2026-04-14 19:29:54 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
2026-04-14 19:29:54 - INFO - datasets.builder - Dataset generator downloaded and prepared to /scratch/qu.yang1/hf/datasets/generator/default-4e6a8a2c525b008a/0.0.0. Subsequent calls will reuse this data.
2026-04-14 19:29:54 - INFO - datasets.builder - Using custom data configuration default-281ae67d28ccbe3f
2026-04-14 19:29:54 - INFO - datasets.info - Loading Dataset Infos from /home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/datasets/packaged_modules/generator
2026-04-14 19:29:54 - INFO - datasets.builder - Generating dataset generator (/scratch/qu.yang1/hf/datasets/generator/default-281ae67d28ccbe3f/0.0.0)
2026-04-14 19:29:54 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /scratch/qu.yang1/hf/datasets/generator/default-281ae67d28ccbe3f/0.0.0...
2026-04-14 19:29:54 - INFO - datasets.builder - Generating train split
Generating train split: 931 examples [00:01, 707.59 examples/s]
2026-04-14 19:29:56 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
2026-04-14 19:29:56 - INFO - datasets.builder - Dataset generator downloaded and prepared to /scratch/qu.yang1/hf/datasets/generator/default-281ae67d28ccbe3f/0.0.0. Subsequent calls will reuse this data.
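These two "generator" builds are consistent with TRL's packed-dataset path: the `packing` argument passed to SFTTrainer (see the override warning above) would concatenate the 43,598 filtered conversations into fixed 512-token blocks, yielding the 17,195 train and 931 eval examples generated here. A rough sanity check of that reading; the token totals below are back-derived, not logged values:

# Assumption: each generated example is one max_seq_length block.
max_seq_length = 512
packed_train_examples = 17_195   # "Generating train split: 17195 examples"
approx_train_tokens = packed_train_examples * max_seq_length
print(f"~{approx_train_tokens / 1e6:.1f}M training tokens")  # ~8.8M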
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:412: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
[INFO|trainer.py:748] 2026-04-14 19:29:58,464 >> Using auto half precision backend
2026-04-14 19:29:58 - INFO - __main__ - *** Train ***
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in Qwen3ForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight.
  warnings.warn(
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in Qwen3DecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, self_attn.q_norm.weight, self_attn.k_norm.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight.
  warnings.warn(
/home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
  warnings.warn(
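The upcast warnings come from Accelerate preparing the bf16 checkpoint for FSDP with mixed precision: master weights are kept in fp32 (hence the upcast) while compute and communication run in bf16. A sketch of the underlying policy in raw PyTorch FSDP terms; this only builds the policy object, and the commented wrap line shows roughly where it would be used inside a launched run (the run's actual accelerate/FSDP config is not in the log):

import torch
from torch.distributed.fsdp import MixedPrecision

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # gathered params used for compute
    reduce_dtype=torch.bfloat16,  # gradient all-reduce precision
    buffer_dtype=torch.bfloat16,
)
# Inside an `accelerate launch` job, roughly:
#   model = FSDP(model, mixed_precision=bf16_policy, device_id=local_rank)
print(bf16_policy)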
[INFO|trainer.py:2414] 2026-04-14 19:30:35,921 >> ***** Running training *****
[INFO|trainer.py:2415] 2026-04-14 19:30:35,921 >>   Num examples = 17,195
[INFO|trainer.py:2416] 2026-04-14 19:30:35,922 >>   Num Epochs = 1
[INFO|trainer.py:2417] 2026-04-14 19:30:35,922 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:2420] 2026-04-14 19:30:35,922 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2421] 2026-04-14 19:30:35,922 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:2422] 2026-04-14 19:30:35,922 >>   Total optimization steps = 135
[INFO|trainer.py:2423] 2026-04-14 19:30:35,923 >>   Number of trainable parameters = 1,023,841,920
[INFO|integration_utils.py:831] 2026-04-14 19:30:35,925 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: feng-cheng (feng-cheng-northeastern-university). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.26.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in /scratch/qu.yang1/wandb/wandb/run-20260414_193038-mlwlhzba
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981
wandb: ⭐️ View project at https://wandb.ai/feng-cheng-northeastern-university/huggingface
wandb: 🚀 View run at https://wandb.ai/feng-cheng-northeastern-university/huggingface/runs/mlwlhzba
0%| | 0/135 [00:00<?, ?it/s]
[INFO|trainer.py:4307] 2026-04-14 19:32:56,372 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-14 19:32:56,372 >>   Num examples = 931
[INFO|trainer.py:4312] 2026-04-14 19:32:56,372 >>   Batch size = 16
0%| | 0/8 [00:00<?, ?it/s]
>> Saving model checkpoint to /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/checkpoint-135
[INFO|configuration_utils.py:419] 2026-04-14 19:34:05,349 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/checkpoint-135/config.json
[INFO|configuration_utils.py:911] 2026-04-14 19:34:05,353 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/checkpoint-135/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-14 19:34:58,174 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/checkpoint-135/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-14 19:34:58,183 >> tokenizer config file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/checkpoint-135/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-14 19:34:58,190 >> Special tokens file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/checkpoint-135/special_tokens_map.json
[INFO|trainer.py:2681] 2026-04-14 19:38:57,475 >> Training completed. Do not forget to share your model on huggingface.co/models =)
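The training header above is internally consistent: 16 samples per device across 8 ranks with no gradient accumulation gives the global batch of 128, one epoch over 17,195 packed examples rounds up to 135 optimization steps, and under FSDP full sharding each rank owns one eighth of the weights. Pure arithmetic from the logged numbers, not trainer code:

import math

per_device_batch = 16
world_size = 8
grad_accum = 1
packed_examples = 17_195
per_rank_params = 1_023_841_920

global_batch = per_device_batch * world_size * grad_accum
print(global_batch)                               # 128, as logged
print(math.ceil(packed_examples / global_batch))  # 135 optimization steps
print(per_rank_params * world_size)               # 8,190,735,360 = full Qwen3-8B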
{'train_runtime': 501.5523, 'train_samples_per_second': 34.284, 'train_steps_per_second': 0.269, 'train_loss': 1.7195094338169805, 'epoch': 1.0}
100%|██████████| 135/135 [08:14<00:00, 3.66s/it]
***** train metrics *****
  epoch                    =        1.0
  total_flos               = 46771303GF
  train_loss               =     1.7195
  train_runtime            = 0:08:21.55
  train_samples            =      43598
  train_samples_per_second =     34.284
  train_steps_per_second   =      0.269
2026-04-14 19:38:57 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-14 19:39:19,837 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/config.json
[INFO|configuration_utils.py:911] 2026-04-14 19:39:19,841 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-14 19:40:19,850 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-14 19:40:19,858 >> tokenizer config file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-14 19:40:19,867 >> Special tokens file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/special_tokens_map.json
2026-04-14 19:40:20 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981
2026-04-14 19:40:20 - INFO - __main__ - Saved validated HF-compatible model artifacts to /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981
[INFO|modelcard.py:450] 2026-04-14 19:40:20,569 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf', 'config': 'default', 'split': 'train', 'args': 'default'}}
[INFO|configuration_utils.py:419] 2026-04-14 19:40:20,592 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981/config.json
2026-04-14 19:40:20 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-14 19:40:20,597 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-14 19:40:20,597 >>   Num examples = 931
[INFO|trainer.py:4312] 2026-04-14 19:40:20,597 >>   Batch size = 16
0%| | 0/8 [00:00<?, ?it/s]
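The saved artifacts are a standard sharded safetensors checkpoint plus tokenizer files, so they reload with stock transformers. A minimal sketch, using the output directory logged above; the prompt and generation settings are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/scratch/qu.yang1/outputs/qwen3-8b-base-sft-hh-helpful-8xh200-20260414-192602-232981"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "How can I get blood out of cotton?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))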