bartleby-qwen3-1.7b_dpo/logs.log

==========================================
DPO From Existing Checkpoint
==========================================
Source: staeiou/bartleby-qwen3-1.7b_v5/
Output: staeiou/bartleby-qwen3-1.7b_dpo
DPO Data: data/training_data_dpo.jsonl
Train: bs=2 grad_accum=16 lr=5e-7 epochs=1 beta=0.1

→ No local vLLM detected, proceeding with DPO
→ Starting DPO-only fine-tuning...
LD_LIBRARY_PATH="/opt/venv/lib/python3.10/site-packages/nvidia/cu13/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
FULL_FINETUNING=1 \
RUN_SFT=0 \
BASE_MODEL=staeiou/bartleby-qwen3-1.7b_v5/ \
MODEL_DIR=staeiou/bartleby-qwen3-1.7b_dpo \
DPO_DATA=data/training_data_dpo.jsonl \
MAX_SEQ_LENGTH=1024 \
VAL_FRACTION=0.05 \
DPO_BETA=0.1 \
DPO_NUM_TRAIN_EPOCHS=1 \
DPO_LEARNING_RATE=5e-7 \
DPO_LR_SCHEDULER_TYPE=cosine \
DPO_WARMUP_RATIO=0.03 \
DPO_WEIGHT_DECAY=0.05 \
DPO_MAX_GRAD_NORM=1.0 \
DPO_PER_DEVICE_TRAIN_BATCH_SIZE=2 \
DPO_GRADIENT_ACCUMULATION_STEPS=16 \
DPO_EVAL_STEPS=100 \
DPO_SAVE_STEPS=100 \
DPO_LOGGING_STEPS=10 \
DPO_MAX_LENGTH=1024 \
DPO_MAX_PROMPT_LENGTH=512 \
DPO_MAX_COMPLETION_LENGTH=512 \
python finetune.py
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0+cu129             Please see https://github.com/pytorch/ao/issues/2919 for more info
/workspace/bartleby-1b/finetune.py:129: UserWarning: WARNING: Unsloth should be imported before [trl, transformers, peft] to ensure all optimizations are applied. Your code may run slower or encounter memory issues without these optimizations.

Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Your Flash Attention 2 installation seems to be broken. Using Xformers instead. No performance changes will be seen.
🦥 Unsloth Zoo will now patch everything to make training faster!
================================================================================
BARTLEBY FULL FINETUNE — 16-BIT — AUTO TEMPLATE+MASK DETECT — LAST-ANSWER MULTITURN
================================================================================
MODEL      : staeiou/bartleby-qwen3-1.7b_v5/
DATA       : data/training_data_v2_filtered.jsonl
GOLD       : data/gold_seed_training_data.jsonl
SFT_OUTPUT : staeiou/bartleby-qwen3-1.7b_dpo
OUTPUT     : staeiou/bartleby-qwen3-1.7b_dpo
CACHE_DIR  : /workspace/.cache/huggingface/datasets
SEQ        : 1024
PACKING    : False
LOAD_4BIT  : False (forced 16-bit base)
FULL_FT    : True
RUN_SFT    : False
REMOTE_CODE: False
FAMILY     : qwen
TRL_COMPAT : ConstantLengthDataset patched=True
TRL_DPO    : mergekit_detection_patched=True
TRL_DPO2   : llm_blender_detection_patched=True
TRL_DPO3   : weave_detection_patched=True
ADAPTERS   : disabled
TRAIN      : bs=4 grad_accum=4 eff_bs=16
EPOCHS     : 4.0
LR         : 0.0002 scheduler=cosine warmup=0.05 weight_decay=0.01 max_grad_norm=1.0
MULTITURN  : num=0 max_turns=5 (only last assistant supervised)
GOLD_REPEAT: 5
DPO        : enabled=True (using DPO dataset data/training_data_dpo.jsonl)
DPO_TRAIN  : bs=2 grad_accum=16 lr=5e-07 epochs=1.0 beta=0.1
DPO_SEQ    : max_length=1024 prompt=512 completion=512
GPU        : Single GPU (CUDA_VISIBLE_DEVICES=0)
================================================================================

[1/1] Skipping SFT and preparing DPO-only run from existing checkpoint...
⚠️  Qwen chat template surgery applied: disabled automatic <think> tag insertion

[8/9] Loading DPO dataset...
Loaded DPO pairs: 45
DPO split -> train=42 val=3

[9/9] Running DPO...
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
Loading policy model for DPO...
==((====))==  Unsloth 2026.3.5: Fast Qwen3 patching. Transformers: 5.3.0. vLLM: 0.13.0.
   \\   /|    NVIDIA RTX 5000 Ada Generation. Num GPUs = 1. Max memory: 31.475 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.
To enable float32 training, use `float32_mixed_precision = True` during FastLanguageModel.from_pretrained
Loading weights: 100%|█████████████████████████| 310/310 [00:00<00:00, 1009.50it/s]
Loading reference model for DPO...
==((====))==  Unsloth 2026.3.5: Fast Qwen3 patching. Transformers: 5.3.0. vLLM: 0.13.0.
   \\   /|    NVIDIA RTX 5000 Ada Generation. Num GPUs = 1. Max memory: 31.475 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.
To enable float32 training, use `float32_mixed_precision = True` during FastLanguageModel.from_pretrained
Loading weights: 100%|██████████████████████████| 310/310 [00:00<00:00, 942.10it/s]
[trl.trainer.dpo_trainer|WARNING]You passed `model_init_kwargs` to the `DPOConfig`, but your model is already instantiated. The `model_init_kwargs` will be ignored.
[trl.trainer.dpo_trainer|WARNING]You passed `ref_model_init_kwargs` to the `DPOConfig`, but your model is already instantiated. The `ref_model_init_kwargs` will be ignored.
num_proc must be <= 42. Reducing num_proc to 42 for dataset of size 42.
[datasets.arrow_dataset|WARNING]num_proc must be <= 42. Reducing num_proc to 42 for dataset of size 42.
Extracting prompt in train dataset (num_proc=42): 100%|█| 42/42 [00:01<00:00, 25.91
num_proc must be <= 42. Reducing num_proc to 42 for dataset of size 42.
[datasets.arrow_dataset|WARNING]num_proc must be <= 42. Reducing num_proc to 42 for dataset of size 42.
Applying chat template to train dataset (num_proc=42):  31%|▎| 13/42 [00:17<00:36,
Applying chat template to train dataset (num_proc=42): 100%|█| 42/42 [00:54<00:00,
num_proc must be <= 42. Reducing num_proc to 42 for dataset of size 42.
[datasets.arrow_dataset|WARNING]num_proc must be <= 42. Reducing num_proc to 42 for dataset of size 42.
Tokenizing train dataset (num_proc=42): 100%|█| 42/42 [00:54<00:00,  1.30s/ example
num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
[datasets.arrow_dataset|WARNING]num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
Extracting prompt in eval dataset (num_proc=3): 100%|█| 3/3 [00:00<00:00,  7.59 exa
num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
[datasets.arrow_dataset|WARNING]num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
Applying chat template to eval dataset (num_proc=3): 100%|█| 3/3 [00:04<00:00,  1.5
num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
[datasets.arrow_dataset|WARNING]num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
Tokenizing eval dataset (num_proc=3): 100%|███| 3/3 [00:04<00:00,  1.49s/ examples]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 42 | Num Epochs = 1 | Total steps = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 16 x 1) = 32
 "-____-"     Trainable parameters = 1,720,574,976 of 1,720,574,976 (100.00% trained)
Writing model shards: 100%|██████████████████████████| 1/1 [00:04<00:00,  4.71s/it]
{'train_runtime': '41.2', 'train_samples_per_second': '1.019', 'train_steps_per_second': '0.049', 'train_loss': '0.6924', 'epoch': '1'}
100%|████████████████████████████████████████████████| 2/2 [00:41<00:00, 20.60s/it]
Writing model shards: 100%|██████████████████████████| 1/1 [00:05<00:00,  5.85s/it]
Done.
✓ DPO-only fine-tuning complete!
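
Note on the numbers above: the following is a minimal sketch, not part of finetune.py, that re-derives the split sizes, the step count, and a loss baseline from values printed in the log. It rests on three assumptions that the log does not confirm: that the validation split rounds up (ceil), that a trailing partial accumulation window still counts as one optimizer step, and that the run used the standard sigmoid DPO loss, which equals ln 2 when the policy and reference models give identical log-probabilities.

```python
import math

# Values copied from the log above.
total_pairs = 45
val_fraction = 0.05
per_device_batch_size = 2
grad_accum_steps = 16
num_gpus = 1
reported_train_loss = 0.6924

# Assumption: the validation split is rounded up, the rest goes to training.
val_size = math.ceil(total_pairs * val_fraction)
train_size = total_pairs - val_size

# Examples consumed per optimizer step.
effective_batch_size = per_device_batch_size * grad_accum_steps * num_gpus

# Assumption: a final partial accumulation window still triggers an optimizer step.
optimizer_steps = math.ceil(train_size / effective_batch_size)

# Sigmoid DPO loss when policy and reference agree: -log(sigmoid(0)) = ln 2.
baseline_loss = math.log(2)
gap = baseline_loss - reported_train_loss

print(f"train={train_size} val={val_size}")
print(f"effective batch size={effective_batch_size} optimizer steps={optimizer_steps}")
print(f"ln 2 baseline={baseline_loss:.4f} reported loss={reported_train_loss} gap={gap:.4f}")
```

Under these assumptions the derived split (42/3), total batch size (32), and step count (2) match the log. The reported train loss sits within about 0.001 of ln 2, which suggests that two optimizer steps at lr=5e-7 moved the policy very little away from the reference model.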
Project initialized; model provided by the ModelHub XC community. Model: staeiou/bartleby-qwen3-1.7b_dpo Source: Original Platform 2026-04-29 12:28:49 +08:00