tw-data-train_final_replace…/slurm/9198833.0.err
ModelHub XC 51ac6fb2f7 Initialize project; model provided by the ModelHub XC community
Model: ligeng-dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume
Source: Original Platform
2026-05-01 08:51:19 +08:00


./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
./slurm_setup.sh: line 48: GLOBAL_TRAIN_BATCH_SIZE / NNODES / GPUS_PER_NODE / GRADIENT_ACCUMULATION_STEPS: division by 0 (error token is "GRADIENT_ACCUMULATION_STEPS")
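Note on the error above: in shell arithmetic an unset or empty variable evaluates to 0, so this message suggests GRADIENT_ACCUMULATION_STEPS was unset or zero when line 48 of slurm_setup.sh ran. It appears eight times, which would fit one copy per node. The script itself is not shown in this log, so the following is only a minimal sketch of one way to guard the division. The function name and the sample values are hypothetical, and falling back to 1 (no accumulation) is an assumption rather than the script's actual behavior.

```shell
# Sketch only: per_device_batch_size and the sample numbers are hypothetical.
per_device_batch_size() (
  global=$1; nnodes=$2; gpus=$3; accum=${4:-}
  # An empty or non-numeric value is treated as "not configured".
  case "$accum" in
    ''|*[!0-9]*) accum=1 ;;
  esac
  # Zero would trigger the same "division by 0" failure, so fall back to 1.
  if [ "$accum" -eq 0 ]; then accum=1; fi
  echo $(( global / nnodes / gpus / accum ))
)

per_device_batch_size 1024 8 8 ""   # prints 16
per_device_batch_size 1024 8 8 2    # prints 8
```

Failing fast with an explicit error message instead of defaulting to 1 may be preferable when a silent change in effective batch size would be harmful.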
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,935 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,936 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,935 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,936 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:57,974 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:57,974 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,016 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,016 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,044 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,044 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,044 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,044 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,045 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,048 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,048 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,068 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,068 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
2026-04-15 11:23:58,087 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,087 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,087 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,087 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,087 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,087 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,087 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-04-15 11:23:58,088 - INFO - Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2026-04-15 11:23:58,088 - INFO - Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
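Note on the NumExpr messages above: they are informational and name the NUMEXPR_MAX_THREADS environment variable as the setting to check. Exporting it before the Python processes start should stop both notes. The snippet below is a sketch, and the value 8 is an assumption that simply matches the "safe limit" the log reports. Where the export belongs in the job script is not visible in this log.

```shell
# Sketch: set the thread cap before launching Python so NumExpr sees it at import.
# Keeps any value already exported by the caller; 8 is an assumed default.
export NUMEXPR_MAX_THREADS="${NUMEXPR_MAX_THREADS:-8}"
echo "NUMEXPR_MAX_THREADS=${NUMEXPR_MAX_THREADS}"
```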
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
/home/ligengz/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
2026-04-15 11:23:59,077 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,076 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,078 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,144 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,145 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,150 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,151 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,210 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:23:59,211 - INFO - PyTorch version 2.6.0 available.
2026-04-15 11:24:36,146 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,153 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,238 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,239 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,257 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,295 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,299 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,316 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,317 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,319 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,319 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,323 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,323 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,338 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,353 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,356 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,459 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,459 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,475 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,476 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,478 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,479 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:36,496 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-3534', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,375 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,551 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,585 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,599 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,604 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,605 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,624 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,624 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,624 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-38_batch-block1-3833', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,776 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,778 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,778 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,780 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,781 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,782 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:40,782 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0069', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,263 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,270 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,270 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,271 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,271 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,301 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,304 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,305 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,306 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 3, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,306 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,327 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0070', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 1, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,329 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,345 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,348 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:42,351 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,152 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,157 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,172 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,174 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3859', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 4, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,182 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,220 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 7, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,221 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-3273', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 6, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,222 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,235 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-34_batch-block1-0075', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 5, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
2026-04-15 11:24:47,245 - INFO - Training config: {'model_name': 'Qwen/Qwen3-8B', 'template_name': 'qwen', 'block_size': 40960, 'wandb_project': 'ThreadWeaver', 'train_file_path': './data-train_final_replaced_from_classified-fix-format', 'dagger': False, 'attn_implementation': 'flex_attention', 'output_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'overwrite_output_dir': False, 'do_train': False, 'do_eval': False, 'do_predict': False, 'eval_strategy': <IntervalStrategy.NO: 'no'>, 'prediction_loss_only': False, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 2, 'eval_accumulation_steps': None, 'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 1e-05, 'weight_decay': 0.0001, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 8.0, 'max_steps': -1, 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>, 'lr_scheduler_kwargs': {}, 'warmup_ratio': 0.05, 'warmup_steps': 0, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/runs/Apr15_11-24-35_batch-block1-0086', 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>, 'logging_first_step': False, 'logging_steps': 1.0, 'logging_nan_inf_filter': True, 'save_strategy': <SaveStrategy.STEPS: 'steps'>, 'save_steps': 20, 'save_total_limit': 2, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 2, 'ddp_backend': 
None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': None, 'dataloader_num_workers': 0, 'dataloader_prefetch_factor': None, 'past_index': -1, 'run_name': 'runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': False, 'metric_for_best_model': None, 'greater_is_better': None, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'tp_size': 0, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, 'deepspeed': 'configs/deepspeed_zero3.json', 'label_smoothing_factor': 0.0, 'optim': <OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': <HubStrategy.EVERY_SAVE: 'every_save'>, 'hub_token': None, 'hub_private_repo': None, 'hub_always_push': False, 'gradient_checkpointing': True, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': None, '_n_gpu': 1, 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 
'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': True, 'eval_use_gather_object': False, 'average_tokens_across_devices': True, 'model_init_kwargs': None, 'chat_template_path': None, 'dataset_text_field': 'qwen_text', 'dataset_kwargs': None, 'dataset_num_proc': None, 'eos_token': None, 'pad_token': None, 'max_length': 1024, 'packing': False, 'packing_strategy': 'ffd', 'padding_free': False, 'pad_to_multiple_of': None, 'eval_packing': None, 'completion_only_loss': None, 'assistant_only_loss': False, 'activation_offloading': False, 'max_seq_length': None}
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.86s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.86s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.86s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.86s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.86s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.86s/it]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.93s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.93s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.93s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.93s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.94s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.93s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.93s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.95s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.86s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:03<00:13, 3.30s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.73s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.73s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.73s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.73s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.73s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.73s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.73s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.86s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.64s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.64s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.64s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.93s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.92s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.75s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 20%|██ | 1/5 [00:02<00:11, 2.96s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 40%|████ | 2/5 [00:07<00:11, 3.76s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.66s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.64s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.64s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.64s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.64s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:11<00:07, 3.73s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.40s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.40s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.40s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.40s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.40s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.40s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 60%|██████ | 3/5 [00:10<00:07, 3.65s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.76s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:13<00:03, 3.41s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.76s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.76s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.76s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.76s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.76s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.76s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.78s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.77s/it]
Loading checkpoint shards: 80%|████████ | 4/5 [00:14<00:03, 3.47s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:14<00:00, 2.43s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:14<00:00, 2.96s/it]
Tokenizing train dataset: 0%| | 0/964 [00:00<?, ? examples/s]
Tokenizing train dataset: 0%| | 4/964 [00:00<00:42, 22.42 examples/s]
Tokenizing train dataset: 1%| | 9/964 [00:00<00:34, 28.04 examples/s]
Tokenizing train dataset: 1%|▏ | 13/964 [00:00<00:31, 29.78 examples/s]
Tokenizing train dataset: 2%|▏ | 17/964 [00:00<00:29, 32.11 examples/s]
Tokenizing train dataset: 2%|▏ | 22/964 [00:00<00:28, 32.50 examples/s]
Tokenizing train dataset: 3%|▎ | 26/964 [00:00<00:30, 31.11 examples/s]
Tokenizing train dataset: 3%|▎ | 30/964 [00:00<00:29, 31.57 examples/s]
Tokenizing train dataset: 4%|▎ | 34/964 [00:01<00:33, 27.70 examples/s]
Tokenizing train dataset: 4%|▍ | 38/964 [00:01<00:35, 26.21 examples/s]
Tokenizing train dataset: 4%|▍ | 42/964 [00:01<00:33, 27.40 examples/s]
Tokenizing train dataset: 5%|▍ | 46/964 [00:01<00:31, 29.59 examples/s]
Tokenizing train dataset: 5%|▌ | 51/964 [00:01<00:30, 29.83 examples/s]
Tokenizing train dataset: 6%|▌ | 55/964 [00:01<00:31, 29.26 examples/s]
Tokenizing train dataset: 6%|▌ | 58/964 [00:02<00:33, 27.22 examples/s]
Tokenizing train dataset: 6%|▋ | 62/964 [00:02<00:30, 29.21 examples/s]
Tokenizing train dataset: 7%|▋ | 66/964 [00:02<00:30, 29.11 examples/s]
Tokenizing train dataset: 7%|▋ | 70/964 [00:02<00:29, 30.13 examples/s]
Tokenizing train dataset: 8%|▊ | 74/964 [00:02<00:27, 31.82 examples/s]
Tokenizing train dataset: 8%|▊ | 79/964 [00:02<00:26, 33.10 examples/s]
Tokenizing train dataset: 9%|▊ | 83/964 [00:02<00:26, 32.74 examples/s]
Tokenizing train dataset: 9%|▉ | 87/964 [00:02<00:27, 31.87 examples/s]
Tokenizing train dataset: 10%|▉ | 92/964 [00:03<00:30, 28.61 examples/s]
Tokenizing train dataset: 10%|█ | 97/964 [00:03<00:28, 30.19 examples/s]
Tokenizing train dataset: 11%|█ | 102/964 [00:03<00:31, 27.09 examples/s]
Tokenizing train dataset: 11%|█ | 107/964 [00:03<00:29, 29.35 examples/s]
Tokenizing train dataset: 12%|█▏ | 112/964 [00:03<00:28, 29.97 examples/s]
Tokenizing train dataset: 12%|█▏ | 117/964 [00:03<00:27, 30.33 examples/s]
Tokenizing train dataset: 13%|█▎ | 121/964 [00:04<00:28, 29.93 examples/s]
Tokenizing train dataset: 13%|█▎ | 126/964 [00:04<00:26, 31.63 examples/s]
Tokenizing train dataset: 14%|█▎ | 131/964 [00:04<00:28, 29.10 examples/s]
Tokenizing train dataset: 14%|█▍ | 135/964 [00:04<00:30, 27.33 examples/s]
Tokenizing train dataset: 14%|█▍ | 139/964 [00:04<00:31, 26.44 examples/s]
Tokenizing train dataset: 15%|█▍ | 143/964 [00:04<00:29, 27.97 examples/s]
Tokenizing train dataset: 15%|█▌ | 147/964 [00:04<00:27, 29.49 examples/s]
Tokenizing train dataset: 16%|█▌ | 151/964 [00:05<00:27, 29.71 examples/s]
Tokenizing train dataset: 16%|█▌ | 155/964 [00:05<00:26, 31.11 examples/s]
Tokenizing train dataset: 17%|█▋ | 160/964 [00:05<00:23, 33.63 examples/s]
Tokenizing train dataset: 17%|█▋ | 165/964 [00:05<00:26, 29.73 examples/s]
Tokenizing train dataset: 18%|█▊ | 170/964 [00:05<00:25, 30.90 examples/s]
Tokenizing train dataset: 18%|█▊ | 174/964 [00:05<00:26, 29.49 examples/s]
Tokenizing train dataset: 18%|█▊ | 178/964 [00:06<00:27, 28.69 examples/s]
Tokenizing train dataset: 19%|█▉ | 183/964 [00:06<00:27, 28.62 examples/s]
Tokenizing train dataset: 19%|█▉ | 187/964 [00:06<00:29, 26.04 examples/s]
Tokenizing train dataset: 20%|█▉ | 190/964 [00:06<00:31, 24.68 examples/s]
Tokenizing train dataset: 20%|██ | 195/964 [00:06<00:29, 26.05 examples/s]
Tokenizing train dataset: 21%|██ | 199/964 [00:06<00:26, 28.72 examples/s]
Tokenizing train dataset: 21%|██ | 203/964 [00:06<00:26, 28.53 examples/s]
Tokenizing train dataset: 21%|██▏ | 207/964 [00:07<00:27, 27.84 examples/s]
Tokenizing train dataset: 22%|██▏ | 212/964 [00:07<00:26, 28.90 examples/s]
Tokenizing train dataset: 22%|██▏ | 215/964 [00:07<00:26, 28.05 examples/s]
Tokenizing train dataset: 23%|██▎ | 221/964 [00:07<00:24, 30.93 examples/s]
Tokenizing train dataset: 23%|██▎ | 225/964 [00:07<00:24, 29.81 examples/s]
Tokenizing train dataset: 24%|██▍ | 229/964 [00:07<00:24, 29.50 examples/s]
Tokenizing train dataset: 24%|██▍ | 234/964 [00:08<00:24, 29.32 examples/s]
Tokenizing train dataset: 25%|██▍ | 237/964 [00:08<00:26, 27.43 examples/s]
Tokenizing train dataset: 25%|██▌ | 241/964 [00:08<00:28, 25.27 examples/s]
Tokenizing train dataset: 25%|██▌ | 245/964 [00:08<00:26, 27.13 examples/s]
Tokenizing train dataset: 26%|██▌ | 250/964 [00:08<00:23, 30.64 examples/s]
Tokenizing train dataset: 26%|██▋ | 254/964 [00:08<00:22, 31.00 examples/s]
Tokenizing train dataset: 27%|██▋ | 259/964 [00:08<00:20, 34.12 examples/s]
Tokenizing train dataset: 27%|██▋ | 263/964 [00:08<00:23, 29.83 examples/s]
Tokenizing train dataset: 28%|██▊ | 267/964 [00:09<00:22, 30.61 examples/s]
Tokenizing train dataset: 28%|██▊ | 271/964 [00:09<00:22, 30.41 examples/s]
Tokenizing train dataset: 29%|██▊ | 275/964 [00:09<00:25, 26.94 examples/s]
Tokenizing train dataset: 29%|██▉ | 279/964 [00:09<00:23, 28.69 examples/s]
Tokenizing train dataset: 29%|██▉ | 284/964 [00:09<00:21, 31.90 examples/s]
Tokenizing train dataset: 30%|██▉ | 288/964 [00:09<00:22, 30.56 examples/s]
Tokenizing train dataset: 30%|███ | 292/964 [00:09<00:21, 31.67 examples/s]
Tokenizing train dataset: 31%|███ | 296/964 [00:10<00:23, 29.02 examples/s]
Tokenizing train dataset: 31%|███ | 300/964 [00:10<00:21, 30.49 examples/s]
Tokenizing train dataset: 32%|███▏ | 304/964 [00:10<00:21, 30.54 examples/s]
Tokenizing train dataset: 32%|███▏ | 309/964 [00:10<00:20, 31.32 examples/s]
Tokenizing train dataset: 33%|███▎ | 315/964 [00:10<00:19, 32.55 examples/s]
Tokenizing train dataset: 33%|███▎ | 321/964 [00:10<00:18, 34.51 examples/s]
Tokenizing train dataset: 34%|███▎ | 325/964 [00:10<00:19, 33.02 examples/s]
Tokenizing train dataset: 34%|███▍ | 330/964 [00:11<00:18, 34.35 examples/s]
Tokenizing train dataset: 35%|███▍ | 335/964 [00:11<00:19, 31.81 examples/s]
Tokenizing train dataset: 35%|███▌ | 339/964 [00:11<00:20, 30.50 examples/s]
Tokenizing train dataset: 36%|███▌ | 343/964 [00:11<00:22, 27.29 examples/s]
Tokenizing train dataset: 36%|███▌ | 348/964 [00:11<00:19, 31.17 examples/s]
Tokenizing train dataset: 37%|███▋ | 354/964 [00:11<00:17, 34.81 examples/s]
Tokenizing train dataset: 37%|███▋ | 358/964 [00:12<00:18, 33.07 examples/s]
Tokenizing train dataset: 38%|███▊ | 363/964 [00:12<00:16, 35.54 examples/s]
Tokenizing train dataset: 38%|███▊ | 367/964 [00:12<00:18, 32.12 examples/s]
Tokenizing train dataset: 38%|███▊ | 371/964 [00:12<00:18, 31.26 examples/s]
Tokenizing train dataset: 39%|███▉ | 375/964 [00:12<00:19, 29.99 examples/s]
Tokenizing train dataset: 39%|███▉ | 379/964 [00:12<00:20, 28.32 examples/s]
Tokenizing train dataset: 40%|███▉ | 383/964 [00:12<00:19, 29.15 examples/s]
Tokenizing train dataset: 40%|████ | 386/964 [00:12<00:19, 28.91 examples/s]
Tokenizing train dataset: 40%|████ | 390/964 [00:13<00:19, 29.28 examples/s]
Tokenizing train dataset: 41%|████ | 394/964 [00:13<00:19, 29.73 examples/s]
Tokenizing train dataset: 41%|████▏ | 398/964 [00:13<00:20, 28.17 examples/s]
Tokenizing train dataset: 42%|████▏ | 401/964 [00:13<00:22, 25.40 examples/s]
Tokenizing train dataset: 42%|████▏ | 405/964 [00:13<00:21, 25.96 examples/s]
Tokenizing train dataset: 42%|████▏ | 408/964 [00:13<00:22, 24.69 examples/s]
Tokenizing train dataset: 43%|████▎ | 411/964 [00:13<00:22, 24.37 examples/s]
Tokenizing train dataset: 43%|████▎ | 415/964 [00:14<00:21, 25.47 examples/s]
Tokenizing train dataset: 43%|████▎ | 419/964 [00:14<00:19, 27.84 examples/s]
Tokenizing train dataset: 44%|████▍ | 423/964 [00:14<00:20, 26.60 examples/s]
Tokenizing train dataset: 44%|████▍ | 428/964 [00:14<00:18, 28.85 examples/s]
Tokenizing train dataset: 45%|████▍ | 432/964 [00:14<00:17, 30.20 examples/s]
Tokenizing train dataset: 45%|████▌ | 436/964 [00:14<00:17, 29.38 examples/s]
Tokenizing train dataset: 46%|████▌ | 441/964 [00:14<00:16, 31.55 examples/s]
Tokenizing train dataset: 46%|████▌ | 445/964 [00:15<00:16, 31.13 examples/s]
Tokenizing train dataset: 47%|████▋ | 449/964 [00:15<00:18, 28.26 examples/s]
Tokenizing train dataset: 47%|████▋ | 454/964 [00:15<00:16, 31.49 examples/s]
Tokenizing train dataset: 48%|████▊ | 459/964 [00:15<00:17, 28.22 examples/s]
Tokenizing train dataset: 48%|████▊ | 463/964 [00:15<00:17, 28.57 examples/s]
Tokenizing train dataset: 48%|████▊ | 467/964 [00:15<00:16, 29.42 examples/s]
Tokenizing train dataset: 49%|████▉ | 471/964 [00:15<00:17, 28.26 examples/s]
Tokenizing train dataset: 49%|████▉ | 476/964 [00:16<00:15, 31.23 examples/s]
Tokenizing train dataset: 50%|████▉ | 481/964 [00:16<00:14, 32.31 examples/s]
Tokenizing train dataset: 50%|█████ | 485/964 [00:16<00:15, 30.93 examples/s]
Tokenizing train dataset: 51%|█████ | 490/964 [00:16<00:15, 30.90 examples/s]
Tokenizing train dataset: 51%|█████▏ | 495/964 [00:16<00:14, 32.06 examples/s]
Tokenizing train dataset: 52%|█████▏ | 499/964 [00:16<00:14, 32.15 examples/s]
Tokenizing train dataset: 52%|█████▏ | 503/964 [00:16<00:14, 31.06 examples/s]
Tokenizing train dataset: 53%|█████▎ | 508/964 [00:17<00:16, 27.15 examples/s]
Tokenizing train dataset: 53%|█████▎ | 513/964 [00:17<00:14, 30.28 examples/s]
Tokenizing train dataset: 54%|█████▎ | 517/964 [00:17<00:14, 30.09 examples/s]
Tokenizing train dataset: 54%|█████▍ | 521/964 [00:17<00:14, 29.66 examples/s]
Tokenizing train dataset: 55%|█████▍ | 527/964 [00:17<00:13, 32.36 examples/s]
Tokenizing train dataset: 55%|█████▌ | 531/964 [00:17<00:13, 33.24 examples/s]
Tokenizing train dataset: 55%|█████▌ | 535/964 [00:18<00:13, 31.92 examples/s]
Tokenizing train dataset: 56%|█████▌ | 539/964 [00:18<00:13, 30.87 examples/s]
Tokenizing train dataset: 56%|█████▋ | 543/964 [00:18<00:14, 29.27 examples/s]
Tokenizing train dataset: 57%|█████▋ | 548/964 [00:18<00:12, 32.06 examples/s]
Tokenizing train dataset: 57%|█████▋ | 553/964 [00:18<00:12, 31.92 examples/s]
Tokenizing train dataset: 58%|█████▊ | 559/964 [00:18<00:13, 30.84 examples/s]
Tokenizing train dataset: 58%|█████▊ | 563/964 [00:18<00:13, 29.76 examples/s]
Tokenizing train dataset: 59%|█████▉ | 567/964 [00:19<00:12, 30.79 examples/s]
Tokenizing train dataset: 59%|█████▉ | 571/964 [00:19<00:13, 29.71 examples/s]
Tokenizing train dataset: 60%|█████▉ | 577/964 [00:19<00:11, 32.38 examples/s]
Tokenizing train dataset: 60%|██████ | 581/964 [00:19<00:11, 32.46 examples/s]
Tokenizing train dataset: 61%|██████ | 587/964 [00:19<00:11, 32.66 examples/s]
Tokenizing train dataset: 61%|██████▏ | 591/964 [00:19<00:11, 33.07 examples/s]
Tokenizing train dataset: 62%|██████▏ | 596/964 [00:19<00:10, 33.65 examples/s]
Tokenizing train dataset: 62%|██████▏ | 601/964 [00:20<00:11, 30.68 examples/s]
Tokenizing train dataset: 63%|██████▎ | 606/964 [00:20<00:11, 30.62 examples/s]
Tokenizing train dataset: 63%|██████▎ | 611/964 [00:20<00:11, 31.94 examples/s]
Tokenizing train dataset: 64%|██████▍ | 616/964 [00:20<00:11, 29.45 examples/s]
Tokenizing train dataset: 64%|██████▍ | 620/964 [00:20<00:11, 29.28 examples/s]
Tokenizing train dataset: 65%|██████▍ | 625/964 [00:20<00:11, 29.47 examples/s]
Tokenizing train dataset: 65%|██████▌ | 628/964 [00:21<00:11, 28.46 examples/s]
Tokenizing train dataset: 66%|██████▌ | 632/964 [00:21<00:11, 29.39 examples/s]
Tokenizing train dataset: 66%|██████▌ | 636/964 [00:21<00:11, 29.20 examples/s]
Tokenizing train dataset: 66%|██████▋ | 640/964 [00:21<00:11, 28.93 examples/s]
Tokenizing train dataset: 67%|██████▋ | 644/964 [00:21<00:10, 29.64 examples/s]
Tokenizing train dataset: 67%|██████▋ | 650/964 [00:21<00:09, 34.35 examples/s]
Tokenizing train dataset: 68%|██████▊ | 654/964 [00:21<00:09, 32.48 examples/s]
Tokenizing train dataset: 68%|██████▊ | 658/964 [00:21<00:09, 32.79 examples/s]
Tokenizing train dataset: 69%|██████▊ | 662/964 [00:22<00:09, 33.34 examples/s]
Tokenizing train dataset: 69%|██████▉ | 666/964 [00:22<00:08, 33.97 examples/s]
Tokenizing train dataset: 70%|██████▉ | 670/964 [00:22<00:09, 32.33 examples/s]
Tokenizing train dataset: 70%|███████ | 676/964 [00:22<00:09, 31.68 examples/s]
Tokenizing train dataset: 71%|███████ | 680/964 [00:22<00:09, 30.64 examples/s]
Tokenizing train dataset: 71%|███████ | 685/964 [00:22<00:08, 31.06 examples/s]
Tokenizing train dataset: 71%|███████▏ | 689/964 [00:22<00:09, 28.68 examples/s]
Tokenizing train dataset: 72%|███████▏ | 694/964 [00:23<00:08, 31.24 examples/s]
Tokenizing train dataset: 73%|███████▎ | 699/964 [00:23<00:07, 33.49 examples/s]
Tokenizing train dataset: 73%|███████▎ | 704/964 [00:23<00:07, 34.20 examples/s]
Tokenizing train dataset: 74%|███████▎ | 710/964 [00:23<00:06, 36.36 examples/s]
Tokenizing train dataset: 74%|███████▍ | 714/964 [00:23<00:06, 35.94 examples/s]
Tokenizing train dataset: 74%|███████▍ | 718/964 [00:23<00:06, 36.71 examples/s]
Tokenizing train dataset: 75%|███████▍ | 722/964 [00:23<00:07, 33.47 examples/s]
Tokenizing train dataset: 75%|███████▌ | 726/964 [00:24<00:07, 33.30 examples/s]
Tokenizing train dataset: 76%|███████▌ | 730/964 [00:24<00:07, 30.12 examples/s]
Tokenizing train dataset: 76%|███████▌ | 734/964 [00:24<00:07, 30.47 examples/s]
Tokenizing train dataset: 77%|███████▋ | 738/964 [00:24<00:07, 28.94 examples/s]
Tokenizing train dataset: 77%|███████▋ | 742/964 [00:24<00:08, 26.25 examples/s]
Tokenizing train dataset: 77%|███████▋ | 746/964 [00:24<00:08, 27.02 examples/s]
Tokenizing train dataset: 78%|███████▊ | 750/964 [00:24<00:07, 28.46 examples/s]
Tokenizing train dataset: 78%|███████▊ | 753/964 [00:25<00:07, 27.50 examples/s]
Tokenizing train dataset: 79%|███████▊ | 758/964 [00:25<00:07, 28.90 examples/s]
Tokenizing train dataset: 79%|███████▉ | 763/964 [00:25<00:06, 32.16 examples/s]
Tokenizing train dataset: 80%|███████▉ | 769/964 [00:25<00:05, 35.79 examples/s]
Tokenizing train dataset: 80%|████████ | 774/964 [00:25<00:05, 35.37 examples/s]
Tokenizing train dataset: 81%|████████ | 780/964 [00:25<00:05, 32.24 examples/s]
Tokenizing train dataset: 81%|████████▏ | 785/964 [00:25<00:05, 33.13 examples/s]
Tokenizing train dataset: 82%|████████▏ | 790/964 [00:26<00:04, 34.88 examples/s]
Tokenizing train dataset: 82%|████████▏ | 794/964 [00:26<00:05, 33.74 examples/s]
Tokenizing train dataset: 83%|████████▎ | 799/964 [00:26<00:04, 33.70 examples/s]
Tokenizing train dataset: 83%|████████▎ | 803/964 [00:26<00:05, 31.73 examples/s]
Tokenizing train dataset: 84%|████████▍ | 808/964 [00:26<00:04, 31.95 examples/s]
Tokenizing train dataset: 84%|████████▍ | 812/964 [00:26<00:04, 32.08 examples/s]
Tokenizing train dataset: 85%|████████▍ | 816/964 [00:26<00:04, 29.76 examples/s]
Tokenizing train dataset: 85%|████████▌ | 821/964 [00:27<00:04, 30.40 examples/s]
Tokenizing train dataset: 86%|████████▌ | 826/964 [00:27<00:04, 29.18 examples/s]
Tokenizing train dataset: 86%|████████▌ | 829/964 [00:27<00:04, 28.16 examples/s]
Tokenizing train dataset: 86%|████████▋ | 833/964 [00:27<00:04, 29.21 examples/s]
Tokenizing train dataset: 87%|████████▋ | 838/964 [00:27<00:04, 30.32 examples/s]
Tokenizing train dataset: 87%|████████▋ | 842/964 [00:27<00:04, 28.66 examples/s]
Tokenizing train dataset: 88%|████████▊ | 847/964 [00:27<00:03, 31.66 examples/s]
Tokenizing train dataset: 88%|████████▊ | 853/964 [00:28<00:03, 29.78 examples/s]
Tokenizing train dataset: 89%|████████▉ | 857/964 [00:28<00:03, 29.92 examples/s]
Tokenizing train dataset: 89%|████████▉ | 861/964 [00:28<00:03, 30.66 examples/s]
Tokenizing train dataset: 90%|████████▉ | 866/964 [00:28<00:03, 30.34 examples/s]
Tokenizing train dataset: 90%|█████████ | 870/964 [00:28<00:03, 30.92 examples/s]
Tokenizing train dataset: 91%|█████████ | 875/964 [00:28<00:02, 32.00 examples/s]
Tokenizing train dataset: 91%|█████████ | 879/964 [00:29<00:02, 31.62 examples/s]
Tokenizing train dataset: 92%|█████████▏| 883/964 [00:29<00:02, 31.06 examples/s]
Tokenizing train dataset: 92%|█████████▏| 887/964 [00:29<00:02, 31.21 examples/s]
Tokenizing train dataset: 93%|█████████▎| 892/964 [00:29<00:02, 32.65 examples/s]
Tokenizing train dataset: 93%|█████████▎| 896/964 [00:29<00:02, 32.56 examples/s]
Tokenizing train dataset: 93%|█████████▎| 901/964 [00:29<00:01, 31.90 examples/s]
Tokenizing train dataset: 94%|█████████▍| 905/964 [00:29<00:01, 30.79 examples/s]
Tokenizing train dataset: 95%|█████████▍| 911/964 [00:30<00:01, 29.06 examples/s]
Tokenizing train dataset: 95%|█████████▍| 914/964 [00:30<00:01, 28.22 examples/s]
Tokenizing train dataset: 95%|█████████▌| 918/964 [00:30<00:01, 28.33 examples/s]
Tokenizing train dataset: 96%|█████████▌| 923/964 [00:30<00:01, 31.04 examples/s]
Tokenizing train dataset: 96%|█████████▋| 928/964 [00:30<00:01, 30.69 examples/s]
Tokenizing train dataset: 97%|█████████▋| 933/964 [00:30<00:00, 31.35 examples/s]
Tokenizing train dataset: 97%|█████████▋| 937/964 [00:30<00:00, 32.02 examples/s]
Tokenizing train dataset: 98%|█████████▊| 942/964 [00:31<00:00, 33.92 examples/s]
Tokenizing train dataset: 98%|█████████▊| 946/964 [00:31<00:00, 32.86 examples/s]
Tokenizing train dataset: 99%|█████████▊| 951/964 [00:31<00:00, 28.80 examples/s]
Tokenizing train dataset: 99%|█████████▉| 955/964 [00:31<00:00, 29.86 examples/s]
Tokenizing train dataset: 99%|█████████▉| 959/964 [00:31<00:00, 28.96 examples/s]
Tokenizing train dataset: 100%|█████████▉| 963/964 [00:31<00:00, 29.99 examples/s]
Tokenizing train dataset: 100%|██████████| 964/964 [00:33<00:00, 28.63 examples/s]
Truncating train dataset: 0%| | 0/964 [00:00<?, ? examples/s]
Truncating train dataset: 100%|██████████| 964/964 [00:00<00:00, 4670.70 examples/s]
Truncating train dataset: 100%|██████████| 964/964 [00:00<00:00, 4506.57 examples/s]
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,355 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,410 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,411 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,412 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
2026-04-15 11:25:43,415 - INFO - Applying Liger kernels to model instance with model type: qwen3 with kwargs: {}
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Currently logged in as: ligeng-zhu to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: creating run
wandb: Tracking run with wandb version 0.21.0
wandb: Run data is saved locally in runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume/wandb/run-20260415_112548-tw-data-train_final_replaced_from_classified-fix-format-8node-resume
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run runs/dev/tw-data-train_final_replaced_from_classified-fix-format-8node-resume
wandb: ⭐️ View project at https://wandb.ai/ligeng-zhu/ThreadWeaver
wandb: 🚀 View run at https://wandb.ai/ligeng-zhu/ThreadWeaver/runs/tw-data-train_final_replaced_from_classified-fix-format-8node-resume
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
0%| | 0/64 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
2%|▏ | 1/64 [02:46<2:55:08, 166.80s/it]
3%|▎ | 2/64 [05:23<2:46:10, 160.82s/it]
5%|▍ | 3/64 [07:58<2:40:46, 158.15s/it]
6%|▋ | 4/64 [10:31<2:36:14, 156.24s/it]
8%|▊ | 5/64 [13:02<2:31:32, 154.11s/it]
9%|▉ | 6/64 [15:32<2:27:54, 153.01s/it]
11%|█ | 7/64 [18:02<2:24:11, 151.79s/it]
12%|█▎ | 8/64 [20:32<2:21:17, 151.39s/it]
14%|█▍ | 9/64 [23:08<2:20:03, 152.78s/it]
16%|█▌ | 10/64 [25:40<2:17:15, 152.51s/it]
17%|█▋ | 11/64 [28:12<2:14:37, 152.41s/it]
19%|█▉ | 12/64 [30:45<2:12:09, 152.50s/it]
20%|██ | 13/64 [33:14<2:08:52, 151.61s/it]
22%|██▏ | 14/64 [35:46<2:06:19, 151.58s/it]
23%|██▎ | 15/64 [38:15<2:03:05, 150.73s/it]
25%|██▌ | 16/64 [40:45<2:00:28, 150.59s/it]
27%|██▋ | 17/64 [43:20<1:59:06, 152.06s/it]
28%|██▊ | 18/64 [45:53<1:56:46, 152.30s/it]
30%|██▉ | 19/64 [48:26<1:54:13, 152.30s/it]
31%|███▏ | 20/64 [50:58<1:51:46, 152.41s/it]
wandb: WARNING The get_url method is deprecated and will be removed in a future release. Please use `run.url` instead.
33%|███▎ | 21/64 [53:48<1:52:53, 157.51s/it]
34%|███▍ | 22/64 [56:19<1:48:52, 155.53s/it]
36%|███▌ | 23/64 [58:47<1:44:49, 153.41s/it]
38%|███▊ | 24/64 [1:01:18<1:41:47, 152.70s/it]
39%|███▉ | 25/64 [1:03:53<1:39:43, 153.43s/it]
41%|████ | 26/64 [1:06:25<1:36:46, 152.80s/it]
42%|████▏ | 27/64 [1:08:57<1:34:05, 152.58s/it]
44%|████▍ | 28/64 [1:11:33<1:32:17, 153.82s/it]
45%|████▌ | 29/64 [1:14:04<1:29:07, 152.78s/it]
47%|████▋ | 30/64 [1:16:36<1:26:28, 152.59s/it]
48%|████▊ | 31/64 [1:19:05<1:23:19, 151.49s/it]
50%|█████ | 32/64 [1:21:35<1:20:37, 151.16s/it]
52%|█████▏ | 33/64 [1:24:12<1:18:54, 152.72s/it]
53%|█████▎ | 34/64 [1:26:44<1:16:17, 152.58s/it]
55%|█████▍ | 35/64 [1:29:17<1:13:46, 152.64s/it]
56%|█████▋ | 36/64 [1:31:53<1:11:49, 153.92s/it]
58%|█████▊ | 37/64 [1:34:24<1:08:48, 152.91s/it]
59%|█████▉ | 38/64 [1:36:55<1:06:01, 152.37s/it]
61%|██████ | 39/64 [1:39:24<1:03:01, 151.27s/it]
62%|██████▎ | 40/64 [1:41:51<1:00:03, 150.15s/it]
64%|██████▍ | 41/64 [1:44:46<1:00:19, 157.38s/it]
66%|██████▌ | 42/64 [1:47:17<57:00, 155.47s/it]
67%|██████▋ | 43/64 [1:49:49<54:04, 154.48s/it]
69%|██████▉ | 44/64 [1:52:25<51:37, 154.89s/it]
70%|███████ | 45/64 [1:54:55<48:38, 153.59s/it]
72%|███████▏ | 46/64 [1:57:26<45:51, 152.89s/it]
73%|███████▎ | 47/64 [1:59:55<42:55, 151.52s/it]
75%|███████▌ | 48/64 [2:02:23<40:06, 150.38s/it]
77%|███████▋ | 49/64 [2:04:58<37:58, 151.91s/it]
78%|███████▊ | 50/64 [2:07:29<35:24, 151.75s/it]
80%|███████▉ | 51/64 [2:10:03<33:00, 152.31s/it]
81%|████████▏ | 52/64 [2:12:39<30:40, 153.35s/it]
83%|████████▎ | 53/64 [2:15:09<27:57, 152.50s/it]
84%|████████▍ | 54/64 [2:17:41<25:23, 152.32s/it]
86%|████████▌ | 55/64 [2:20:09<22:39, 151.04s/it]
88%|████████▊ | 56/64 [2:22:37<20:00, 150.09s/it]
89%|████████▉ | 57/64 [2:25:13<17:42, 151.73s/it]
91%|█████████ | 58/64 [2:27:44<15:09, 151.53s/it]
92%|█████████▏| 59/64 [2:30:16<12:38, 151.63s/it]
94%|█████████▍| 60/64 [2:32:51<10:11, 152.82s/it]
95%|█████████▌| 61/64 [2:35:40<07:52, 157.51s/it]
97%|█████████▋| 62/64 [2:38:11<05:11, 155.70s/it]
98%|█████████▊| 63/64 [2:40:39<02:33, 153.45s/it]
100%|██████████| 64/64 [2:43:08<00:00, 151.88s/it]
100%|██████████| 64/64 [2:43:26<00:00, 151.88s/it]
100%|██████████| 64/64 [2:43:26<00:00, 153.23s/it]