emberforge-3b-reasoner/benchmarks/lm-eval-2026-02-24/run_v3.log

2026-02-23:22:20:49,920 INFO     [__main__.py:279] Verbosity set to INFO
2026-02-23:22:20:56,465 INFO     [__main__.py:376] Selected Tasks: ['arc_challenge', 'boolq', 'gsm8k', 'hellaswag', 'mmlu', 'piqa', 'truthfulqa_mc2', 'winogrande']
2026-02-23:22:20:56,466 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-02-23:22:20:56,466 INFO     [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'strykes/emberforge-3b-reasoner', 'trust_remote_code': True, 'dtype': 'float16'}
2026-02-23:22:20:56,650 INFO     [huggingface.py:132] Using device 'cuda:0'
2026-02-23:22:20:59,192 INFO     [huggingface.py:369] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading shards:  50%|█████     | 1/2 [01:39<01:39, 99.69s/it]
Downloading shards: 100%|██████████| 2/2 [02:35<00:00, 77.60s/it]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:01<00:01,  1.58s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.41s/it]

Generating train split:   0%|          | 0/1119 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 1119/1119 [00:00<00:00, 104717.23 examples/s]

Generating test split:   0%|          | 0/1172 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1172/1172 [00:00<00:00, 190310.66 examples/s]

Generating validation split:   0%|          | 0/299 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 299/299 [00:00<00:00, 71613.57 examples/s]
2026-02-23:22:23:41,927 WARNING  [task.py:800] [Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
2026-02-23:22:23:41,927 WARNING  [task.py:812] [Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True

Generating train split:   0%|          | 0/9427 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 9427/9427 [00:00<00:00, 163206.46 examples/s]

Generating validation split:   0%|          | 0/3270 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 3270/3270 [00:00<00:00, 177487.86 examples/s]

Generating test split:   0%|          | 0/3245 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 3245/3245 [00:00<00:00, 202860.45 examples/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 331458.42 examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 190243.71 examples/s]

Generating train split:   0%|          | 0/39905 [00:00<?, ? examples/s]
Generating train split:  48%|████▊     | 19000/39905 [00:00<00:00, 181468.24 examples/s]
Generating train split: 100%|██████████| 39905/39905 [00:00<00:00, 168677.41 examples/s]

Generating test split:   0%|          | 0/10003 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 10003/10003 [00:00<00:00, 170656.06 examples/s]

Generating validation split:   0%|          | 0/10042 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 10042/10042 [00:00<00:00, 170143.53 examples/s]

Map:   0%|          | 0/39905 [00:00<?, ? examples/s]
Map: 100%|██████████| 39905/39905 [00:04<00:00, 9868.52 examples/s] 

Map:   0%|          | 0/10042 [00:00<?, ? examples/s]
Map: 100%|██████████| 10042/10042 [00:01<00:00, 8335.19 examples/s]

Generating test split:   0%|          | 0/171 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 171/171 [00:00<00:00, 37968.55 examples/s]

Generating validation split:   0%|          | 0/19 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 19/19 [00:00<00:00, 5885.66 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1641.35 examples/s]

Generating test split:   0%|          | 0/1534 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1534/1534 [00:00<00:00, 114719.84 examples/s]

Generating validation split:   0%|          | 0/170 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 170/170 [00:00<00:00, 40494.76 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1663.75 examples/s]

Generating test split:   0%|          | 0/324 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 324/324 [00:00<00:00, 74085.73 examples/s]

Generating validation split:   0%|          | 0/35 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 35/35 [00:00<00:00, 11826.36 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1770.05 examples/s]

Generating test split:   0%|          | 0/311 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 311/311 [00:00<00:00, 58275.04 examples/s]

Generating validation split:   0%|          | 0/34 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 34/34 [00:00<00:00, 10464.22 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1711.26 examples/s]

Generating test split:   0%|          | 0/895 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 895/895 [00:00<00:00, 148899.37 examples/s]

Generating validation split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 100/100 [00:00<00:00, 30847.28 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1844.62 examples/s]

Generating test split:   0%|          | 0/346 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 346/346 [00:00<00:00, 69185.22 examples/s]

Generating validation split:   0%|          | 0/38 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 38/38 [00:00<00:00, 12485.98 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1842.19 examples/s]

Generating test split:   0%|          | 0/163 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 163/163 [00:00<00:00, 41613.70 examples/s]

Generating validation split:   0%|          | 0/18 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 18/18 [00:00<00:00, 6451.67 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1825.67 examples/s]

Generating test split:   0%|          | 0/108 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 108/108 [00:00<00:00, 27090.77 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3737.63 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1764.09 examples/s]

Generating test split:   0%|          | 0/121 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 121/121 [00:00<00:00, 30838.60 examples/s]

Generating validation split:   0%|          | 0/13 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 13/13 [00:00<00:00, 4453.64 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1898.73 examples/s]

Generating test split:   0%|          | 0/237 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 237/237 [00:00<00:00, 36752.69 examples/s]

Generating validation split:   0%|          | 0/26 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 26/26 [00:00<00:00, 8093.51 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1829.50 examples/s]

Generating test split:   0%|          | 0/204 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 204/204 [00:00<00:00, 38618.79 examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7323.97 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1757.29 examples/s]

Generating test split:   0%|          | 0/165 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 165/165 [00:00<00:00, 24043.22 examples/s]

Generating validation split:   0%|          | 0/18 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 18/18 [00:00<00:00, 5664.58 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1725.34 examples/s]

Generating test split:   0%|          | 0/126 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 126/126 [00:00<00:00, 29231.83 examples/s]

Generating validation split:   0%|          | 0/14 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 14/14 [00:00<00:00, 4548.78 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1669.44 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 21143.84 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3442.57 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1674.91 examples/s]

Generating test split:   0%|          | 0/201 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 201/201 [00:00<00:00, 44788.56 examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7071.40 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1800.59 examples/s]

Generating test split:   0%|          | 0/245 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 245/245 [00:00<00:00, 46390.88 examples/s]

Generating validation split:   0%|          | 0/27 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 27/27 [00:00<00:00, 9020.01 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1747.92 examples/s]

Generating test split:   0%|          | 0/110 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 110/110 [00:00<00:00, 29267.54 examples/s]

Generating validation split:   0%|          | 0/12 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 12/12 [00:00<00:00, 4261.78 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1776.80 examples/s]

Generating test split:   0%|          | 0/612 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 612/612 [00:00<00:00, 119764.57 examples/s]

Generating validation split:   0%|          | 0/69 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 69/69 [00:00<00:00, 21951.38 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1835.90 examples/s]

Generating test split:   0%|          | 0/131 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 131/131 [00:00<00:00, 35366.49 examples/s]

Generating validation split:   0%|          | 0/12 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 12/12 [00:00<00:00, 4178.29 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1850.16 examples/s]

Generating test split:   0%|          | 0/545 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 545/545 [00:00<00:00, 120303.97 examples/s]

Generating validation split:   0%|          | 0/60 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 60/60 [00:00<00:00, 19244.34 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1786.79 examples/s]

Generating test split:   0%|          | 0/238 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 238/238 [00:00<00:00, 49624.40 examples/s]

Generating validation split:   0%|          | 0/26 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 26/26 [00:00<00:00, 8633.67 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1837.35 examples/s]

Generating test split:   0%|          | 0/390 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 390/390 [00:00<00:00, 81341.55 examples/s]

Generating validation split:   0%|          | 0/43 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 43/43 [00:00<00:00, 13817.14 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1855.39 examples/s]

Generating test split:   0%|          | 0/193 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 193/193 [00:00<00:00, 33528.03 examples/s]

Generating validation split:   0%|          | 0/21 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 21/21 [00:00<00:00, 7201.40 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1696.72 examples/s]

Generating test split:   0%|          | 0/198 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 198/198 [00:00<00:00, 55687.80 examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 8794.77 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 2192.99 examples/s]

Generating test split:   0%|          | 0/114 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 114/114 [00:00<00:00, 29388.49 examples/s]

Generating validation split:   0%|          | 0/12 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 12/12 [00:00<00:00, 4193.95 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1748.06 examples/s]

Generating test split:   0%|          | 0/166 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 166/166 [00:00<00:00, 40079.12 examples/s]

Generating validation split:   0%|          | 0/18 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 18/18 [00:00<00:00, 6069.42 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1699.47 examples/s]

Generating test split:   0%|          | 0/272 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 272/272 [00:00<00:00, 57343.59 examples/s]

Generating validation split:   0%|          | 0/31 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 31/31 [00:00<00:00, 12762.41 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 2071.47 examples/s]

Generating test split:   0%|          | 0/282 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 282/282 [00:00<00:00, 63724.68 examples/s]

Generating validation split:   0%|          | 0/31 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 31/31 [00:00<00:00, 10047.40 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1768.55 examples/s]

Generating test split:   0%|          | 0/306 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 306/306 [00:00<00:00, 68316.23 examples/s]

Generating validation split:   0%|          | 0/33 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 33/33 [00:00<00:00, 11038.52 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1785.11 examples/s]

Generating test split:   0%|          | 0/783 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 783/783 [00:00<00:00, 170524.95 examples/s]

Generating validation split:   0%|          | 0/86 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 86/86 [00:00<00:00, 29108.31 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1792.44 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 21843.06 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3478.65 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1695.08 examples/s]

Generating test split:   0%|          | 0/234 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 234/234 [00:00<00:00, 55095.27 examples/s]

Generating validation split:   0%|          | 0/25 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 25/25 [00:00<00:00, 8335.26 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1762.31 examples/s]

Generating test split:   0%|          | 0/103 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 103/103 [00:00<00:00, 22725.58 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3748.26 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1765.13 examples/s]

Generating test split:   0%|          | 0/223 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 223/223 [00:00<00:00, 57157.77 examples/s]

Generating validation split:   0%|          | 0/23 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 23/23 [00:00<00:00, 7942.45 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1843.49 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26542.87 examples/s]

Generating validation split:   0%|          | 0/10 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 10/10 [00:00<00:00, 3438.52 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1798.90 examples/s]

Generating test split:   0%|          | 0/173 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 173/173 [00:00<00:00, 40555.25 examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7402.70 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1801.37 examples/s]

Generating test split:   0%|          | 0/265 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 265/265 [00:00<00:00, 58849.50 examples/s]

Generating validation split:   0%|          | 0/29 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 29/29 [00:00<00:00, 8974.09 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1707.22 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26497.59 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3834.87 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1721.94 examples/s]

Generating test split:   0%|          | 0/112 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 112/112 [00:00<00:00, 22252.00 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3453.65 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1729.75 examples/s]

Generating test split:   0%|          | 0/216 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 216/216 [00:00<00:00, 46824.98 examples/s]

Generating validation split:   0%|          | 0/23 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 23/23 [00:00<00:00, 7772.24 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1810.54 examples/s]

Generating test split:   0%|          | 0/151 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 151/151 [00:00<00:00, 32208.09 examples/s]

Generating validation split:   0%|          | 0/17 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 17/17 [00:00<00:00, 5617.96 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1723.21 examples/s]

Generating test split:   0%|          | 0/270 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 270/270 [00:00<00:00, 68981.06 examples/s]

Generating validation split:   0%|          | 0/29 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 29/29 [00:00<00:00, 10001.22 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1856.54 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 25877.99 examples/s]

Generating validation split:   0%|          | 0/9 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 9/9 [00:00<00:00, 3217.86 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1814.15 examples/s]

Generating test split:   0%|          | 0/203 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 203/203 [00:00<00:00, 51602.65 examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7454.73 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1839.12 examples/s]

Generating test split:   0%|          | 0/310 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 310/310 [00:00<00:00, 68028.79 examples/s]

Generating validation split:   0%|          | 0/32 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 32/32 [00:00<00:00, 10385.15 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1813.36 examples/s]

Generating test split:   0%|          | 0/378 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 378/378 [00:00<00:00, 96356.32 examples/s]

Generating validation split:   0%|          | 0/41 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 41/41 [00:00<00:00, 13894.03 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1857.86 examples/s]

Generating test split:   0%|          | 0/145 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 145/145 [00:00<00:00, 36778.79 examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 16/16 [00:00<00:00, 5424.25 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1834.30 examples/s]

Generating test split:   0%|          | 0/235 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 235/235 [00:00<00:00, 52170.72 examples/s]

Generating validation split:   0%|          | 0/26 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 26/26 [00:00<00:00, 9269.18 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1890.35 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 23591.34 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3521.67 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1770.94 examples/s]

Generating test split:   0%|          | 0/102 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 102/102 [00:00<00:00, 22729.73 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3840.30 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1871.62 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26584.93 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3895.09 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1831.73 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 25754.05 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3996.65 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1791.06 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26693.21 examples/s]

Generating validation split:   0%|          | 0/8 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 8/8 [00:00<00:00, 2736.46 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1783.75 examples/s]

Generating test split:   0%|          | 0/144 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 144/144 [00:00<00:00, 37919.37 examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 16/16 [00:00<00:00, 5634.67 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1703.34 examples/s]

Generating test split:   0%|          | 0/152 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 152/152 [00:00<00:00, 38671.25 examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 16/16 [00:00<00:00, 5587.28 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1857.70 examples/s]

Generating test split:   0%|          | 0/135 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 135/135 [00:00<00:00, 35767.23 examples/s]

Generating validation split:   0%|          | 0/14 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 14/14 [00:00<00:00, 4942.78 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1858.02 examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 22633.99 examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3711.18 examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1802.30 examples/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]
Downloading data:  58%|█████▊    | 1.05M/1.82M [00:00<00:00, 10.5MB/s]
Downloading data: 100%|██████████| 1.82M/1.82M [00:00<00:00, 15.2MB/s]

Downloading data:   0%|          | 0.00/220k [00:00<?, ?B/s]
Downloading data: 815kB [00:00, 30.9MB/s]

Generating train split:   0%|          | 0/16113 [00:00<?, ? examples/s]
Generating train split:  24%|██▎       | 3821/16113 [00:00<00:00, 38057.51 examples/s]
Generating train split:  53%|█████▎    | 8583/16113 [00:00<00:00, 43666.72 examples/s]
Generating train split:  84%|████████▎ | 13477/16113 [00:00<00:00, 45595.42 examples/s]
Generating train split: 100%|██████████| 16113/16113 [00:00<00:00, 44832.87 examples/s]

Generating test split:   0%|          | 0/3084 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 3084/3084 [00:00<00:00, 50655.69 examples/s]

Generating validation split:   0%|          | 0/1838 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 1838/1838 [00:00<00:00, 46175.46 examples/s]

Generating validation split:   0%|          | 0/817 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 817/817 [00:00<00:00, 87483.95 examples/s]

Generating train split:   0%|          | 0/40398 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 40398/40398 [00:00<00:00, 1122842.95 examples/s]

Generating test split:   0%|          | 0/1767 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1767/1767 [00:00<00:00, 342784.11 examples/s]

Generating validation split:   0%|          | 0/1267 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 1267/1267 [00:00<00:00, 291827.74 examples/s]
2026-02-23:22:26:19,189 INFO     [task.py:415] Building contexts for winogrande on rank 0...

  0%|          | 0/1267 [00:00<?, ?it/s]
100%|██████████| 1267/1267 [00:00<00:00, 121592.11it/s]
2026-02-23:22:26:19,229 INFO     [task.py:415] Building contexts for truthfulqa_mc2 on rank 0...

  0%|          | 0/817 [00:00<?, ?it/s]
 13%|█▎        | 103/817 [00:00<00:00, 1023.46it/s]
 26%|██▌       | 214/817 [00:00<00:00, 1073.62it/s]
 40%|███▉      | 325/817 [00:00<00:00, 1088.94it/s]
 53%|█████▎    | 436/817 [00:00<00:00, 1090.18it/s]
 67%|██████▋   | 548/817 [00:00<00:00, 1097.75it/s]
 81%|████████  | 660/817 [00:00<00:00, 1104.43it/s]
 94%|█████████▍| 771/817 [00:00<00:00, 1102.25it/s]
100%|██████████| 817/817 [00:00<00:00, 1094.96it/s]
2026-02-23:22:26:20,019 INFO     [task.py:415] Building contexts for piqa on rank 0...

  0%|          | 0/1838 [00:00<?, ?it/s]
  9%|▉         | 165/1838 [00:00<00:01, 1647.90it/s]
 18%|█▊        | 330/1838 [00:00<00:00, 1633.23it/s]
 27%|██▋       | 499/1838 [00:00<00:00, 1657.08it/s]
 36%|███▋      | 667/1838 [00:00<00:00, 1665.45it/s]
 45%|████▌     | 835/1838 [00:00<00:00, 1668.69it/s]
 55%|█████▍    | 1003/1838 [00:00<00:00, 1672.37it/s]
 64%|██████▍   | 1172/1838 [00:00<00:00, 1675.35it/s]
 73%|███████▎  | 1341/1838 [00:00<00:00, 1678.48it/s]
 82%|████████▏ | 1509/1838 [00:00<00:00, 1678.79it/s]
 91%|█████████▏| 1678/1838 [00:01<00:00, 1679.59it/s]
100%|██████████| 1838/1838 [00:01<00:00, 1672.18it/s]
2026-02-23:22:26:21,163 INFO     [task.py:415] Building contexts for mmlu_abstract_algebra on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 945.54it/s]
100%|██████████| 100/100 [00:00<00:00, 943.82it/s]
2026-02-23:22:26:21,273 INFO     [task.py:415] Building contexts for mmlu_anatomy on rank 0...

  0%|          | 0/135 [00:00<?, ?it/s]
 71%|███████   | 96/135 [00:00<00:00, 952.47it/s]
100%|██████████| 135/135 [00:00<00:00, 952.25it/s]
2026-02-23:22:26:21,419 INFO     [task.py:415] Building contexts for mmlu_astronomy on rank 0...

  0%|          | 0/152 [00:00<?, ?it/s]
 63%|██████▎   | 96/152 [00:00<00:00, 955.44it/s]
100%|██████████| 152/152 [00:00<00:00, 956.61it/s]
2026-02-23:22:26:21,583 INFO     [task.py:415] Building contexts for mmlu_college_biology on rank 0...

  0%|          | 0/144 [00:00<?, ?it/s]
 67%|██████▋   | 97/144 [00:00<00:00, 962.33it/s]
100%|██████████| 144/144 [00:00<00:00, 962.90it/s]
2026-02-23:22:26:21,737 INFO     [task.py:415] Building contexts for mmlu_college_chemistry on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 92%|█████████▏| 92/100 [00:00<00:00, 912.65it/s]
100%|██████████| 100/100 [00:00<00:00, 914.39it/s]
2026-02-23:22:26:21,850 INFO     [task.py:415] Building contexts for mmlu_college_computer_science on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 943.37it/s]
100%|██████████| 100/100 [00:00<00:00, 941.60it/s]
2026-02-23:22:26:21,959 INFO     [task.py:415] Building contexts for mmlu_college_mathematics on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 940.59it/s]
100%|██████████| 100/100 [00:00<00:00, 940.02it/s]
2026-02-23:22:26:22,069 INFO     [task.py:415] Building contexts for mmlu_college_physics on rank 0...

  0%|          | 0/102 [00:00<?, ?it/s]
 93%|█████████▎| 95/102 [00:00<00:00, 940.24it/s]
100%|██████████| 102/102 [00:00<00:00, 939.31it/s]
2026-02-23:22:26:22,181 INFO     [task.py:415] Building contexts for mmlu_computer_security on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 940.24it/s]
100%|██████████| 100/100 [00:00<00:00, 939.87it/s]
2026-02-23:22:26:22,291 INFO     [task.py:415] Building contexts for mmlu_conceptual_physics on rank 0...

  0%|          | 0/235 [00:00<?, ?it/s]
 40%|████      | 95/235 [00:00<00:00, 946.60it/s]
 81%|████████  | 190/235 [00:00<00:00, 942.23it/s]
100%|██████████| 235/235 [00:00<00:00, 943.23it/s]
2026-02-23:22:26:22,548 INFO     [task.py:415] Building contexts for mmlu_electrical_engineering on rank 0...

  0%|          | 0/145 [00:00<?, ?it/s]
 66%|██████▌   | 95/145 [00:00<00:00, 944.79it/s]
100%|██████████| 145/145 [00:00<00:00, 944.84it/s]
2026-02-23:22:26:22,706 INFO     [task.py:415] Building contexts for mmlu_elementary_mathematics on rank 0...

  0%|          | 0/378 [00:00<?, ?it/s]
 25%|██▍       | 94/378 [00:00<00:00, 929.87it/s]
 50%|█████     | 189/378 [00:00<00:00, 938.70it/s]
 75%|███████▌  | 284/378 [00:00<00:00, 941.58it/s]
100%|██████████| 378/378 [00:00<00:00, 940.85it/s]
2026-02-23:22:26:23,119 INFO     [task.py:415] Building contexts for mmlu_high_school_biology on rank 0...

  0%|          | 0/310 [00:00<?, ?it/s]
 31%|███       | 95/310 [00:00<00:00, 941.45it/s]
 61%|██████▏   | 190/310 [00:00<00:00, 944.12it/s]
 92%|█████████▏| 285/310 [00:00<00:00, 935.99it/s]
100%|██████████| 310/310 [00:00<00:00, 938.56it/s]
2026-02-23:22:26:23,460 INFO     [task.py:415] Building contexts for mmlu_high_school_chemistry on rank 0...

  0%|          | 0/203 [00:00<?, ?it/s]
 47%|████▋     | 95/203 [00:00<00:00, 947.64it/s]
 94%|█████████▎| 190/203 [00:00<00:00, 948.38it/s]
100%|██████████| 203/203 [00:00<00:00, 947.93it/s]
2026-02-23:22:26:23,680 INFO     [task.py:415] Building contexts for mmlu_high_school_computer_science on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 945.25it/s]
100%|██████████| 100/100 [00:00<00:00, 944.51it/s]
2026-02-23:22:26:23,790 INFO     [task.py:415] Building contexts for mmlu_high_school_mathematics on rank 0...

  0%|          | 0/270 [00:00<?, ?it/s]
 35%|███▌      | 95/270 [00:00<00:00, 949.07it/s]
 70%|███████   | 190/270 [00:00<00:00, 945.03it/s]
100%|██████████| 270/270 [00:00<00:00, 945.24it/s]
2026-02-23:22:26:24,084 INFO     [task.py:415] Building contexts for mmlu_high_school_physics on rank 0...

  0%|          | 0/151 [00:00<?, ?it/s]
 63%|██████▎   | 95/151 [00:00<00:00, 947.05it/s]
100%|██████████| 151/151 [00:00<00:00, 945.58it/s]
2026-02-23:22:26:24,249 INFO     [task.py:415] Building contexts for mmlu_high_school_statistics on rank 0...

  0%|          | 0/216 [00:00<?, ?it/s]
 44%|████▍     | 95/216 [00:00<00:00, 944.40it/s]
 88%|████████▊ | 190/216 [00:00<00:00, 945.21it/s]
100%|██████████| 216/216 [00:00<00:00, 945.48it/s]
2026-02-23:22:26:24,484 INFO     [task.py:415] Building contexts for mmlu_machine_learning on rank 0...

  0%|          | 0/112 [00:00<?, ?it/s]
 85%|████████▍ | 95/112 [00:00<00:00, 941.58it/s]
100%|██████████| 112/112 [00:00<00:00, 941.07it/s]
2026-02-23:22:26:24,607 INFO     [task.py:415] Building contexts for mmlu_business_ethics on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 949.73it/s]
100%|██████████| 100/100 [00:00<00:00, 949.38it/s]
2026-02-23:22:26:24,715 INFO     [task.py:415] Building contexts for mmlu_clinical_knowledge on rank 0...

  0%|          | 0/265 [00:00<?, ?it/s]
 36%|███▌      | 95/265 [00:00<00:00, 945.55it/s]
 72%|███████▏  | 190/265 [00:00<00:00, 947.38it/s]
100%|██████████| 265/265 [00:00<00:00, 948.10it/s]
2026-02-23:22:26:25,003 INFO     [task.py:415] Building contexts for mmlu_college_medicine on rank 0...

  0%|          | 0/173 [00:00<?, ?it/s]
 55%|█████▍    | 95/173 [00:00<00:00, 948.80it/s]
100%|██████████| 173/173 [00:00<00:00, 949.59it/s]
2026-02-23:22:26:25,191 INFO     [task.py:415] Building contexts for mmlu_global_facts on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 948.66it/s]
100%|██████████| 100/100 [00:00<00:00, 948.33it/s]
2026-02-23:22:26:25,299 INFO     [task.py:415] Building contexts for mmlu_human_aging on rank 0...

  0%|          | 0/223 [00:00<?, ?it/s]
 43%|████▎     | 95/223 [00:00<00:00, 949.62it/s]
 86%|████████▌ | 191/223 [00:00<00:00, 950.12it/s]
100%|██████████| 223/223 [00:00<00:00, 950.03it/s]
2026-02-23:22:26:25,541 INFO     [task.py:415] Building contexts for mmlu_management on rank 0...

  0%|          | 0/103 [00:00<?, ?it/s]
 92%|█████████▏| 95/103 [00:00<00:00, 946.59it/s]
100%|██████████| 103/103 [00:00<00:00, 946.55it/s]
2026-02-23:22:26:25,654 INFO     [task.py:415] Building contexts for mmlu_marketing on rank 0...

  0%|          | 0/234 [00:00<?, ?it/s]
 41%|████      | 96/234 [00:00<00:00, 950.91it/s]
 82%|████████▏ | 192/234 [00:00<00:00, 949.95it/s]
100%|██████████| 234/234 [00:00<00:00, 950.57it/s]
2026-02-23:22:26:25,907 INFO     [task.py:415] Building contexts for mmlu_medical_genetics on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 943.12it/s]
100%|██████████| 100/100 [00:00<00:00, 942.03it/s]
2026-02-23:22:26:26,017 INFO     [task.py:415] Building contexts for mmlu_miscellaneous on rank 0...

  0%|          | 0/783 [00:00<?, ?it/s]
 12%|█▏        | 95/783 [00:00<00:00, 945.81it/s]
 24%|██▍       | 191/783 [00:00<00:00, 948.53it/s]
 37%|███▋      | 287/783 [00:00<00:00, 949.40it/s]
 49%|████▉     | 382/783 [00:00<00:00, 946.47it/s]
 61%|██████    | 477/783 [00:00<00:00, 947.20it/s]
 73%|███████▎  | 572/783 [00:00<00:00, 947.55it/s]
 85%|████████▌ | 667/783 [00:00<00:00, 944.00it/s]
 97%|█████████▋| 762/783 [00:00<00:00, 945.44it/s]
100%|██████████| 783/783 [00:00<00:00, 946.21it/s]
2026-02-23:22:26:26,868 INFO     [task.py:415] Building contexts for mmlu_nutrition on rank 0...

  0%|          | 0/306 [00:00<?, ?it/s]
 31%|███       | 95/306 [00:00<00:00, 945.65it/s]
 62%|██████▏   | 190/306 [00:00<00:00, 942.86it/s]
 93%|█████████▎| 285/306 [00:00<00:00, 945.70it/s]
100%|██████████| 306/306 [00:00<00:00, 945.33it/s]
2026-02-23:22:26:27,202 INFO     [task.py:415] Building contexts for mmlu_professional_accounting on rank 0...

  0%|          | 0/282 [00:00<?, ?it/s]
 34%|███▎      | 95/282 [00:00<00:00, 947.16it/s]
 67%|██████▋   | 190/282 [00:00<00:00, 948.05it/s]
100%|██████████| 282/282 [00:00<00:00, 943.82it/s]
2026-02-23:22:26:27,510 INFO     [task.py:415] Building contexts for mmlu_professional_medicine on rank 0...

  0%|          | 0/272 [00:00<?, ?it/s]
 35%|███▍      | 95/272 [00:00<00:00, 944.17it/s]
 70%|██████▉   | 190/272 [00:00<00:00, 945.88it/s]
100%|██████████| 272/272 [00:00<00:00, 945.00it/s]
2026-02-23:22:26:27,807 INFO     [task.py:415] Building contexts for mmlu_virology on rank 0...

  0%|          | 0/166 [00:00<?, ?it/s]
 57%|█████▋    | 95/166 [00:00<00:00, 949.10it/s]
100%|██████████| 166/166 [00:00<00:00, 949.47it/s]
2026-02-23:22:26:27,987 INFO     [task.py:415] Building contexts for mmlu_econometrics on rank 0...

  0%|          | 0/114 [00:00<?, ?it/s]
 83%|████████▎ | 95/114 [00:00<00:00, 949.88it/s]
100%|██████████| 114/114 [00:00<00:00, 949.11it/s]
2026-02-23:22:26:28,111 INFO     [task.py:415] Building contexts for mmlu_high_school_geography on rank 0...

  0%|          | 0/198 [00:00<?, ?it/s]
 48%|████▊     | 96/198 [00:00<00:00, 951.89it/s]
 97%|█████████▋| 192/198 [00:00<00:00, 952.15it/s]
100%|██████████| 198/198 [00:00<00:00, 951.37it/s]
2026-02-23:22:26:28,325 INFO     [task.py:415] Building contexts for mmlu_high_school_government_and_politics on rank 0...

  0%|          | 0/193 [00:00<?, ?it/s]
 49%|████▉     | 95/193 [00:00<00:00, 946.77it/s]
 98%|█████████▊| 190/193 [00:00<00:00, 945.46it/s]
100%|██████████| 193/193 [00:00<00:00, 945.28it/s]
2026-02-23:22:26:28,535 INFO     [task.py:415] Building contexts for mmlu_high_school_macroeconomics on rank 0...

  0%|          | 0/390 [00:00<?, ?it/s]
 24%|██▍       | 95/390 [00:00<00:00, 948.08it/s]
 49%|████▊     | 190/390 [00:00<00:00, 947.04it/s]
 73%|███████▎  | 285/390 [00:00<00:00, 946.92it/s]
 97%|█████████▋| 380/390 [00:00<00:00, 947.41it/s]
100%|██████████| 390/390 [00:00<00:00, 946.92it/s]
2026-02-23:22:26:28,959 INFO     [task.py:415] Building contexts for mmlu_high_school_microeconomics on rank 0...

  0%|          | 0/238 [00:00<?, ?it/s]
 40%|███▉      | 95/238 [00:00<00:00, 945.65it/s]
 80%|███████▉  | 190/238 [00:00<00:00, 946.72it/s]
100%|██████████| 238/238 [00:00<00:00, 946.27it/s]
2026-02-23:22:26:29,218 INFO     [task.py:415] Building contexts for mmlu_high_school_psychology on rank 0...

  0%|          | 0/545 [00:00<?, ?it/s]
 17%|█▋        | 95/545 [00:00<00:00, 948.64it/s]
 35%|███▍      | 190/545 [00:00<00:00, 949.16it/s]
 52%|█████▏    | 285/545 [00:00<00:00, 949.21it/s]
 70%|██████▉   | 381/545 [00:00<00:00, 949.77it/s]
 87%|████████▋ | 476/545 [00:00<00:00, 949.18it/s]
100%|██████████| 545/545 [00:00<00:00, 948.27it/s]
2026-02-23:22:26:29,810 INFO     [task.py:415] Building contexts for mmlu_human_sexuality on rank 0...

  0%|          | 0/131 [00:00<?, ?it/s]
 72%|███████▏  | 94/131 [00:00<00:00, 938.62it/s]
100%|██████████| 131/131 [00:00<00:00, 939.31it/s]
2026-02-23:22:26:29,954 INFO     [task.py:415] Building contexts for mmlu_professional_psychology on rank 0...

  0%|          | 0/612 [00:00<?, ?it/s]
 12%|█▏        | 72/612 [00:00<00:03, 146.32it/s]
 27%|██▋       | 167/612 [00:00<00:01, 326.78it/s]
 43%|████▎     | 262/612 [00:00<00:00, 477.10it/s]
 58%|█████▊    | 357/612 [00:00<00:00, 597.63it/s]
 74%|███████▎  | 451/612 [00:00<00:00, 688.95it/s]
 89%|████████▉ | 545/612 [00:00<00:00, 757.74it/s]
100%|██████████| 612/612 [00:01<00:00, 572.17it/s]
2026-02-23:22:26:31,042 INFO     [task.py:415] Building contexts for mmlu_public_relations on rank 0...

  0%|          | 0/110 [00:00<?, ?it/s]
 85%|████████▌ | 94/110 [00:00<00:00, 939.05it/s]
100%|██████████| 110/110 [00:00<00:00, 940.54it/s]
2026-02-23:22:26:31,163 INFO     [task.py:415] Building contexts for mmlu_security_studies on rank 0...

  0%|          | 0/245 [00:00<?, ?it/s]
 39%|███▉      | 95/245 [00:00<00:00, 941.24it/s]
 78%|███████▊  | 190/245 [00:00<00:00, 936.74it/s]
100%|██████████| 245/245 [00:00<00:00, 936.87it/s]
2026-02-23:22:26:31,433 INFO     [task.py:415] Building contexts for mmlu_sociology on rank 0...

  0%|          | 0/201 [00:00<?, ?it/s]
 45%|████▍     | 90/201 [00:00<00:00, 893.54it/s]
 92%|█████████▏| 184/201 [00:00<00:00, 920.66it/s]
100%|██████████| 201/201 [00:00<00:00, 918.57it/s]
2026-02-23:22:26:31,658 INFO     [task.py:415] Building contexts for mmlu_us_foreign_policy on rank 0...

  0%|          | 0/100 [00:00<?, ?it/s]
 95%|█████████▌| 95/100 [00:00<00:00, 947.47it/s]
100%|██████████| 100/100 [00:00<00:00, 945.65it/s]
2026-02-23:22:26:31,768 INFO     [task.py:415] Building contexts for mmlu_formal_logic on rank 0...

  0%|          | 0/126 [00:00<?, ?it/s]
 75%|███████▌  | 95/126 [00:00<00:00, 944.25it/s]
100%|██████████| 126/126 [00:00<00:00, 942.49it/s]
2026-02-23:22:26:31,906 INFO     [task.py:415] Building contexts for mmlu_high_school_european_history on rank 0...

  0%|          | 0/165 [00:00<?, ?it/s]
 56%|█████▋    | 93/165 [00:00<00:00, 924.18it/s]
100%|██████████| 165/165 [00:00<00:00, 930.68it/s]
2026-02-23:22:26:32,089 INFO     [task.py:415] Building contexts for mmlu_high_school_us_history on rank 0...

  0%|          | 0/204 [00:00<?, ?it/s]
 46%|████▌     | 94/204 [00:00<00:00, 933.50it/s]
 92%|█████████▏| 188/204 [00:00<00:00, 934.77it/s]
100%|██████████| 204/204 [00:00<00:00, 934.34it/s]
2026-02-23:22:26:32,314 INFO     [task.py:415] Building contexts for mmlu_high_school_world_history on rank 0...

  0%|          | 0/237 [00:00<?, ?it/s]
 40%|███▉      | 94/237 [00:00<00:00, 934.21it/s]
 79%|███████▉  | 188/237 [00:00<00:00, 913.62it/s]
100%|██████████| 237/237 [00:00<00:00, 920.74it/s]
2026-02-23:22:26:32,580 INFO     [task.py:415] Building contexts for mmlu_international_law on rank 0...

  0%|          | 0/121 [00:00<?, ?it/s]
 79%|███████▊  | 95/121 [00:00<00:00, 946.22it/s]
100%|██████████| 121/121 [00:00<00:00, 944.12it/s]
2026-02-23:22:26:32,712 INFO     [task.py:415] Building contexts for mmlu_jurisprudence on rank 0...

  0%|          | 0/108 [00:00<?, ?it/s]
 88%|████████▊ | 95/108 [00:00<00:00, 944.84it/s]
100%|██████████| 108/108 [00:00<00:00, 943.78it/s]
2026-02-23:22:26:32,830 INFO     [task.py:415] Building contexts for mmlu_logical_fallacies on rank 0...

  0%|          | 0/163 [00:00<?, ?it/s]
 58%|█████▊    | 94/163 [00:00<00:00, 938.13it/s]
100%|██████████| 163/163 [00:00<00:00, 940.19it/s]
2026-02-23:22:26:33,009 INFO     [task.py:415] Building contexts for mmlu_moral_disputes on rank 0...

  0%|          | 0/346 [00:00<?, ?it/s]
 27%|██▋       | 94/346 [00:00<00:00, 937.24it/s]
 55%|█████▍    | 189/346 [00:00<00:00, 940.45it/s]
 82%|████████▏ | 284/346 [00:00<00:00, 943.35it/s]
100%|██████████| 346/346 [00:00<00:00, 943.49it/s]
2026-02-23:22:26:33,386 INFO     [task.py:415] Building contexts for mmlu_moral_scenarios on rank 0...

  0%|          | 0/895 [00:00<?, ?it/s]
 11%|█         | 95/895 [00:00<00:00, 943.56it/s]
 21%|██        | 190/895 [00:00<00:00, 944.54it/s]
 32%|███▏      | 285/895 [00:00<00:00, 943.91it/s]
 42%|████▏     | 380/895 [00:00<00:00, 944.20it/s]
 53%|█████▎    | 475/895 [00:00<00:00, 944.06it/s]
 64%|██████▎   | 570/895 [00:00<00:00, 943.09it/s]
 74%|███████▍  | 665/895 [00:00<00:00, 942.96it/s]
 85%|████████▍ | 760/895 [00:00<00:00, 942.46it/s]
 96%|█████████▌| 855/895 [00:00<00:00, 942.32it/s]
100%|██████████| 895/895 [00:00<00:00, 943.04it/s]
2026-02-23:22:26:34,362 INFO     [task.py:415] Building contexts for mmlu_philosophy on rank 0...

  0%|          | 0/311 [00:00<?, ?it/s]
 30%|██▉       | 93/311 [00:00<00:00, 927.99it/s]
 60%|██████    | 187/311 [00:00<00:00, 930.68it/s]
 91%|█████████ | 282/311 [00:00<00:00, 937.60it/s]
100%|██████████| 311/311 [00:00<00:00, 936.32it/s]
2026-02-23:22:26:34,704 INFO     [task.py:415] Building contexts for mmlu_prehistory on rank 0...

  0%|          | 0/324 [00:00<?, ?it/s]
 29%|██▉       | 95/324 [00:00<00:00, 941.61it/s]
 59%|█████▊    | 190/324 [00:00<00:00, 943.38it/s]
 88%|████████▊ | 285/324 [00:00<00:00, 944.04it/s]
100%|██████████| 324/324 [00:00<00:00, 943.20it/s]
2026-02-23:22:26:35,058 INFO     [task.py:415] Building contexts for mmlu_professional_law on rank 0...

  0%|          | 0/1534 [00:00<?, ?it/s]
  6%|▌         | 95/1534 [00:00<00:01, 941.24it/s]
 12%|█▏        | 190/1534 [00:00<00:01, 940.89it/s]
 19%|█▊        | 285/1534 [00:00<00:01, 942.03it/s]
 25%|██▍       | 380/1534 [00:00<00:01, 942.92it/s]
 31%|███       | 475/1534 [00:00<00:01, 943.11it/s]
 37%|███▋      | 570/1534 [00:00<00:01, 943.58it/s]
 43%|████▎     | 665/1534 [00:00<00:00, 943.91it/s]
 50%|████▉     | 760/1534 [00:00<00:00, 943.91it/s]
 56%|█████▌    | 855/1534 [00:00<00:00, 944.31it/s]
 62%|██████▏   | 950/1534 [00:01<00:00, 944.68it/s]
 68%|██████▊   | 1045/1534 [00:01<00:00, 944.07it/s]
 74%|███████▍  | 1140/1534 [00:01<00:00, 943.84it/s]
 81%|████████  | 1235/1534 [00:01<00:00, 942.03it/s]
 87%|████████▋ | 1330/1534 [00:01<00:00, 942.53it/s]
 93%|█████████▎| 1425/1534 [00:01<00:00, 942.63it/s]
 99%|█████████▉| 1520/1534 [00:01<00:00, 943.61it/s]
100%|██████████| 1534/1534 [00:01<00:00, 943.20it/s]
2026-02-23:22:26:36,734 INFO     [task.py:415] Building contexts for mmlu_world_religions on rank 0...

  0%|          | 0/171 [00:00<?, ?it/s]
 55%|█████▍    | 94/171 [00:00<00:00, 931.33it/s]
100%|██████████| 171/171 [00:00<00:00, 936.74it/s]
2026-02-23:22:26:36,922 INFO     [task.py:415] Building contexts for hellaswag on rank 0...

  0%|          | 0/10042 [00:00<?, ?it/s]
  4%|▍         | 394/10042 [00:00<00:02, 3930.78it/s]
  8%|▊         | 797/10042 [00:00<00:02, 3983.89it/s]
 12%|█▏        | 1200/10042 [00:00<00:02, 4002.32it/s]
 16%|█▌        | 1604/10042 [00:00<00:02, 4016.05it/s]
 20%|█▉        | 2007/10042 [00:00<00:01, 4019.47it/s]
 24%|██▍       | 2409/10042 [00:00<00:01, 4019.44it/s]
 28%|██▊       | 2813/10042 [00:00<00:01, 4025.88it/s]
 32%|███▏      | 3218/10042 [00:00<00:01, 4031.21it/s]
 36%|███▌      | 3622/10042 [00:00<00:01, 4009.32it/s]
 40%|████      | 4025/10042 [00:01<00:01, 4013.22it/s]
 44%|████▍     | 4427/10042 [00:01<00:01, 4007.64it/s]
 48%|████▊     | 4832/10042 [00:01<00:01, 4020.38it/s]
 52%|█████▏    | 5238/10042 [00:01<00:01, 4030.52it/s]
 56%|█████▌    | 5644/10042 [00:01<00:01, 4037.33it/s]
 60%|██████    | 6048/10042 [00:01<00:01, 3990.35it/s]
 64%|██████▍   | 6448/10042 [00:01<00:00, 3992.56it/s]
 68%|██████▊   | 6853/10042 [00:01<00:00, 4007.66it/s]
 72%|███████▏  | 7260/10042 [00:01<00:00, 4023.88it/s]
 76%|███████▋  | 7665/10042 [00:01<00:00, 4029.18it/s]
 80%|████████  | 8068/10042 [00:02<00:01, 1684.27it/s]
 84%|████████▍ | 8477/10042 [00:02<00:00, 2048.50it/s]
 89%|████████▊ | 8888/10042 [00:02<00:00, 2415.04it/s]
 93%|█████████▎| 9299/10042 [00:02<00:00, 2757.67it/s]
 97%|█████████▋| 9712/10042 [00:02<00:00, 3065.17it/s]
100%|██████████| 10042/10042 [00:02<00:00, 3398.78it/s]
2026-02-23:22:26:40,721 INFO     [task.py:415] Building contexts for gsm8k on rank 0...

  0%|          | 0/1319 [00:00<?, ?it/s]
  3%|▎         | 36/1319 [00:00<00:03, 355.03it/s]
  6%|▌         | 73/1319 [00:00<00:03, 358.73it/s]
  8%|▊         | 110/1319 [00:00<00:03, 360.53it/s]
 11%|█         | 147/1319 [00:00<00:03, 361.64it/s]
 14%|█▍        | 184/1319 [00:00<00:03, 357.66it/s]
 17%|█▋        | 220/1319 [00:00<00:03, 355.14it/s]
 19%|█▉        | 256/1319 [00:00<00:02, 354.66it/s]
 22%|██▏       | 292/1319 [00:00<00:02, 354.71it/s]
 25%|██▍       | 328/1319 [00:00<00:02, 354.32it/s]
 28%|██▊       | 364/1319 [00:01<00:02, 353.99it/s]
 30%|███       | 400/1319 [00:01<00:02, 353.76it/s]
 33%|███▎      | 436/1319 [00:01<00:02, 353.77it/s]
 36%|███▌      | 472/1319 [00:01<00:02, 353.75it/s]
 39%|███▊      | 508/1319 [00:01<00:02, 353.54it/s]
 41%|████      | 544/1319 [00:01<00:02, 353.68it/s]
 44%|████▍     | 580/1319 [00:01<00:02, 353.67it/s]
 47%|████▋     | 616/1319 [00:01<00:01, 353.32it/s]
 49%|████▉     | 652/1319 [00:01<00:01, 353.47it/s]
 52%|█████▏    | 688/1319 [00:01<00:01, 353.82it/s]
 55%|█████▍    | 724/1319 [00:02<00:01, 351.95it/s]
 58%|█████▊    | 760/1319 [00:02<00:01, 350.45it/s]
 60%|██████    | 796/1319 [00:02<00:01, 351.61it/s]
 63%|██████▎   | 832/1319 [00:02<00:01, 352.26it/s]
 66%|██████▌   | 868/1319 [00:02<00:01, 353.10it/s]
 69%|██████▊   | 904/1319 [00:02<00:01, 353.49it/s]
 71%|███████▏  | 940/1319 [00:02<00:01, 353.24it/s]
 74%|███████▍  | 976/1319 [00:02<00:00, 349.38it/s]
 77%|███████▋  | 1012/1319 [00:02<00:00, 350.07it/s]
 79%|███████▉  | 1048/1319 [00:02<00:00, 350.48it/s]
 82%|████████▏ | 1084/1319 [00:03<00:00, 350.96it/s]
 85%|████████▍ | 1120/1319 [00:03<00:00, 351.32it/s]
 88%|████████▊ | 1156/1319 [00:03<00:00, 351.31it/s]
 90%|█████████ | 1192/1319 [00:03<00:00, 344.74it/s]
 93%|█████████▎| 1227/1319 [00:03<00:00, 346.04it/s]
 96%|█████████▌| 1263/1319 [00:03<00:00, 347.75it/s]
 98%|█████████▊| 1299/1319 [00:03<00:00, 348.87it/s]
100%|██████████| 1319/1319 [00:03<00:00, 352.38it/s]
2026-02-23:22:26:44,487 INFO     [task.py:415] Building contexts for boolq on rank 0...

  0%|          | 0/3270 [00:00<?, ?it/s]
  9%|▉         | 291/3270 [00:00<00:01, 2905.92it/s]
 18%|█▊        | 587/3270 [00:00<00:00, 2934.50it/s]
 27%|██▋       | 881/3270 [00:00<00:00, 2905.24it/s]
 36%|███▌      | 1178/3270 [00:00<00:00, 2925.27it/s]
 45%|████▌     | 1475/3270 [00:00<00:00, 2939.40it/s]
 54%|█████▍    | 1772/3270 [00:00<00:00, 2947.91it/s]
 63%|██████▎   | 2070/3270 [00:00<00:00, 2957.06it/s]
 72%|███████▏  | 2367/3270 [00:00<00:00, 2959.84it/s]
 81%|████████▏ | 2663/3270 [00:00<00:00, 2938.01it/s]
 90%|█████████ | 2959/3270 [00:01<00:00, 2943.64it/s]
100%|█████████▉| 3256/3270 [00:01<00:00, 2948.88it/s]
100%|██████████| 3270/3270 [00:01<00:00, 2942.06it/s]
2026-02-23:22:26:45,688 INFO     [task.py:415] Building contexts for arc_challenge on rank 0...

  0%|          | 0/1172 [00:00<?, ?it/s]
 15%|█▍        | 171/1172 [00:00<00:00, 1700.99it/s]
 29%|██▉       | 344/1172 [00:00<00:00, 1712.25it/s]
 44%|████▍     | 516/1172 [00:00<00:00, 1697.41it/s]
 59%|█████▊    | 688/1172 [00:00<00:00, 1705.76it/s]
 73%|███████▎  | 860/1172 [00:00<00:00, 1710.80it/s]
 88%|████████▊ | 1033/1172 [00:00<00:00, 1714.56it/s]
100%|██████████| 1172/1172 [00:00<00:00, 1710.01it/s]
2026-02-23:22:26:46,421 INFO     [evaluator.py:496] Running loglikelihood requests

Running loglikelihood requests:   0%|          | 0/119655 [00:00<?, ?it/s]
Running loglikelihood requests:   0%|          | 1/119655 [00:15<525:18:09, 15.80s/it]
Running loglikelihood requests:   0%|          | 31/119655 [00:16<12:21:56,  2.69it/s]
Running loglikelihood requests:   0%|          | 61/119655 [00:16<5:21:31,  6.20it/s]
Running loglikelihood requests:   0%|          | 89/119655 [00:16<3:10:20, 10.47it/s]
Running loglikelihood requests:   0%|          | 119/119655 [00:17<2:01:46, 16.36it/s]
Running loglikelihood requests:   0%|          | 151/119655 [00:17<1:21:59, 24.29it/s]
Running loglikelihood requests:   0%|          | 179/119655 [00:17<1:01:39, 32.30it/s]
Running loglikelihood requests:   0%|          | 209/119655 [00:17<47:01, 42.33it/s]
Running loglikelihood requests:   0%|          | 241/119655 [00:18<36:40, 54.27it/s]
Running loglikelihood requests:   0%|          | 273/119655 [00:18<29:59, 66.34it/s]
Running loglikelihood requests:   0%|          | 305/119655 [00:18<25:33, 77.82it/s]
Running loglikelihood requests:   0%|          | 337/119655 [00:18<22:18, 89.13it/s]
Running loglikelihood requests:   0%|          | 365/119655 [00:19<20:52, 95.26it/s]
Running loglikelihood requests:   0%|          | 397/119655 [00:19<19:05, 104.09it/s]
Running loglikelihood requests:   0%|          | 429/119655 [00:19<17:47, 111.70it/s]
Running loglikelihood requests:   0%|          | 461/119655 [00:19<16:45, 118.56it/s]
Running loglikelihood requests:   0%|          | 491/119655 [00:20<16:20, 121.59it/s]
Running loglikelihood requests:   0%|          | 519/119655 [00:20<16:19, 121.57it/s]
Running loglikelihood requests:   0%|          | 551/119655 [00:20<15:31, 127.87it/s]
Running loglikelihood requests:   0%|          | 583/119655 [00:20<14:57, 132.64it/s]
Running loglikelihood requests:   1%|          | 613/119655 [00:20<14:50, 133.64it/s]
Running loglikelihood requests:   1%|          | 643/119655 [00:21<14:45, 134.48it/s]
Running loglikelihood requests:   1%|          | 673/119655 [00:21<14:38, 135.51it/s]
Running loglikelihood requests:   1%|          | 705/119655 [00:21<14:15, 139.01it/s]
Running loglikelihood requests:   1%|          | 735/119655 [00:21<14:15, 139.00it/s]
Running loglikelihood requests:   1%|          | 765/119655 [00:22<14:14, 139.21it/s]
Running loglikelihood requests:   1%|          | 797/119655 [00:22<13:56, 142.12it/s]
Running loglikelihood requests:   1%|          | 829/119655 [00:22<13:43, 144.32it/s]
Running loglikelihood requests:   1%|          | 861/119655 [00:22<13:34, 145.93it/s]
Running loglikelihood requests:   1%|          | 893/119655 [00:22<13:26, 147.17it/s]
Running loglikelihood requests:   1%|          | 925/119655 [00:23<13:06, 150.93it/s]
Running loglikelihood requests:   1%|          | 957/119655 [00:23<12:51, 153.78it/s]
Running loglikelihood requests:   1%|          | 987/119655 [00:23<12:55, 152.99it/s]
Running loglikelihood requests:   1%|          | 1017/119655 [00:23<12:58, 152.45it/s]
Running loglikelihood requests:   1%|          | 1047/119655 [00:23<12:59, 152.13it/s]
Running loglikelihood requests:   1%|          | 1079/119655 [00:24<12:44, 155.17it/s]
Running loglikelihood requests:   1%|          | 1109/119655 [00:24<12:47, 154.44it/s]
Running loglikelihood requests:   1%|          | 1139/119655 [00:24<12:49, 154.02it/s]
Running loglikelihood requests:   1%|          | 1171/119655 [00:24<12:35, 156.74it/s]
Running loglikelihood requests:   1%|          | 1199/119655 [00:24<12:55, 152.73it/s]
Running loglikelihood requests:   1%|          | 1231/119655 [00:25<12:38, 156.23it/s]
Running loglikelihood requests:   1%|          | 1263/119655 [00:25<12:25, 158.73it/s]
Running loglikelihood requests:   1%|          | 1295/119655 [00:25<12:17, 160.58it/s]
Running loglikelihood requests:   1%|          | 1327/119655 [00:25<12:10, 161.94it/s]
Running loglikelihood requests:   1%|          | 1359/119655 [00:25<12:05, 162.96it/s]
Running loglikelihood requests:   1%|          | 1387/119655 [00:26<12:30, 157.67it/s]
Running loglikelihood requests:   1%|          |
2026-02-23:22:36:04,335 INFO     [evaluator.py:496] Running generate_until requests
Passed argument batch_size = auto:1. Detecting largest batch size
Determined largest batch size: 8

Running generate_until requests:   0%|          | 0/1319 [00:00<?, ?it/s]
Running generate_until requests:   0%|          | 1/1319 [00:12<4:36:02, 12.57s/it]
Running generate_until requests:   0%|          | 2/1319 [00:20<3:36:41,  9.87s/it]
Running generate_until requests:   0%|          | 3/1319 [00:23<2:25:26,  6.63s/it]
Running generate_until requests:   0%|          | 4/1319 [00:29<2:20:16,  6.40s/it]
Running generate_until requests:   0%|          | 5/1319 [00:36<2:24:38,  6.60s/it]
Running generate_until requests:   0%|          | 6/1319 [00:44<2:33:01,  6.99s/it]
Running generate_until requests:   1%|          | 7/1319 [00:46<2:01:45,  5.57s/it]
Running generate_until requests:   1%|          | 8/1319 [00:51<1:55:46,  5.30s/it]
Running generate_until requests:   1%|          | 9/1319 [00:54<1:40:41,  4.61s/it]
Running generate_until requests:   1%|          | 10/1319 [00:57<1:27:43,  4.02s/it]
Running generate_until requests:   1%|          | 11/1319 [01:02<1:38:32,  4.52s/it]
Running generate_until requests:   1%|          | 12/1319 [01:05<1:27:08,  4.00s/it]
Running generate_until requests:   1%|          | 13/1319 [01:12<1:46:33,  4.90s/it]
Running generate_until requests:   1%|          | 14/1319 [01:19<2:01:59,  5.61s/it]
Running generate_until requests:   1%|          | 15/1319 [01:26<2:05:06,  5.76s/it]
Running generate_until requests:   1%|          | 16/1319 [01:30<1:55:19,  5.31s/it]
Running generate_until requests:   1%|▏         | 17/1319 [01:33<1:42:36,  4.73s/it]
Running generate_until requests:   1%|▏         | 18/1319 [01:41<2:02:10,  5.63s/it]
Running generate_until requests:   1%|▏         | 19/1319 [01:44<1:42:43,  4.74s/it]
Running generate_until requests:   2%|▏         | 20/1319 [01:49<1:47:40,  4.97s/it]
Running generate_until requests:   2%|▏         | 21/1319 [01:52<1:34:11,  4.35s/it]
Running generate_until requests:   2%|▏         | 22/1319 [01:55<1:25:34,  3.96s/it]
Running generate_until requests:   2%|▏         | 23/1319 [02:00<1:33:10,  4.31s/it]
Running generate_until requests:   2%|▏         | 24/1319 [02:04<1:30:40,  4.20s/it]
Running generate_until requests:   2%|▏         | 25/1319 [02:07<1:21:23,  3.77s/it]
Running generate_until requests:   2%|▏         | 26/1319 [02:10<1:16:00,  3.53s/it]
Running generate_until requests:   2%|▏         | 27/1319 [02:15<1:24:03,  3.90s/it]
Running generate_until requests:   2%|▏         | 28/1319 [02:19<1:29:05,  4.14s/it]
Running generate_until requests:   2%|▏         | 29/1319 [02:25<1:38:56,  4.60s/it]
Running generate_until requests:   2%|▏         | 30/1319 [02:28<1:27:40,  4.08s/it]
Running generate_until requests:   2%|▏         | 31/1319 [02:36<1:51:45,  5.21s/it]
Running generate_until requests:   2%|▏         | 32/1319 [02:40<1:48:44,  5.07s/it]
Running generate_until requests:   3%|▎         | 33/1319 [02:48<2:06:39,  5.91s/it]
Running generate_until requests:   3%|▎         | 34/1319 [02:51<1:45:39,  4.93s/it]
Running generate_until requests:   3%|▎         | 35/1319 [02:55<1:38:12,  4.59s/it]
Running generate_until requests:   3%|▎         | 36/1319 [03:00<1:45:30,  4.93s/it]
Running generate_until requests:   3%|▎         | 37/1319 [03:06<1:50:03,  5.15s/it]
Running generate_until requests:   3%|▎         | 38/1319 [03:11<1:48:09,  5.07s/it]
Running generate_until requests:   3%|▎         | 39/1319 [03:15<1:41:15,  4.75s/it]
Running generate_until requests:   3%|▎         | 40/1319 [03:21<1:48:31,  5.09s/it]
Running generate_until requests:   3%|▎         | 41/1319 [03:25<1:39:59,  4.69s/it]
Running generate_until requests:   3%|▎         | 42/1319 [03:32<1:55:27,  5.43s/it]
Running generate_until requests:   3%|▎         | 43/1319 [03:37<1:56:02,  5.46s/it]
Running generate_until requests:   3%|▎         | 44/1319 [03:41<1:45:31,  4.97s/it]
Running generate_until requests:   3%|▎         | 45/1319 [03:43<1:27:40,  4.13s/it]
Running generate_until requests:   3%|▎         | 46/1319 [03:48<1:32:24,  4.36s/it]
Running generate_until requests:   4%|▎         | 47/1319
fatal: not a git repository (or any of the parent directories): .git
2026-02-24:00:06:21,470 INFO     [evaluation_tracker.py:206] Saving results aggregated
Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 1
hf (pretrained=strykes/emberforge-3b-reasoner,trust_remote_code=True,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (8)
|                 Tasks                 |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|arc_challenge                          |      1|none            |     0|acc        |↑  |0.2875|±  |0.0132|
|                                       |       |none            |     0|acc_norm   |↑  |0.3174|±  |0.0136|
|boolq                                  |      2|none            |     0|acc        |↑  |0.7437|±  |0.0076|
|gsm8k                                  |      3|flexible-extract|     5|exact_match|↑  |0.6240|±  |0.0133|
|                                       |       |strict-match    |     5|exact_match|↑  |0.6202|±  |0.0134|
|hellaswag                              |      1|none            |     0|acc        |↑  |0.4292|±  |0.0049|
|                                       |       |none            |     0|acc_norm   |↑  |0.5607|±  |0.0050|
|mmlu                                   |      2|none            |      |acc        |↑  |0.5998|±  |0.0040|
| - humanities                          |      2|none            |      |acc        |↑  |0.5301|±  |0.0068|
|  - formal_logic                       |      1|none            |     0|acc        |↑  |0.5000|±  |0.0447|
|  - high_school_european_history       |      1|none            |     0|acc        |↑  |0.7697|±  |0.0329|
|  - high_school_us_history             |      1|none            |     0|acc        |↑  |0.7157|±  |0.0317|
|  - high_school_world_history          |      1|none            |     0|acc        |↑  |0.7975|±  |0.0262|
|  - international_law                  |      1|none            |     0|acc        |↑  |0.7851|±  |0.0375|
|  - jurisprudence                      |      1|none            |     0|acc        |↑  |0.7222|±  |0.0433|
|  - logical_fallacies                  |      1|none            |     0|acc        |↑  |0.6933|±  |0.0362|
|  - moral_disputes                     |      1|none            |     0|acc        |↑  |0.5954|±  |0.0264|
|  - moral_scenarios                    |      1|none            |     0|acc        |↑  |0.2447|±  |0.0144|
|  - philosophy                         |      1|none            |     0|acc        |↑  |0.6559|±  |0.0270|
|  - prehistory                         |      1|none            |     0|acc        |↑  |0.6265|±  |0.0269|
|  - professional_law                   |      1|none            |     0|acc        |↑  |0.4746|±  |0.0128|
|  - world_religions                    |      1|none            |     0|acc        |↑  |0.7193|±  |0.0345|
| - other                               |      2|none            |      |acc        |↑  |0.6270|±  |0.0085|
|  - business_ethics                    |      1|none            |     0|acc        |↑  |0.6200|±  |0.0488|
|  - clinical_knowledge                 |      1|none            |     0|acc        |↑  |0.6415|±  |0.0295|
|  - college_medicine                   |      1|none            |     0|acc        |↑  |0.5954|±  |0.0374|
|  - global_facts                       |      1|none            |     0|acc        |↑  |0.3300|±  |0.0473|
|  - human_aging                        |      1|none            |     0|acc        |↑  |0.6009|±  |0.0329|
|  - management                         |      1|none            |     0|acc        |↑  |0.6893|±  |0.0458|
|  - marketing                          |      1|none            |     0|acc        |↑  |0.8034|±  |0.0260|
|  - medical_genetics                   |      1|none            |     0|acc        |↑  |0.6900|±  |0.0465|
|  - miscellaneous                      |      1|none            |     0|acc        |↑  |0.6718|±  |0.0168|
|  - nutrition                          |      1|none            |     0|acc        |↑  |0.6765|±  |0.0268|
|  - professional_accounting            |      1|none            |     0|acc        |↑  |0.4397|±  |0.0296|
|  - professional_medicine              |      1|none            |     0|acc        |↑  |0.6838|±  |0.0282|
|  - virology                           |      1|none            |     0|acc        |↑  |0.4518|±  |0.0387|
| - social sciences                     |      2|none            |      |acc        |↑  |0.6906|±  |0.0081|
|  - econometrics                       |      1|none            |     0|acc        |↑  |0.3596|±  |0.0451|
|  - high_school_geography              |      1|none            |     0|acc        |↑  |0.7273|±  |0.0317|
|  - high_school_government_and_politics|      1|none            |     0|acc        |↑  |0.7461|±  |0.0314|
|  - high_school_macroeconomics         |      1|none            |     0|acc        |↑  |0.6436|±  |0.0243|
|  - high_school_microeconomics         |      1|none            |     0|acc        |↑  |0.7773|±  |0.0270|
|  - high_school_psychology             |      1|none            |     0|acc        |↑  |0.8000|±  |0.0171|
|  - human_sexuality                    |      1|none            |     0|acc        |↑  |0.6947|±  |0.0404|
|  - professional_psychology            |      1|none            |     0|acc        |↑  |0.5915|±  |0.0199|
|  - public_relations                   |      1|none            |     0|acc        |↑  |0.6000|±  |0.0469|
|  - security_studies                   |      1|none            |     0|acc        |↑  |0.7020|±  |0.0293|
|  - sociology                          |      1|none            |     0|acc        |↑  |0.7711|±  |0.0297|
|  - us_foreign_policy                  |      1|none            |     0|acc        |↑  |0.7800|±  |0.0416|
| - stem                                |      2|none            |      |acc        |↑  |0.5883|±  |0.0086|
|  - abstract_algebra                   |      1|none            |     0|acc        |↑  |0.4300|±  |0.0498|
|  - anatomy                            |      1|none            |     0|acc        |↑  |0.6074|±  |0.0422|
|  - astronomy                          |      1|none            |     0|acc        |↑  |0.6974|±  |0.0374|
|  - college_biology                    |      1|none            |     0|acc        |↑  |0.8264|±  |0.0317|
|  - college_chemistry                  |      1|none            |     0|acc        |↑  |0.5300|±  |0.0502|
|  - college_computer_science           |      1|none            |     0|acc        |↑  |0.5400|±  |0.0501|
|  - college_mathematics                |      1|none            |     0|acc        |↑  |0.5000|±  |0.0503|
|  - college_physics                    |      1|none            |     0|acc        |↑  |0.5000|±  |0.0498|
|  - computer_security                  |      1|none            |     0|acc        |↑  |0.6800|±  |0.0469|
|  - conceptual_physics                 |      1|none            |     0|acc        |↑  |0.5872|±  |0.0322|
|  - electrical_engineering             |      1|none            |     0|acc        |↑  |0.6414|±  |0.0400|
|  - elementary_mathematics             |      1|none            |     0|acc        |↑  |0.5317|±  |0.0257|
|  - high_school_biology                |      1|none            |     0|acc        |↑  |0.7548|±  |0.0245|
|  - high_school_chemistry              |      1|none            |     0|acc        |↑  |0.6010|±  |0.0345|
|  - high_school_computer_science       |      1|none            |     0|acc        |↑  |0.6900|±  |0.0465|
|  - high_school_mathematics            |      1|none            |     0|acc        |↑  |0.4556|±  |0.0304|
|  - high_school_physics                |      1|none            |     0|acc        |↑  |0.5166|±  |0.0408|
|  - high_school_statistics             |      1|none            |     0|acc        |↑  |0.5694|±  |0.0338|
|  - machine_learning                   |      1|none            |     0|acc        |↑  |0.4286|±  |0.0470|
|piqa                                   |      1|none            |     0|acc        |↑  |0.6328|±  |0.0112|
|                                       |       |none            |     0|acc_norm   |↑  |0.6322|±  |0.0113|
|truthfulqa_mc2                         |      2|none            |     0|acc        |↑  |0.4534|±  |0.0160|
|winogrande                             |      1|none            |     0|acc        |↑  |0.5004|±  |0.0141|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.5998|±  |0.0040|
| - humanities     |      2|none  |      |acc   |↑  |0.5301|±  |0.0068|
| - other          |      2|none  |      |acc   |↑  |0.6270|±  |0.0085|
| - social sciences|      2|none  |      |acc   |↑  |0.6906|±  |0.0081|
| - stem           |      2|none  |      |acc   |↑  |0.5883|±  |0.0086|