427 lines
528 KiB
Plaintext
2026-02-23:22:20:49,920 INFO [__main__.py:279] Verbosity set to INFO
2026-02-23:22:20:56,465 INFO [__main__.py:376] Selected Tasks: ['arc_challenge', 'boolq', 'gsm8k', 'hellaswag', 'mmlu', 'piqa', 'truthfulqa_mc2', 'winogrande']
2026-02-23:22:20:56,466 INFO [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-02-23:22:20:56,466 INFO [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'strykes/emberforge-3b-reasoner', 'trust_remote_code': True, 'dtype': 'float16'}
2026-02-23:22:20:56,650 INFO [huggingface.py:132] Using device 'cuda:0'
2026-02-23:22:20:59,192 INFO [huggingface.py:369] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
Downloading shards: 0%| | 0/2 [00:00<?, ?it/s]
Downloading shards: 50%|█████ | 1/2 [01:39<01:39, 99.69s/it]
Downloading shards: 100%|██████████| 2/2 [02:35<00:00, 73.71s/it]
Downloading shards: 100%|██████████| 2/2 [02:35<00:00, 77.60s/it]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:01<00:01, 1.58s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00, 1.38s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00, 1.41s/it]
Generating train split: 0%| | 0/1119 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 1119/1119 [00:00<00:00, 104717.23 examples/s]
Generating test split: 0%| | 0/1172 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1172/1172 [00:00<00:00, 190310.66 examples/s]
Generating validation split: 0%| | 0/299 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 299/299 [00:00<00:00, 71613.57 examples/s]
2026-02-23:22:23:41,927 WARNING [task.py:800] [Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
2026-02-23:22:23:41,927 WARNING [task.py:812] [Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True
Generating train split: 0%| | 0/9427 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 9427/9427 [00:00<00:00, 163206.46 examples/s]
Generating validation split: 0%| | 0/3270 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 3270/3270 [00:00<00:00, 177487.86 examples/s]
Generating test split: 0%| | 0/3245 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 3245/3245 [00:00<00:00, 202860.45 examples/s]
Generating train split: 0%| | 0/7473 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 331458.42 examples/s]
Generating test split: 0%| | 0/1319 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 190243.71 examples/s]
Generating train split: 0%| | 0/39905 [00:00<?, ? examples/s]
Generating train split: 48%|████▊ | 19000/39905 [00:00<00:00, 181468.24 examples/s]
Generating train split: 100%|██████████| 39905/39905 [00:00<00:00, 167115.13 examples/s]
Generating train split: 100%|██████████| 39905/39905 [00:00<00:00, 168677.41 examples/s]
Generating test split: 0%| | 0/10003 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 10003/10003 [00:00<00:00, 170656.06 examples/s]
Generating validation split: 0%| | 0/10042 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 10042/10042 [00:00<00:00, 170143.53 examples/s]
Map: 0%| | 0/39905 [00:00<?, ? examples/s]
Map: 3%|▎ | 1000/39905 [00:00<00:04, 8689.85 examples/s]
Map: 6%|▌ | 2228/39905 [00:00<00:03, 10674.69 examples/s]
Map: 9%|▊ | 3471/39905 [00:00<00:03, 11444.97 examples/s]
Map: 12%|█▏ | 4716/39905 [00:00<00:02, 11781.93 examples/s]
Map: 15%|█▍ | 5955/39905 [00:00<00:02, 11993.75 examples/s]
Map: 19%|█▉ | 7730/39905 [00:00<00:02, 11921.15 examples/s]
Map: 22%|██▏ | 8966/39905 [00:00<00:02, 12045.60 examples/s]
Map: 27%|██▋ | 10735/39905 [00:00<00:02, 11945.17 examples/s]
Map: 30%|██▉ | 11964/39905 [00:01<00:02, 12035.00 examples/s]
Map: 35%|███▍ | 13818/39905 [00:01<00:03, 8183.79 examples/s]
Map: 38%|███▊ | 15000/39905 [00:01<00:02, 8653.80 examples/s]
Map: 42%|████▏ | 16641/39905 [00:01<00:02, 9280.99 examples/s]
Map: 44%|████▍ | 17734/39905 [00:01<00:02, 9630.18 examples/s]
Map: 48%|████▊ | 19206/39905 [00:01<00:02, 9685.69 examples/s]
Map: 51%|█████ | 20290/39905 [00:01<00:01, 9949.61 examples/s]
Map: 54%|█████▎ | 21382/39905 [00:02<00:01, 10187.91 examples/s]
Map: 58%|█████▊ | 23000/39905 [00:02<00:01, 10154.04 examples/s]
Map: 60%|██████ | 24082/39905 [00:02<00:01, 10315.85 examples/s]
Map: 65%|██████▍ | 25780/39905 [00:02<00:01, 10657.08 examples/s]
Map: 67%|██████▋ | 26861/39905 [00:02<00:01, 10691.87 examples/s]
Map: 70%|███████ | 28000/39905 [00:02<00:01, 6777.21 examples/s]
Map: 73%|███████▎ | 29088/39905 [00:03<00:01, 7546.68 examples/s]
Map: 76%|███████▌ | 30185/39905 [00:03<00:01, 8269.78 examples/s]
Map: 78%|███████▊ | 31286/39905 [00:03<00:00, 8905.40 examples/s]
Map: 81%|████████ | 32376/39905 [00:03<00:00, 9400.69 examples/s]
Map: 84%|████████▍ | 33644/39905 [00:03<00:00, 9906.21 examples/s]
Map: 87%|████████▋ | 34732/39905 [00:03<00:00, 10163.46 examples/s]
Map: 90%|████████▉ | 35826/39905 [00:03<00:00, 10374.23 examples/s]
Map: 93%|█████████▎| 36913/39905 [00:03<00:00, 10510.61 examples/s]
Map: 96%|█████████▌| 38369/39905 [00:03<00:00, 10206.50 examples/s]
Map: 99%|█████████▉| 39643/39905 [00:04<00:00, 10420.36 examples/s]
Map: 100%|██████████| 39905/39905 [00:04<00:00, 9868.52 examples/s]
Map: 0%| | 0/10042 [00:00<?, ? examples/s]
Map: 5%|▍ | 467/10042 [00:00<00:06, 1467.66 examples/s]
Map: 17%|█▋ | 1711/10042 [00:00<00:01, 4758.88 examples/s]
Map: 29%|██▉ | 2942/10042 [00:00<00:01, 7038.55 examples/s]
Map: 40%|███▉ | 4000/10042 [00:00<00:00, 7798.85 examples/s]
Map: 50%|█████ | 5061/10042 [00:00<00:00, 8619.43 examples/s]
Map: 61%|██████ | 6146/10042 [00:00<00:00, 9275.74 examples/s]
Map: 72%|███████▏ | 7242/10042 [00:00<00:00, 9773.07 examples/s]
Map: 83%|████████▎ | 8342/10042 [00:01<00:00, 10136.59 examples/s]
Map: 94%|█████████▍| 9444/10042 [00:01<00:00, 10396.60 examples/s]
Map: 100%|██████████| 10042/10042 [00:01<00:00, 8335.19 examples/s]
Generating test split: 0%| | 0/171 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 171/171 [00:00<00:00, 37968.55 examples/s]
Generating validation split: 0%| | 0/19 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 19/19 [00:00<00:00, 5885.66 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1641.35 examples/s]
Generating test split: 0%| | 0/1534 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1534/1534 [00:00<00:00, 114719.84 examples/s]
Generating validation split: 0%| | 0/170 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 170/170 [00:00<00:00, 40494.76 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1663.75 examples/s]
Generating test split: 0%| | 0/324 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 324/324 [00:00<00:00, 74085.73 examples/s]
Generating validation split: 0%| | 0/35 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 35/35 [00:00<00:00, 11826.36 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1770.05 examples/s]
Generating test split: 0%| | 0/311 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 311/311 [00:00<00:00, 58275.04 examples/s]
Generating validation split: 0%| | 0/34 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 34/34 [00:00<00:00, 10464.22 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1711.26 examples/s]
Generating test split: 0%| | 0/895 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 895/895 [00:00<00:00, 148899.37 examples/s]
Generating validation split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 100/100 [00:00<00:00, 30847.28 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1844.62 examples/s]
Generating test split: 0%| | 0/346 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 346/346 [00:00<00:00, 69185.22 examples/s]
Generating validation split: 0%| | 0/38 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 38/38 [00:00<00:00, 12485.98 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1842.19 examples/s]
Generating test split: 0%| | 0/163 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 163/163 [00:00<00:00, 41613.70 examples/s]
Generating validation split: 0%| | 0/18 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 18/18 [00:00<00:00, 6451.67 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1825.67 examples/s]
Generating test split: 0%| | 0/108 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 108/108 [00:00<00:00, 27090.77 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3737.63 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1764.09 examples/s]
Generating test split: 0%| | 0/121 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 121/121 [00:00<00:00, 30838.60 examples/s]
Generating validation split: 0%| | 0/13 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 13/13 [00:00<00:00, 4453.64 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1898.73 examples/s]
Generating test split: 0%| | 0/237 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 237/237 [00:00<00:00, 36752.69 examples/s]
Generating validation split: 0%| | 0/26 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 26/26 [00:00<00:00, 8093.51 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1829.50 examples/s]
Generating test split: 0%| | 0/204 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 204/204 [00:00<00:00, 38618.79 examples/s]
Generating validation split: 0%| | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7323.97 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1757.29 examples/s]
Generating test split: 0%| | 0/165 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 165/165 [00:00<00:00, 24043.22 examples/s]
Generating validation split: 0%| | 0/18 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 18/18 [00:00<00:00, 5664.58 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1725.34 examples/s]
Generating test split: 0%| | 0/126 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 126/126 [00:00<00:00, 29231.83 examples/s]
Generating validation split: 0%| | 0/14 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 14/14 [00:00<00:00, 4548.78 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1669.44 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 21143.84 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3442.57 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1674.91 examples/s]
Generating test split: 0%| | 0/201 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 201/201 [00:00<00:00, 44788.56 examples/s]
Generating validation split: 0%| | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7071.40 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1800.59 examples/s]
Generating test split: 0%| | 0/245 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 245/245 [00:00<00:00, 46390.88 examples/s]
Generating validation split: 0%| | 0/27 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 27/27 [00:00<00:00, 9020.01 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1747.92 examples/s]
Generating test split: 0%| | 0/110 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 110/110 [00:00<00:00, 29267.54 examples/s]
Generating validation split: 0%| | 0/12 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 12/12 [00:00<00:00, 4261.78 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1776.80 examples/s]
Generating test split: 0%| | 0/612 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 612/612 [00:00<00:00, 119764.57 examples/s]
Generating validation split: 0%| | 0/69 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 69/69 [00:00<00:00, 21951.38 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1835.90 examples/s]
Generating test split: 0%| | 0/131 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 131/131 [00:00<00:00, 35366.49 examples/s]
Generating validation split: 0%| | 0/12 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 12/12 [00:00<00:00, 4178.29 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1850.16 examples/s]
Generating test split: 0%| | 0/545 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 545/545 [00:00<00:00, 120303.97 examples/s]
Generating validation split: 0%| | 0/60 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 60/60 [00:00<00:00, 19244.34 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1786.79 examples/s]
Generating test split: 0%| | 0/238 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 238/238 [00:00<00:00, 49624.40 examples/s]
Generating validation split: 0%| | 0/26 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 26/26 [00:00<00:00, 8633.67 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1837.35 examples/s]
Generating test split: 0%| | 0/390 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 390/390 [00:00<00:00, 81341.55 examples/s]
Generating validation split: 0%| | 0/43 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 43/43 [00:00<00:00, 13817.14 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1855.39 examples/s]
Generating test split: 0%| | 0/193 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 193/193 [00:00<00:00, 33528.03 examples/s]
Generating validation split: 0%| | 0/21 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 21/21 [00:00<00:00, 7201.40 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1696.72 examples/s]
Generating test split: 0%| | 0/198 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 198/198 [00:00<00:00, 55687.80 examples/s]
Generating validation split: 0%| | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 8794.77 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 2192.99 examples/s]
Generating test split: 0%| | 0/114 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 114/114 [00:00<00:00, 29388.49 examples/s]
Generating validation split: 0%| | 0/12 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 12/12 [00:00<00:00, 4193.95 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1748.06 examples/s]
Generating test split: 0%| | 0/166 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 166/166 [00:00<00:00, 40079.12 examples/s]
Generating validation split: 0%| | 0/18 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 18/18 [00:00<00:00, 6069.42 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1699.47 examples/s]
Generating test split: 0%| | 0/272 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 272/272 [00:00<00:00, 57343.59 examples/s]
Generating validation split: 0%| | 0/31 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 31/31 [00:00<00:00, 12762.41 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 2071.47 examples/s]
Generating test split: 0%| | 0/282 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 282/282 [00:00<00:00, 63724.68 examples/s]
Generating validation split: 0%| | 0/31 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 31/31 [00:00<00:00, 10047.40 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1768.55 examples/s]
Generating test split: 0%| | 0/306 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 306/306 [00:00<00:00, 68316.23 examples/s]
Generating validation split: 0%| | 0/33 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 33/33 [00:00<00:00, 11038.52 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1785.11 examples/s]
Generating test split: 0%| | 0/783 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 783/783 [00:00<00:00, 170524.95 examples/s]
Generating validation split: 0%| | 0/86 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 86/86 [00:00<00:00, 29108.31 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1792.44 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 21843.06 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3478.65 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1695.08 examples/s]
Generating test split: 0%| | 0/234 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 234/234 [00:00<00:00, 55095.27 examples/s]
Generating validation split: 0%| | 0/25 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 25/25 [00:00<00:00, 8335.26 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1762.31 examples/s]
Generating test split: 0%| | 0/103 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 103/103 [00:00<00:00, 22725.58 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3748.26 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1765.13 examples/s]
Generating test split: 0%| | 0/223 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 223/223 [00:00<00:00, 57157.77 examples/s]
Generating validation split: 0%| | 0/23 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 23/23 [00:00<00:00, 7942.45 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1843.49 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26542.87 examples/s]
Generating validation split: 0%| | 0/10 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 10/10 [00:00<00:00, 3438.52 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1798.90 examples/s]
Generating test split: 0%| | 0/173 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 173/173 [00:00<00:00, 40555.25 examples/s]
Generating validation split: 0%| | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7402.70 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1801.37 examples/s]
Generating test split: 0%| | 0/265 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 265/265 [00:00<00:00, 58849.50 examples/s]
Generating validation split: 0%| | 0/29 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 29/29 [00:00<00:00, 8974.09 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1707.22 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26497.59 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3834.87 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1721.94 examples/s]
Generating test split: 0%| | 0/112 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 112/112 [00:00<00:00, 22252.00 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3453.65 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1729.75 examples/s]
Generating test split: 0%| | 0/216 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 216/216 [00:00<00:00, 46824.98 examples/s]
Generating validation split: 0%| | 0/23 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 23/23 [00:00<00:00, 7772.24 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1810.54 examples/s]
Generating test split: 0%| | 0/151 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 151/151 [00:00<00:00, 32208.09 examples/s]
Generating validation split: 0%| | 0/17 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 17/17 [00:00<00:00, 5617.96 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1723.21 examples/s]
Generating test split: 0%| | 0/270 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 270/270 [00:00<00:00, 68981.06 examples/s]
Generating validation split: 0%| | 0/29 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 29/29 [00:00<00:00, 10001.22 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1856.54 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 25877.99 examples/s]
Generating validation split: 0%| | 0/9 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 9/9 [00:00<00:00, 3217.86 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1814.15 examples/s]
Generating test split: 0%| | 0/203 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 203/203 [00:00<00:00, 51602.65 examples/s]
Generating validation split: 0%| | 0/22 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 22/22 [00:00<00:00, 7454.73 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1839.12 examples/s]
Generating test split: 0%| | 0/310 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 310/310 [00:00<00:00, 68028.79 examples/s]
Generating validation split: 0%| | 0/32 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 32/32 [00:00<00:00, 10385.15 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1813.36 examples/s]
Generating test split: 0%| | 0/378 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 378/378 [00:00<00:00, 96356.32 examples/s]
Generating validation split: 0%| | 0/41 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 41/41 [00:00<00:00, 13894.03 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1857.86 examples/s]
Generating test split: 0%| | 0/145 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 145/145 [00:00<00:00, 36778.79 examples/s]
Generating validation split: 0%| | 0/16 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 16/16 [00:00<00:00, 5424.25 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1834.30 examples/s]
Generating test split: 0%| | 0/235 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 235/235 [00:00<00:00, 52170.72 examples/s]
Generating validation split: 0%| | 0/26 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 26/26 [00:00<00:00, 9269.18 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1890.35 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 23591.34 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3521.67 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1770.94 examples/s]
Generating test split: 0%| | 0/102 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 102/102 [00:00<00:00, 22729.73 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3840.30 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1871.62 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26584.93 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3895.09 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1831.73 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 25754.05 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3996.65 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1791.06 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 26693.21 examples/s]
Generating validation split: 0%| | 0/8 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 8/8 [00:00<00:00, 2736.46 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1783.75 examples/s]
Generating test split: 0%| | 0/144 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 144/144 [00:00<00:00, 37919.37 examples/s]
Generating validation split: 0%| | 0/16 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 16/16 [00:00<00:00, 5634.67 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1703.34 examples/s]
Generating test split: 0%| | 0/152 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 152/152 [00:00<00:00, 38671.25 examples/s]
Generating validation split: 0%| | 0/16 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 16/16 [00:00<00:00, 5587.28 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1857.70 examples/s]
Generating test split: 0%| | 0/135 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 135/135 [00:00<00:00, 35767.23 examples/s]
Generating validation split: 0%| | 0/14 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 14/14 [00:00<00:00, 4942.78 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1858.02 examples/s]
Generating test split: 0%| | 0/100 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 22633.99 examples/s]
Generating validation split: 0%| | 0/11 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 11/11 [00:00<00:00, 3711.18 examples/s]
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 1802.30 examples/s]
Downloading data: 0%| | 0.00/1.82M [00:00<?, ?B/s]
Downloading data: 58%|█████▊ | 1.05M/1.82M [00:00<00:00, 10.5MB/s]
Downloading data: 100%|██████████| 1.82M/1.82M [00:00<00:00, 15.2MB/s]
Downloading data: 0%| | 0.00/220k [00:00<?, ?B/s]
Downloading data: 815kB [00:00, 30.9MB/s]
Generating train split: 0%| | 0/16113 [00:00<?, ? examples/s]
Generating train split: 24%|██▎ | 3821/16113 [00:00<00:00, 38057.51 examples/s]
Generating train split: 53%|█████▎ | 8583/16113 [00:00<00:00, 43666.72 examples/s]
Generating train split: 84%|████████▎ | 13477/16113 [00:00<00:00, 45595.42 examples/s]
Generating train split: 100%|██████████| 16113/16113 [00:00<00:00, 44832.87 examples/s]
Generating test split: 0%| | 0/3084 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 3084/3084 [00:00<00:00, 50655.69 examples/s]
Generating validation split: 0%| | 0/1838 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 1838/1838 [00:00<00:00, 46175.46 examples/s]
Generating validation split: 0%| | 0/817 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 817/817 [00:00<00:00, 87483.95 examples/s]
Generating train split: 0%| | 0/40398 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 40398/40398 [00:00<00:00, 1122842.95 examples/s]
Generating test split: 0%| | 0/1767 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1767/1767 [00:00<00:00, 342784.11 examples/s]
Generating validation split: 0%| | 0/1267 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 1267/1267 [00:00<00:00, 291827.74 examples/s]
2026-02-23:22:26:19,189 INFO [task.py:415] Building contexts for winogrande on rank 0...
0%| | 0/1267 [00:00<?, ?it/s]
100%|██████████| 1267/1267 [00:00<00:00, 121592.11it/s]
2026-02-23:22:26:19,229 INFO [task.py:415] Building contexts for truthfulqa_mc2 on rank 0...
0%| | 0/817 [00:00<?, ?it/s]
13%|█▎ | 103/817 [00:00<00:00, 1023.46it/s]
26%|██▌ | 214/817 [00:00<00:00, 1073.62it/s]
40%|███▉ | 325/817 [00:00<00:00, 1088.94it/s]
53%|█████▎ | 436/817 [00:00<00:00, 1090.18it/s]
67%|██████▋ | 548/817 [00:00<00:00, 1097.75it/s]
81%|████████ | 660/817 [00:00<00:00, 1104.43it/s]
94%|█████████▍| 771/817 [00:00<00:00, 1102.25it/s]
100%|██████████| 817/817 [00:00<00:00, 1094.96it/s]
2026-02-23:22:26:20,019 INFO [task.py:415] Building contexts for piqa on rank 0...
0%| | 0/1838 [00:00<?, ?it/s]
9%|▉ | 165/1838 [00:00<00:01, 1647.90it/s]
18%|█▊ | 330/1838 [00:00<00:00, 1633.23it/s]
27%|██▋ | 499/1838 [00:00<00:00, 1657.08it/s]
36%|███▋ | 667/1838 [00:00<00:00, 1665.45it/s]
45%|████▌ | 835/1838 [00:00<00:00, 1668.69it/s]
55%|█████▍ | 1003/1838 [00:00<00:00, 1672.37it/s]
64%|██████▍ | 1172/1838 [00:00<00:00, 1675.35it/s]
73%|███████▎ | 1341/1838 [00:00<00:00, 1678.48it/s]
82%|████████▏ | 1509/1838 [00:00<00:00, 1678.79it/s]
91%|█████████▏| 1678/1838 [00:01<00:00, 1679.59it/s]
100%|██████████| 1838/1838 [00:01<00:00, 1672.18it/s]
2026-02-23:22:26:21,163 INFO [task.py:415] Building contexts for mmlu_abstract_algebra on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 945.54it/s]
100%|██████████| 100/100 [00:00<00:00, 943.82it/s]
2026-02-23:22:26:21,273 INFO [task.py:415] Building contexts for mmlu_anatomy on rank 0...
0%| | 0/135 [00:00<?, ?it/s]
71%|███████ | 96/135 [00:00<00:00, 952.47it/s]
100%|██████████| 135/135 [00:00<00:00, 952.25it/s]
2026-02-23:22:26:21,419 INFO [task.py:415] Building contexts for mmlu_astronomy on rank 0...
0%| | 0/152 [00:00<?, ?it/s]
63%|██████▎ | 96/152 [00:00<00:00, 955.44it/s]
100%|██████████| 152/152 [00:00<00:00, 956.61it/s]
2026-02-23:22:26:21,583 INFO [task.py:415] Building contexts for mmlu_college_biology on rank 0...
0%| | 0/144 [00:00<?, ?it/s]
67%|██████▋ | 97/144 [00:00<00:00, 962.33it/s]
100%|██████████| 144/144 [00:00<00:00, 962.90it/s]
2026-02-23:22:26:21,737 INFO [task.py:415] Building contexts for mmlu_college_chemistry on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
92%|█████████▏| 92/100 [00:00<00:00, 912.65it/s]
100%|██████████| 100/100 [00:00<00:00, 914.39it/s]
2026-02-23:22:26:21,850 INFO [task.py:415] Building contexts for mmlu_college_computer_science on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 943.37it/s]
100%|██████████| 100/100 [00:00<00:00, 941.60it/s]
2026-02-23:22:26:21,959 INFO [task.py:415] Building contexts for mmlu_college_mathematics on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 940.59it/s]
100%|██████████| 100/100 [00:00<00:00, 940.02it/s]
2026-02-23:22:26:22,069 INFO [task.py:415] Building contexts for mmlu_college_physics on rank 0...
0%| | 0/102 [00:00<?, ?it/s]
93%|█████████▎| 95/102 [00:00<00:00, 940.24it/s]
100%|██████████| 102/102 [00:00<00:00, 939.31it/s]
2026-02-23:22:26:22,181 INFO [task.py:415] Building contexts for mmlu_computer_security on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 940.24it/s]
100%|██████████| 100/100 [00:00<00:00, 939.87it/s]
2026-02-23:22:26:22,291 INFO [task.py:415] Building contexts for mmlu_conceptual_physics on rank 0...
0%| | 0/235 [00:00<?, ?it/s]
40%|████ | 95/235 [00:00<00:00, 946.60it/s]
81%|████████ | 190/235 [00:00<00:00, 942.23it/s]
100%|██████████| 235/235 [00:00<00:00, 943.23it/s]
2026-02-23:22:26:22,548 INFO [task.py:415] Building contexts for mmlu_electrical_engineering on rank 0...
0%| | 0/145 [00:00<?, ?it/s]
66%|██████▌ | 95/145 [00:00<00:00, 944.79it/s]
100%|██████████| 145/145 [00:00<00:00, 944.84it/s]
2026-02-23:22:26:22,706 INFO [task.py:415] Building contexts for mmlu_elementary_mathematics on rank 0...
0%| | 0/378 [00:00<?, ?it/s]
25%|██▍ | 94/378 [00:00<00:00, 929.87it/s]
50%|█████ | 189/378 [00:00<00:00, 938.70it/s]
75%|███████▌ | 284/378 [00:00<00:00, 941.58it/s]
100%|██████████| 378/378 [00:00<00:00, 940.85it/s]
2026-02-23:22:26:23,119 INFO [task.py:415] Building contexts for mmlu_high_school_biology on rank 0...
0%| | 0/310 [00:00<?, ?it/s]
31%|███ | 95/310 [00:00<00:00, 941.45it/s]
61%|██████▏ | 190/310 [00:00<00:00, 944.12it/s]
92%|█████████▏| 285/310 [00:00<00:00, 935.99it/s]
100%|██████████| 310/310 [00:00<00:00, 938.56it/s]
2026-02-23:22:26:23,460 INFO [task.py:415] Building contexts for mmlu_high_school_chemistry on rank 0...
0%| | 0/203 [00:00<?, ?it/s]
47%|████▋ | 95/203 [00:00<00:00, 947.64it/s]
94%|█████████▎| 190/203 [00:00<00:00, 948.38it/s]
100%|██████████| 203/203 [00:00<00:00, 947.93it/s]
2026-02-23:22:26:23,680 INFO [task.py:415] Building contexts for mmlu_high_school_computer_science on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 945.25it/s]
100%|██████████| 100/100 [00:00<00:00, 944.51it/s]
2026-02-23:22:26:23,790 INFO [task.py:415] Building contexts for mmlu_high_school_mathematics on rank 0...
0%| | 0/270 [00:00<?, ?it/s]
35%|███▌ | 95/270 [00:00<00:00, 949.07it/s]
70%|███████ | 190/270 [00:00<00:00, 945.03it/s]
100%|██████████| 270/270 [00:00<00:00, 945.24it/s]
2026-02-23:22:26:24,084 INFO [task.py:415] Building contexts for mmlu_high_school_physics on rank 0...
0%| | 0/151 [00:00<?, ?it/s]
63%|██████▎ | 95/151 [00:00<00:00, 947.05it/s]
100%|██████████| 151/151 [00:00<00:00, 945.58it/s]
2026-02-23:22:26:24,249 INFO [task.py:415] Building contexts for mmlu_high_school_statistics on rank 0...
0%| | 0/216 [00:00<?, ?it/s]
44%|████▍ | 95/216 [00:00<00:00, 944.40it/s]
88%|████████▊ | 190/216 [00:00<00:00, 945.21it/s]
100%|██████████| 216/216 [00:00<00:00, 945.48it/s]
2026-02-23:22:26:24,484 INFO [task.py:415] Building contexts for mmlu_machine_learning on rank 0...
0%| | 0/112 [00:00<?, ?it/s]
85%|████████▍ | 95/112 [00:00<00:00, 941.58it/s]
100%|██████████| 112/112 [00:00<00:00, 941.07it/s]
2026-02-23:22:26:24,607 INFO [task.py:415] Building contexts for mmlu_business_ethics on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 949.73it/s]
100%|██████████| 100/100 [00:00<00:00, 949.38it/s]
2026-02-23:22:26:24,715 INFO [task.py:415] Building contexts for mmlu_clinical_knowledge on rank 0...
0%| | 0/265 [00:00<?, ?it/s]
36%|███▌ | 95/265 [00:00<00:00, 945.55it/s]
72%|███████▏ | 190/265 [00:00<00:00, 947.38it/s]
100%|██████████| 265/265 [00:00<00:00, 948.10it/s]
2026-02-23:22:26:25,003 INFO [task.py:415] Building contexts for mmlu_college_medicine on rank 0...
0%| | 0/173 [00:00<?, ?it/s]
55%|█████▍ | 95/173 [00:00<00:00, 948.80it/s]
100%|██████████| 173/173 [00:00<00:00, 949.59it/s]
2026-02-23:22:26:25,191 INFO [task.py:415] Building contexts for mmlu_global_facts on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 948.66it/s]
100%|██████████| 100/100 [00:00<00:00, 948.33it/s]
2026-02-23:22:26:25,299 INFO [task.py:415] Building contexts for mmlu_human_aging on rank 0...
0%| | 0/223 [00:00<?, ?it/s]
43%|████▎ | 95/223 [00:00<00:00, 949.62it/s]
86%|████████▌ | 191/223 [00:00<00:00, 950.12it/s]
100%|██████████| 223/223 [00:00<00:00, 950.03it/s]
2026-02-23:22:26:25,541 INFO [task.py:415] Building contexts for mmlu_management on rank 0...
0%| | 0/103 [00:00<?, ?it/s]
92%|█████████▏| 95/103 [00:00<00:00, 946.59it/s]
100%|██████████| 103/103 [00:00<00:00, 946.55it/s]
2026-02-23:22:26:25,654 INFO [task.py:415] Building contexts for mmlu_marketing on rank 0...
0%| | 0/234 [00:00<?, ?it/s]
41%|████ | 96/234 [00:00<00:00, 950.91it/s]
82%|████████▏ | 192/234 [00:00<00:00, 949.95it/s]
100%|██████████| 234/234 [00:00<00:00, 950.57it/s]
2026-02-23:22:26:25,907 INFO [task.py:415] Building contexts for mmlu_medical_genetics on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 943.12it/s]
100%|██████████| 100/100 [00:00<00:00, 942.03it/s]
2026-02-23:22:26:26,017 INFO [task.py:415] Building contexts for mmlu_miscellaneous on rank 0...
0%| | 0/783 [00:00<?, ?it/s]
12%|█▏ | 95/783 [00:00<00:00, 945.81it/s]
24%|██▍ | 191/783 [00:00<00:00, 948.53it/s]
37%|███▋ | 287/783 [00:00<00:00, 949.40it/s]
49%|████▉ | 382/783 [00:00<00:00, 946.47it/s]
61%|██████ | 477/783 [00:00<00:00, 947.20it/s]
73%|███████▎ | 572/783 [00:00<00:00, 947.55it/s]
85%|████████▌ | 667/783 [00:00<00:00, 944.00it/s]
97%|█████████▋| 762/783 [00:00<00:00, 945.44it/s]
100%|██████████| 783/783 [00:00<00:00, 946.21it/s]
2026-02-23:22:26:26,868 INFO [task.py:415] Building contexts for mmlu_nutrition on rank 0...
0%| | 0/306 [00:00<?, ?it/s]
31%|███ | 95/306 [00:00<00:00, 945.65it/s]
62%|██████▏ | 190/306 [00:00<00:00, 942.86it/s]
93%|█████████▎| 285/306 [00:00<00:00, 945.70it/s]
100%|██████████| 306/306 [00:00<00:00, 945.33it/s]
2026-02-23:22:26:27,202 INFO [task.py:415] Building contexts for mmlu_professional_accounting on rank 0...
0%| | 0/282 [00:00<?, ?it/s]
34%|███▎ | 95/282 [00:00<00:00, 947.16it/s]
67%|██████▋ | 190/282 [00:00<00:00, 948.05it/s]
100%|██████████| 282/282 [00:00<00:00, 943.82it/s]
2026-02-23:22:26:27,510 INFO [task.py:415] Building contexts for mmlu_professional_medicine on rank 0...
0%| | 0/272 [00:00<?, ?it/s]
35%|███▍ | 95/272 [00:00<00:00, 944.17it/s]
70%|██████▉ | 190/272 [00:00<00:00, 945.88it/s]
100%|██████████| 272/272 [00:00<00:00, 945.00it/s]
2026-02-23:22:26:27,807 INFO [task.py:415] Building contexts for mmlu_virology on rank 0...
0%| | 0/166 [00:00<?, ?it/s]
57%|█████▋ | 95/166 [00:00<00:00, 949.10it/s]
100%|██████████| 166/166 [00:00<00:00, 949.47it/s]
2026-02-23:22:26:27,987 INFO [task.py:415] Building contexts for mmlu_econometrics on rank 0...
0%| | 0/114 [00:00<?, ?it/s]
83%|████████▎ | 95/114 [00:00<00:00, 949.88it/s]
100%|██████████| 114/114 [00:00<00:00, 949.11it/s]
2026-02-23:22:26:28,111 INFO [task.py:415] Building contexts for mmlu_high_school_geography on rank 0...
0%| | 0/198 [00:00<?, ?it/s]
48%|████▊ | 96/198 [00:00<00:00, 951.89it/s]
97%|█████████▋| 192/198 [00:00<00:00, 952.15it/s]
100%|██████████| 198/198 [00:00<00:00, 951.37it/s]
2026-02-23:22:26:28,325 INFO [task.py:415] Building contexts for mmlu_high_school_government_and_politics on rank 0...
0%| | 0/193 [00:00<?, ?it/s]
49%|████▉ | 95/193 [00:00<00:00, 946.77it/s]
98%|█████████▊| 190/193 [00:00<00:00, 945.46it/s]
100%|██████████| 193/193 [00:00<00:00, 945.28it/s]
2026-02-23:22:26:28,535 INFO [task.py:415] Building contexts for mmlu_high_school_macroeconomics on rank 0...
0%| | 0/390 [00:00<?, ?it/s]
24%|██▍ | 95/390 [00:00<00:00, 948.08it/s]
49%|████▊ | 190/390 [00:00<00:00, 947.04it/s]
73%|███████▎ | 285/390 [00:00<00:00, 946.92it/s]
97%|█████████▋| 380/390 [00:00<00:00, 947.41it/s]
100%|██████████| 390/390 [00:00<00:00, 946.92it/s]
2026-02-23:22:26:28,959 INFO [task.py:415] Building contexts for mmlu_high_school_microeconomics on rank 0...
0%| | 0/238 [00:00<?, ?it/s]
40%|███▉ | 95/238 [00:00<00:00, 945.65it/s]
80%|███████▉ | 190/238 [00:00<00:00, 946.72it/s]
100%|██████████| 238/238 [00:00<00:00, 946.27it/s]
2026-02-23:22:26:29,218 INFO [task.py:415] Building contexts for mmlu_high_school_psychology on rank 0...
0%| | 0/545 [00:00<?, ?it/s]
17%|█▋ | 95/545 [00:00<00:00, 948.64it/s]
35%|███▍ | 190/545 [00:00<00:00, 949.16it/s]
52%|█████▏ | 285/545 [00:00<00:00, 949.21it/s]
70%|██████▉ | 381/545 [00:00<00:00, 949.77it/s]
87%|████████▋ | 476/545 [00:00<00:00, 949.18it/s]
100%|██████████| 545/545 [00:00<00:00, 948.27it/s]
2026-02-23:22:26:29,810 INFO [task.py:415] Building contexts for mmlu_human_sexuality on rank 0...
0%| | 0/131 [00:00<?, ?it/s]
72%|███████▏ | 94/131 [00:00<00:00, 938.62it/s]
100%|██████████| 131/131 [00:00<00:00, 939.31it/s]
2026-02-23:22:26:29,954 INFO [task.py:415] Building contexts for mmlu_professional_psychology on rank 0...
0%| | 0/612 [00:00<?, ?it/s]
12%|█▏ | 72/612 [00:00<00:03, 146.32it/s]
27%|██▋ | 167/612 [00:00<00:01, 326.78it/s]
43%|████▎ | 262/612 [00:00<00:00, 477.10it/s]
58%|█████▊ | 357/612 [00:00<00:00, 597.63it/s]
74%|███████▎ | 451/612 [00:00<00:00, 688.95it/s]
89%|████████▉ | 545/612 [00:00<00:00, 757.74it/s]
100%|██████████| 612/612 [00:01<00:00, 572.17it/s]
2026-02-23:22:26:31,042 INFO [task.py:415] Building contexts for mmlu_public_relations on rank 0...
0%| | 0/110 [00:00<?, ?it/s]
85%|████████▌ | 94/110 [00:00<00:00, 939.05it/s]
100%|██████████| 110/110 [00:00<00:00, 940.54it/s]
2026-02-23:22:26:31,163 INFO [task.py:415] Building contexts for mmlu_security_studies on rank 0...
0%| | 0/245 [00:00<?, ?it/s]
39%|███▉ | 95/245 [00:00<00:00, 941.24it/s]
78%|███████▊ | 190/245 [00:00<00:00, 936.74it/s]
100%|██████████| 245/245 [00:00<00:00, 936.87it/s]
2026-02-23:22:26:31,433 INFO [task.py:415] Building contexts for mmlu_sociology on rank 0...
0%| | 0/201 [00:00<?, ?it/s]
45%|████▍ | 90/201 [00:00<00:00, 893.54it/s]
92%|█████████▏| 184/201 [00:00<00:00, 920.66it/s]
100%|██████████| 201/201 [00:00<00:00, 918.57it/s]
2026-02-23:22:26:31,658 INFO [task.py:415] Building contexts for mmlu_us_foreign_policy on rank 0...
0%| | 0/100 [00:00<?, ?it/s]
95%|█████████▌| 95/100 [00:00<00:00, 947.47it/s]
100%|██████████| 100/100 [00:00<00:00, 945.65it/s]
2026-02-23:22:26:31,768 INFO [task.py:415] Building contexts for mmlu_formal_logic on rank 0...
0%| | 0/126 [00:00<?, ?it/s]
75%|███████▌ | 95/126 [00:00<00:00, 944.25it/s]
100%|██████████| 126/126 [00:00<00:00, 942.49it/s]
2026-02-23:22:26:31,906 INFO [task.py:415] Building contexts for mmlu_high_school_european_history on rank 0...
0%| | 0/165 [00:00<?, ?it/s]
56%|█████▋ | 93/165 [00:00<00:00, 924.18it/s]
100%|██████████| 165/165 [00:00<00:00, 930.68it/s]
2026-02-23:22:26:32,089 INFO [task.py:415] Building contexts for mmlu_high_school_us_history on rank 0...
0%| | 0/204 [00:00<?, ?it/s]
46%|████▌ | 94/204 [00:00<00:00, 933.50it/s]
92%|█████████▏| 188/204 [00:00<00:00, 934.77it/s]
100%|██████████| 204/204 [00:00<00:00, 934.34it/s]
2026-02-23:22:26:32,314 INFO [task.py:415] Building contexts for mmlu_high_school_world_history on rank 0...
0%| | 0/237 [00:00<?, ?it/s]
40%|███▉ | 94/237 [00:00<00:00, 934.21it/s]
79%|███████▉ | 188/237 [00:00<00:00, 913.62it/s]
100%|██████████| 237/237 [00:00<00:00, 920.74it/s]
2026-02-23:22:26:32,580 INFO [task.py:415] Building contexts for mmlu_international_law on rank 0...
0%| | 0/121 [00:00<?, ?it/s]
79%|███████▊ | 95/121 [00:00<00:00, 946.22it/s]
100%|██████████| 121/121 [00:00<00:00, 944.12it/s]
2026-02-23:22:26:32,712 INFO [task.py:415] Building contexts for mmlu_jurisprudence on rank 0...
0%| | 0/108 [00:00<?, ?it/s]
88%|████████▊ | 95/108 [00:00<00:00, 944.84it/s]
100%|██████████| 108/108 [00:00<00:00, 943.78it/s]
2026-02-23:22:26:32,830 INFO [task.py:415] Building contexts for mmlu_logical_fallacies on rank 0...
0%| | 0/163 [00:00<?, ?it/s]
58%|█████▊ | 94/163 [00:00<00:00, 938.13it/s]
100%|██████████| 163/163 [00:00<00:00, 940.19it/s]
2026-02-23:22:26:33,009 INFO [task.py:415] Building contexts for mmlu_moral_disputes on rank 0...
0%| | 0/346 [00:00<?, ?it/s]
27%|██▋ | 94/346 [00:00<00:00, 937.24it/s]
55%|█████▍ | 189/346 [00:00<00:00, 940.45it/s]
82%|████████▏ | 284/346 [00:00<00:00, 943.35it/s]
100%|██████████| 346/346 [00:00<00:00, 943.49it/s]
2026-02-23:22:26:33,386 INFO [task.py:415] Building contexts for mmlu_moral_scenarios on rank 0...
0%| | 0/895 [00:00<?, ?it/s]
11%|█ | 95/895 [00:00<00:00, 943.56it/s]
21%|██ | 190/895 [00:00<00:00, 944.54it/s]
32%|███▏ | 285/895 [00:00<00:00, 943.91it/s]
42%|████▏ | 380/895 [00:00<00:00, 944.20it/s]
53%|█████▎ | 475/895 [00:00<00:00, 944.06it/s]
64%|██████▎ | 570/895 [00:00<00:00, 943.09it/s]
74%|███████▍ | 665/895 [00:00<00:00, 942.96it/s]
85%|████████▍ | 760/895 [00:00<00:00, 942.46it/s]
96%|█████████▌| 855/895 [00:00<00:00, 942.32it/s]
100%|██████████| 895/895 [00:00<00:00, 943.04it/s]
2026-02-23:22:26:34,362 INFO [task.py:415] Building contexts for mmlu_philosophy on rank 0...
0%| | 0/311 [00:00<?, ?it/s]
30%|██▉ | 93/311 [00:00<00:00, 927.99it/s]
60%|██████ | 187/311 [00:00<00:00, 930.68it/s]
91%|█████████ | 282/311 [00:00<00:00, 937.60it/s]
100%|██████████| 311/311 [00:00<00:00, 936.32it/s]
2026-02-23:22:26:34,704 INFO [task.py:415] Building contexts for mmlu_prehistory on rank 0...
0%| | 0/324 [00:00<?, ?it/s]
29%|██▉ | 95/324 [00:00<00:00, 941.61it/s]
59%|█████▊ | 190/324 [00:00<00:00, 943.38it/s]
88%|████████▊ | 285/324 [00:00<00:00, 944.04it/s]
100%|██████████| 324/324 [00:00<00:00, 943.20it/s]
2026-02-23:22:26:35,058 INFO [task.py:415] Building contexts for mmlu_professional_law on rank 0...
0%| | 0/1534 [00:00<?, ?it/s]
6%|▌ | 95/1534 [00:00<00:01, 941.24it/s]
12%|█▏ | 190/1534 [00:00<00:01, 940.89it/s]
19%|█▊ | 285/1534 [00:00<00:01, 942.03it/s]
25%|██▍ | 380/1534 [00:00<00:01, 942.92it/s]
31%|███ | 475/1534 [00:00<00:01, 943.11it/s]
37%|███▋ | 570/1534 [00:00<00:01, 943.58it/s]
43%|████▎ | 665/1534 [00:00<00:00, 943.91it/s]
50%|████▉ | 760/1534 [00:00<00:00, 943.91it/s]
56%|█████▌ | 855/1534 [00:00<00:00, 944.31it/s]
62%|██████▏ | 950/1534 [00:01<00:00, 944.68it/s]
68%|██████▊ | 1045/1534 [00:01<00:00, 944.07it/s]
74%|███████▍ | 1140/1534 [00:01<00:00, 943.84it/s]
81%|████████ | 1235/1534 [00:01<00:00, 942.03it/s]
87%|████████▋ | 1330/1534 [00:01<00:00, 942.53it/s]
93%|█████████▎| 1425/1534 [00:01<00:00, 942.63it/s]
99%|█████████▉| 1520/1534 [00:01<00:00, 943.61it/s]
100%|██████████| 1534/1534 [00:01<00:00, 943.20it/s]
2026-02-23:22:26:36,734 INFO [task.py:415] Building contexts for mmlu_world_religions on rank 0...
0%| | 0/171 [00:00<?, ?it/s]
55%|█████▍ | 94/171 [00:00<00:00, 931.33it/s]
100%|██████████| 171/171 [00:00<00:00, 936.74it/s]
2026-02-23:22:26:36,922 INFO [task.py:415] Building contexts for hellaswag on rank 0...
0%| | 0/10042 [00:00<?, ?it/s]
4%|▍ | 394/10042 [00:00<00:02, 3930.78it/s]
8%|▊ | 797/10042 [00:00<00:02, 3983.89it/s]
12%|█▏ | 1200/10042 [00:00<00:02, 4002.32it/s]
16%|█▌ | 1604/10042 [00:00<00:02, 4016.05it/s]
20%|█▉ | 2007/10042 [00:00<00:01, 4019.47it/s]
24%|██▍ | 2409/10042 [00:00<00:01, 4019.44it/s]
28%|██▊ | 2813/10042 [00:00<00:01, 4025.88it/s]
32%|███▏ | 3218/10042 [00:00<00:01, 4031.21it/s]
36%|███▌ | 3622/10042 [00:00<00:01, 4009.32it/s]
40%|████ | 4025/10042 [00:01<00:01, 4013.22it/s]
44%|████▍ | 4427/10042 [00:01<00:01, 4007.64it/s]
48%|████▊ | 4832/10042 [00:01<00:01, 4020.38it/s]
52%|█████▏ | 5238/10042 [00:01<00:01, 4030.52it/s]
56%|█████▌ | 5644/10042 [00:01<00:01, 4037.33it/s]
60%|██████ | 6048/10042 [00:01<00:01, 3990.35it/s]
64%|██████▍ | 6448/10042 [00:01<00:00, 3992.56it/s]
68%|██████▊ | 6853/10042 [00:01<00:00, 4007.66it/s]
72%|███████▏ | 7260/10042 [00:01<00:00, 4023.88it/s]
76%|███████▋ | 7665/10042 [00:01<00:00, 4029.18it/s]
80%|████████ | 8068/10042 [00:02<00:01, 1684.27it/s]
84%|████████▍ | 8477/10042 [00:02<00:00, 2048.50it/s]
89%|████████▊ | 8888/10042 [00:02<00:00, 2415.04it/s]
93%|█████████▎| 9299/10042 [00:02<00:00, 2757.67it/s]
97%|█████████▋| 9712/10042 [00:02<00:00, 3065.17it/s]
100%|██████████| 10042/10042 [00:02<00:00, 3398.78it/s]
2026-02-23:22:26:40,721 INFO [task.py:415] Building contexts for gsm8k on rank 0...
0%| | 0/1319 [00:00<?, ?it/s]
3%|▎ | 36/1319 [00:00<00:03, 355.03it/s]
6%|▌ | 73/1319 [00:00<00:03, 358.73it/s]
8%|▊ | 110/1319 [00:00<00:03, 360.53it/s]
11%|█ | 147/1319 [00:00<00:03, 361.64it/s]
14%|█▍ | 184/1319 [00:00<00:03, 357.66it/s]
17%|█▋ | 220/1319 [00:00<00:03, 355.14it/s]
19%|█▉ | 256/1319 [00:00<00:02, 354.66it/s]
22%|██▏ | 292/1319 [00:00<00:02, 354.71it/s]
25%|██▍ | 328/1319 [00:00<00:02, 354.32it/s]
28%|██▊ | 364/1319 [00:01<00:02, 353.99it/s]
30%|███ | 400/1319 [00:01<00:02, 353.76it/s]
33%|███▎ | 436/1319 [00:01<00:02, 353.77it/s]
36%|███▌ | 472/1319 [00:01<00:02, 353.75it/s]
39%|███▊ | 508/1319 [00:01<00:02, 353.54it/s]
41%|████ | 544/1319 [00:01<00:02, 353.68it/s]
44%|████▍ | 580/1319 [00:01<00:02, 353.67it/s]
47%|████▋ | 616/1319 [00:01<00:01, 353.32it/s]
49%|████▉ | 652/1319 [00:01<00:01, 353.47it/s]
52%|█████▏ | 688/1319 [00:01<00:01, 353.82it/s]
55%|█████▍ | 724/1319 [00:02<00:01, 351.95it/s]
58%|█████▊ | 760/1319 [00:02<00:01, 350.45it/s]
60%|██████ | 796/1319 [00:02<00:01, 351.61it/s]
63%|██████▎ | 832/1319 [00:02<00:01, 352.26it/s]
66%|██████▌ | 868/1319 [00:02<00:01, 353.10it/s]
69%|██████▊ | 904/1319 [00:02<00:01, 353.49it/s]
71%|███████▏ | 940/1319 [00:02<00:01, 353.24it/s]
74%|███████▍ | 976/1319 [00:02<00:00, 349.38it/s]
77%|███████▋ | 1012/1319 [00:02<00:00, 350.07it/s]
79%|███████▉ | 1048/1319 [00:02<00:00, 350.48it/s]
82%|████████▏ | 1084/1319 [00:03<00:00, 350.96it/s]
85%|████████▍ | 1120/1319 [00:03<00:00, 351.32it/s]
88%|████████▊ | 1156/1319 [00:03<00:00, 351.31it/s]
90%|█████████ | 1192/1319 [00:03<00:00, 344.74it/s]
93%|█████████▎| 1227/1319 [00:03<00:00, 346.04it/s]
96%|█████████▌| 1263/1319 [00:03<00:00, 347.75it/s]
98%|█████████▊| 1299/1319 [00:03<00:00, 348.87it/s]
100%|██████████| 1319/1319 [00:03<00:00, 352.38it/s]
2026-02-23:22:26:44,487 INFO [task.py:415] Building contexts for boolq on rank 0...
0%| | 0/3270 [00:00<?, ?it/s]
9%|▉ | 291/3270 [00:00<00:01, 2905.92it/s]
18%|█▊ | 587/3270 [00:00<00:00, 2934.50it/s]
27%|██▋ | 881/3270 [00:00<00:00, 2905.24it/s]
36%|███▌ | 1178/3270 [00:00<00:00, 2925.27it/s]
45%|████▌ | 1475/3270 [00:00<00:00, 2939.40it/s]
54%|█████▍ | 1772/3270 [00:00<00:00, 2947.91it/s]
63%|██████▎ | 2070/3270 [00:00<00:00, 2957.06it/s]
72%|███████▏ | 2367/3270 [00:00<00:00, 2959.84it/s]
81%|████████▏ | 2663/3270 [00:00<00:00, 2938.01it/s]
90%|█████████ | 2959/3270 [00:01<00:00, 2943.64it/s]
100%|█████████▉| 3256/3270 [00:01<00:00, 2948.88it/s]
100%|██████████| 3270/3270 [00:01<00:00, 2942.06it/s]
2026-02-23:22:26:45,688 INFO [task.py:415] Building contexts for arc_challenge on rank 0...
0%| | 0/1172 [00:00<?, ?it/s]
15%|█▍ | 171/1172 [00:00<00:00, 1700.99it/s]
29%|██▉ | 344/1172 [00:00<00:00, 1712.25it/s]
44%|████▍ | 516/1172 [00:00<00:00, 1697.41it/s]
59%|█████▊ | 688/1172 [00:00<00:00, 1705.76it/s]
73%|███████▎ | 860/1172 [00:00<00:00, 1710.80it/s]
88%|████████▊ | 1033/1172 [00:00<00:00, 1714.56it/s]
100%|██████████| 1172/1172 [00:00<00:00, 1710.01it/s]
2026-02-23:22:26:46,421 INFO [evaluator.py:496] Running loglikelihood requests
Running loglikelihood requests: 0%| | 0/119655 [00:00<?, ?it/s]
Running loglikelihood requests: 0%| | 1/119655 [00:15<525:18:09, 15.80s/it]
Running loglikelihood requests: 0%| | 31/119655 [00:16<12:21:56, 2.69it/s]
Running loglikelihood requests: 0%| | 61/119655 [00:16<5:21:31, 6.20it/s]
Running loglikelihood requests: 0%| | 89/119655 [00:16<3:10:20, 10.47it/s]
Running loglikelihood requests: 0%| | 119/119655 [00:17<2:01:46, 16.36it/s]
Running loglikelihood requests: 0%| | 151/119655 [00:17<1:21:59, 24.29it/s]
Running loglikelihood requests: 0%| | 179/119655 [00:17<1:01:39, 32.30it/s]
Running loglikelihood requests: 0%| | 209/119655 [00:17<47:01, 42.33it/s]
Running loglikelihood requests: 0%| | 241/119655 [00:18<36:40, 54.27it/s]
Running loglikelihood requests: 0%| | 273/119655 [00:18<29:59, 66.34it/s]
Running loglikelihood requests: 0%| | 305/119655 [00:18<25:33, 77.82it/s]
Running loglikelihood requests: 0%| | 337/119655 [00:18<22:18, 89.13it/s]
Running loglikelihood requests: 0%| | 365/119655 [00:19<20:52, 95.26it/s]
Running loglikelihood requests: 0%| | 397/119655 [00:19<19:05, 104.09it/s]
Running loglikelihood requests: 0%| | 429/119655 [00:19<17:47, 111.70it/s]
Running loglikelihood requests: 0%| | 461/119655 [00:19<16:45, 118.56it/s]
Running loglikelihood requests: 0%| | 491/119655 [00:20<16:20, 121.59it/s]
Running loglikelihood requests: 0%| | 519/119655 [00:20<16:19, 121.57it/s]
Running loglikelihood requests: 0%| | 551/119655 [00:20<15:31, 127.87it/s]
Running loglikelihood requests: 0%| | 583/119655 [00:20<14:57, 132.64it/s]
Running loglikelihood requests: 1%| | 613/119655 [00:20<14:50, 133.64it/s]
Running loglikelihood requests: 1%| | 643/119655 [00:21<14:45, 134.48it/s]
Running loglikelihood requests: 1%| | 673/119655 [00:21<14:38, 135.51it/s]
Running loglikelihood requests: 1%| | 705/119655 [00:21<14:15, 139.01it/s]
Running loglikelihood requests: 1%| | 735/119655 [00:21<14:15, 139.00it/s]
Running loglikelihood requests: 1%| | 765/119655 [00:22<14:14, 139.21it/s]
Running loglikelihood requests: 1%| | 797/119655 [00:22<13:56, 142.12it/s]
Running loglikelihood requests: 1%| | 829/119655 [00:22<13:43, 144.32it/s]
Running loglikelihood requests: 1%| | 861/119655 [00:22<13:34, 145.93it/s]
Running loglikelihood requests: 1%| | 893/119655 [00:22<13:26, 147.17it/s]
Running loglikelihood requests: 1%| | 925/119655 [00:23<13:06, 150.93it/s]
Running loglikelihood requests: 1%| | 957/119655 [00:23<12:51, 153.78it/s]
Running loglikelihood requests: 1%| | 987/119655 [00:23<12:55, 152.99it/s]
Running loglikelihood requests: 1%| | 1017/119655 [00:23<12:58, 152.45it/s]
Running loglikelihood requests: 1%| | 1047/119655 [00:23<12:59, 152.13it/s]
Running loglikelihood requests: 1%| | 1079/119655 [00:24<12:44, 155.17it/s]
Running loglikelihood requests: 1%| | 1109/119655 [00:24<12:47, 154.44it/s]
Running loglikelihood requests: 1%| | 1139/119655 [00:24<12:49, 154.02it/s]
Running loglikelihood requests: 1%| | 1171/119655 [00:24<12:35, 156.74it/s]
Running loglikelihood requests: 1%| | 1199/119655 [00:24<12:55, 152.73it/s]
Running loglikelihood requests: 1%| | 1231/119655 [00:25<12:38, 156.23it/s]
Running loglikelihood requests: 1%| | 1263/119655 [00:25<12:25, 158.73it/s]
Running loglikelihood requests: 1%| | 1295/119655 [00:25<12:17, 160.58it/s]
Running loglikelihood requests: 1%| | 1327/119655 [00:25<12:10, 161.94it/s]
Running loglikelihood requests: 1%| | 1359/119655 [00:25<12:05, 162.96it/s]
Running loglikelihood requests: 1%| | 1387/119655 [00:26<12:30, 157.67it/s]
Running loglikelihood requests: 1%| |
2026-02-23:22:36:04,335 INFO [evaluator.py:496] Running generate_until requests
Passed argument batch_size = auto:1. Detecting largest batch size
Determined largest batch size: 8
Running generate_until requests: 0%| | 0/1319 [00:00<?, ?it/s]
Running generate_until requests: 0%| | 1/1319 [00:12<4:36:02, 12.57s/it]
Running generate_until requests: 0%| | 2/1319 [00:20<3:36:41, 9.87s/it]
Running generate_until requests: 0%| | 3/1319 [00:23<2:25:26, 6.63s/it]
Running generate_until requests: 0%| | 4/1319 [00:29<2:20:16, 6.40s/it]
Running generate_until requests: 0%| | 5/1319 [00:36<2:24:38, 6.60s/it]
Running generate_until requests: 0%| | 6/1319 [00:44<2:33:01, 6.99s/it]
Running generate_until requests: 1%| | 7/1319 [00:46<2:01:45, 5.57s/it]
Running generate_until requests: 1%| | 8/1319 [00:51<1:55:46, 5.30s/it]
Running generate_until requests: 1%| | 9/1319 [00:54<1:40:41, 4.61s/it]
Running generate_until requests: 1%| | 10/1319 [00:57<1:27:43, 4.02s/it]
Running generate_until requests: 1%| | 11/1319 [01:02<1:38:32, 4.52s/it]
Running generate_until requests: 1%| | 12/1319 [01:05<1:27:08, 4.00s/it]
Running generate_until requests: 1%| | 13/1319 [01:12<1:46:33, 4.90s/it]
Running generate_until requests: 1%| | 14/1319 [01:19<2:01:59, 5.61s/it]
Running generate_until requests: 1%| | 15/1319 [01:26<2:05:06, 5.76s/it]
Running generate_until requests: 1%| | 16/1319 [01:30<1:55:19, 5.31s/it]
Running generate_until requests: 1%|▏ | 17/1319 [01:33<1:42:36, 4.73s/it]
Running generate_until requests: 1%|▏ | 18/1319 [01:41<2:02:10, 5.63s/it]
Running generate_until requests: 1%|▏ | 19/1319 [01:44<1:42:43, 4.74s/it]
Running generate_until requests: 2%|▏ | 20/1319 [01:49<1:47:40, 4.97s/it]
Running generate_until requests: 2%|▏ | 21/1319 [01:52<1:34:11, 4.35s/it]
Running generate_until requests: 2%|▏ | 22/1319 [01:55<1:25:34, 3.96s/it]
Running generate_until requests: 2%|▏ | 23/1319 [02:00<1:33:10, 4.31s/it]
Running generate_until requests: 2%|▏ | 24/1319 [02:04<1:30:40, 4.20s/it]
Running generate_until requests: 2%|▏ | 25/1319 [02:07<1:21:23, 3.77s/it]
Running generate_until requests: 2%|▏ | 26/1319 [02:10<1:16:00, 3.53s/it]
Running generate_until requests: 2%|▏ | 27/1319 [02:15<1:24:03, 3.90s/it]
Running generate_until requests: 2%|▏ | 28/1319 [02:19<1:29:05, 4.14s/it]
Running generate_until requests: 2%|▏ | 29/1319 [02:25<1:38:56, 4.60s/it]
Running generate_until requests: 2%|▏ | 30/1319 [02:28<1:27:40, 4.08s/it]
Running generate_until requests: 2%|▏ | 31/1319 [02:36<1:51:45, 5.21s/it]
Running generate_until requests: 2%|▏ | 32/1319 [02:40<1:48:44, 5.07s/it]
Running generate_until requests: 3%|▎ | 33/1319 [02:48<2:06:39, 5.91s/it]
Running generate_until requests: 3%|▎ | 34/1319 [02:51<1:45:39, 4.93s/it]
Running generate_until requests: 3%|▎ | 35/1319 [02:55<1:38:12, 4.59s/it]
Running generate_until requests: 3%|▎ | 36/1319 [03:00<1:45:30, 4.93s/it]
Running generate_until requests: 3%|▎ | 37/1319 [03:06<1:50:03, 5.15s/it]
Running generate_until requests: 3%|▎ | 38/1319 [03:11<1:48:09, 5.07s/it]
Running generate_until requests: 3%|▎ | 39/1319 [03:15<1:41:15, 4.75s/it]
Running generate_until requests: 3%|▎ | 40/1319 [03:21<1:48:31, 5.09s/it]
Running generate_until requests: 3%|▎ | 41/1319 [03:25<1:39:59, 4.69s/it]
Running generate_until requests: 3%|▎ | 42/1319 [03:32<1:55:27, 5.43s/it]
Running generate_until requests: 3%|▎ | 43/1319 [03:37<1:56:02, 5.46s/it]
Running generate_until requests: 3%|▎ | 44/1319 [03:41<1:45:31, 4.97s/it]
Running generate_until requests: 3%|▎ | 45/1319 [03:43<1:27:40, 4.13s/it]
Running generate_until requests: 3%|▎ | 46/1319 [03:48<1:32:24, 4.36s/it]
Running generate_until requests: 4%|▎ | 47/1319
fatal: not a git repository (or any of the parent directories): .git
2026-02-24:00:06:21,470 INFO [evaluation_tracker.py:206] Saving results aggregated
Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 1
hf (pretrained=strykes/emberforge-3b-reasoner,trust_remote_code=True,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (8)
|                 Tasks                 |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|arc_challenge                          |      1|none            |     0|acc        |↑  |0.2875|±  |0.0132|
|                                       |       |none            |     0|acc_norm   |↑  |0.3174|±  |0.0136|
|boolq                                  |      2|none            |     0|acc        |↑  |0.7437|±  |0.0076|
|gsm8k                                  |      3|flexible-extract|     5|exact_match|↑  |0.6240|±  |0.0133|
|                                       |       |strict-match    |     5|exact_match|↑  |0.6202|±  |0.0134|
|hellaswag                              |      1|none            |     0|acc        |↑  |0.4292|±  |0.0049|
|                                       |       |none            |     0|acc_norm   |↑  |0.5607|±  |0.0050|
|mmlu                                   |      2|none            |      |acc        |↑  |0.5998|±  |0.0040|
| - humanities                          |      2|none            |      |acc        |↑  |0.5301|±  |0.0068|
|  - formal_logic                       |      1|none            |     0|acc        |↑  |0.5000|±  |0.0447|
|  - high_school_european_history       |      1|none            |     0|acc        |↑  |0.7697|±  |0.0329|
|  - high_school_us_history             |      1|none            |     0|acc        |↑  |0.7157|±  |0.0317|
|  - high_school_world_history          |      1|none            |     0|acc        |↑  |0.7975|±  |0.0262|
|  - international_law                  |      1|none            |     0|acc        |↑  |0.7851|±  |0.0375|
|  - jurisprudence                      |      1|none            |     0|acc        |↑  |0.7222|±  |0.0433|
|  - logical_fallacies                  |      1|none            |     0|acc        |↑  |0.6933|±  |0.0362|
|  - moral_disputes                     |      1|none            |     0|acc        |↑  |0.5954|±  |0.0264|
|  - moral_scenarios                    |      1|none            |     0|acc        |↑  |0.2447|±  |0.0144|
|  - philosophy                         |      1|none            |     0|acc        |↑  |0.6559|±  |0.0270|
|  - prehistory                         |      1|none            |     0|acc        |↑  |0.6265|±  |0.0269|
|  - professional_law                   |      1|none            |     0|acc        |↑  |0.4746|±  |0.0128|
|  - world_religions                    |      1|none            |     0|acc        |↑  |0.7193|±  |0.0345|
| - other                               |      2|none            |      |acc        |↑  |0.6270|±  |0.0085|
|  - business_ethics                    |      1|none            |     0|acc        |↑  |0.6200|±  |0.0488|
|  - clinical_knowledge                 |      1|none            |     0|acc        |↑  |0.6415|±  |0.0295|
|  - college_medicine                   |      1|none            |     0|acc        |↑  |0.5954|±  |0.0374|
|  - global_facts                       |      1|none            |     0|acc        |↑  |0.3300|±  |0.0473|
|  - human_aging                        |      1|none            |     0|acc        |↑  |0.6009|±  |0.0329|
|  - management                         |      1|none            |     0|acc        |↑  |0.6893|±  |0.0458|
|  - marketing                          |      1|none            |     0|acc        |↑  |0.8034|±  |0.0260|
|  - medical_genetics                   |      1|none            |     0|acc        |↑  |0.6900|±  |0.0465|
|  - miscellaneous                      |      1|none            |     0|acc        |↑  |0.6718|±  |0.0168|
|  - nutrition                          |      1|none            |     0|acc        |↑  |0.6765|±  |0.0268|
|  - professional_accounting            |      1|none            |     0|acc        |↑  |0.4397|±  |0.0296|
|  - professional_medicine              |      1|none            |     0|acc        |↑  |0.6838|±  |0.0282|
|  - virology                           |      1|none            |     0|acc        |↑  |0.4518|±  |0.0387|
| - social sciences                     |      2|none            |      |acc        |↑  |0.6906|±  |0.0081|
|  - econometrics                       |      1|none            |     0|acc        |↑  |0.3596|±  |0.0451|
|  - high_school_geography              |      1|none            |     0|acc        |↑  |0.7273|±  |0.0317|
|  - high_school_government_and_politics|      1|none            |     0|acc        |↑  |0.7461|±  |0.0314|
|  - high_school_macroeconomics         |      1|none            |     0|acc        |↑  |0.6436|±  |0.0243|
|  - high_school_microeconomics         |      1|none            |     0|acc        |↑  |0.7773|±  |0.0270|
|  - high_school_psychology             |      1|none            |     0|acc        |↑  |0.8000|±  |0.0171|
|  - human_sexuality                    |      1|none            |     0|acc        |↑  |0.6947|±  |0.0404|
|  - professional_psychology            |      1|none            |     0|acc        |↑  |0.5915|±  |0.0199|
|  - public_relations                   |      1|none            |     0|acc        |↑  |0.6000|±  |0.0469|
|  - security_studies                   |      1|none            |     0|acc        |↑  |0.7020|±  |0.0293|
|  - sociology                          |      1|none            |     0|acc        |↑  |0.7711|±  |0.0297|
|  - us_foreign_policy                  |      1|none            |     0|acc        |↑  |0.7800|±  |0.0416|
| - stem                                |      2|none            |      |acc        |↑  |0.5883|±  |0.0086|
|  - abstract_algebra                   |      1|none            |     0|acc        |↑  |0.4300|±  |0.0498|
|  - anatomy                            |      1|none            |     0|acc        |↑  |0.6074|±  |0.0422|
|  - astronomy                          |      1|none            |     0|acc        |↑  |0.6974|±  |0.0374|
|  - college_biology                    |      1|none            |     0|acc        |↑  |0.8264|±  |0.0317|
|  - college_chemistry                  |      1|none            |     0|acc        |↑  |0.5300|±  |0.0502|
|  - college_computer_science           |      1|none            |     0|acc        |↑  |0.5400|±  |0.0501|
|  - college_mathematics                |      1|none            |     0|acc        |↑  |0.5000|±  |0.0503|
|  - college_physics                    |      1|none            |     0|acc        |↑  |0.5000|±  |0.0498|
|  - computer_security                  |      1|none            |     0|acc        |↑  |0.6800|±  |0.0469|
|  - conceptual_physics                 |      1|none            |     0|acc        |↑  |0.5872|±  |0.0322|
|  - electrical_engineering             |      1|none            |     0|acc        |↑  |0.6414|±  |0.0400|
|  - elementary_mathematics             |      1|none            |     0|acc        |↑  |0.5317|±  |0.0257|
|  - high_school_biology                |      1|none            |     0|acc        |↑  |0.7548|±  |0.0245|
|  - high_school_chemistry              |      1|none            |     0|acc        |↑  |0.6010|±  |0.0345|
|  - high_school_computer_science       |      1|none            |     0|acc        |↑  |0.6900|±  |0.0465|
|  - high_school_mathematics            |      1|none            |     0|acc        |↑  |0.4556|±  |0.0304|
|  - high_school_physics                |      1|none            |     0|acc        |↑  |0.5166|±  |0.0408|
|  - high_school_statistics             |      1|none            |     0|acc        |↑  |0.5694|±  |0.0338|
|  - machine_learning                   |      1|none            |     0|acc        |↑  |0.4286|±  |0.0470|
|piqa                                   |      1|none            |     0|acc        |↑  |0.6328|±  |0.0112|
|                                       |       |none            |     0|acc_norm   |↑  |0.6322|±  |0.0113|
|truthfulqa_mc2                         |      2|none            |     0|acc        |↑  |0.4534|±  |0.0160|
|winogrande                             |      1|none            |     0|acc        |↑  |0.5004|±  |0.0141|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.5998|±  |0.0040|
| - humanities     |      2|none  |      |acc   |↑  |0.5301|±  |0.0068|
| - other          |      2|none  |      |acc   |↑  |0.6270|±  |0.0085|
| - social sciences|      2|none  |      |acc   |↑  |0.6906|±  |0.0081|
| - stem           |      2|none  |      |acc   |↑  |0.5883|±  |0.0086|