- dataset:
    id: openai/gsm8k
    task_id: gsm8k
    config: main
    split: test
  value: 0.095527
  date: "2026-05-09"
  notes: "greedy, no-tools, local eval"