# Emberforge 3B Benchmark Comparison (Public + Local)

Generated: 2026-02-24

## 1) Your Finetuned Model (local lm-eval run)

Model: `strykes/emberforge-3b-reasoner`

| Task | Metric | Score |
|---|---:|---:|
| mmlu | acc,none | 59.98% |
| gsm8k | exact_match,flexible-extract | 62.40% |
| arc_challenge | acc_norm,none | 31.74% |
| hellaswag | acc_norm,none | 56.07% |
| winogrande | acc,none | 50.04% |
| piqa | acc_norm,none | 63.22% |
| boolq | acc,none | 74.37% |
| truthfulqa_mc2 | acc,none | 45.34% |

## 2) Public Base Model (Nanbeige4.1-3B)

Model: `Nanbeige/Nanbeige4.1-3B` (author-reported benchmarks)

| Benchmark | Published Score |
|---|---:|
| Live-Code-Bench-V6 | 76.90% |
| AIME 2026 I | 87.40% |
| HMMT Nov | 77.92% |
| GPQA | 83.80% |
| HLE (Text-only) | 12.60% |
| Arena-Hard-v2 | 73.20% |
| BFCL-V4 | 56.50% |
| Tau2-Bench | 48.57% |

Note: Nanbeige published benchmarks do not overlap directly with your lm-eval task set (`mmlu`, `gsm8k`, `arc_challenge`, etc.), so no exact apples-to-apples delta can be computed without rerunning identical tasks.

## 3) Public Frontier Reference (Claude / GPT / Gemini) on overlapping classic tasks

Source benchmark table: Anthropic Claude 3 model card (March 2024).

| Benchmark | Your model | Claude 3 Opus | Claude 3 Sonnet | GPT-4 | Gemini 1.0 Ultra | Gemini 1.5 Pro |
|---|---:|---:|---:|---:|---:|---:|
| MMLU (5-shot) | 59.98% | 86.80% | 79.00% | 86.40% | 83.70% | 81.90% |
| GSM8K | 62.40% | 95.00% | 92.30% | 92.00% | 94.40% | 91.70% |
| ARC-Challenge (25-shot) | 31.74% | 96.40% | 93.20% | 96.30% | n/a | n/a |
| HellaSwag (10-shot) | 56.07% | 95.40% | 89.00% | 95.30% | 87.80% | 92.50% |
| WinoGrande (5-shot) | 50.04% | 88.50% | 75.10% | 87.50% | n/a | n/a |

## 4) Latest Frontier Snapshot (2025-2026, non-overlapping tasks)

Source benchmark table: Claude Opus 4.5 system card, Table 2.3.A.

| Benchmark | Claude Opus 4.5 | Claude Sonnet 4.5 | Claude Opus 4.1 | Gemini 3 Pro | GPT-5.1 |
|---|---:|---:|---:|---:|---:|
| SWE-bench Verified | 80.9% | 77.2% | 74.5% | 76.2% | 76.3% |
| Terminal-bench 2.0 | 59.3% | 50.0% | 46.5% | 54.2% | 47.6% |
| ARC-AGI-2 (Verified) | 37.6% | 13.6% | n/a | 31.1% | 17.6% |
| GPQA Diamond | 87.0% | 83.4% | 81.0% | 91.9% | 88.1% |
| MMMU (validation) | 80.7% | 77.8% | 77.1% | n/a | 85.4% |
| MMMLU | 90.8% | 89.1% | 89.5% | 91.8% | 91.0% |

Note: These are newer references but still not directly comparable to your current lm-eval task set.

## 5) Caveats

- Your run uses `lm-evaluation-harness` with specific settings; public model-card numbers may use different prompts, few-shot counts, decoding, or evaluation code.
- Frontier references in Section 3 are older than current 2026 generations but are official primary-source numbers on overlapping classic benchmarks.
- Frontier references in Section 4 are current (2025-2026) but mostly on different benchmarks.

## Sources

- Local run artifact: `/workspace/evals/main_results_v3.json/strykes__emberforge-3b-reasoner/results_2026-02-24T00-06-21.474293.json`
- Nanbeige model card: https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- Anthropic Claude 3 model card (benchmarks table): https://www-cdn.anthropic.com/c6a80a657af445f40e31afac050f3bf76d3b1404.pdf
- Anthropic model cards index: https://www.anthropic.com/system-cards
- Anthropic Claude Opus 4.5 system card: https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf
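
## Appendix: Section 3 gaps in percentage points

Section 3 lists your scores next to the published frontier scores but does not state the differences. The sketch below is one way to derive them, assuming the scores are copied by hand from that table. The names `LOCAL`, `FRONTIER`, and `gaps_in_points` are illustrative and not part of `lm-evaluation-harness`. Because of the evaluation-setting differences noted in Section 5, treat the results as rough indicators rather than exact deltas.

```python
# Scores in percent, copied from the Section 3 table.
# None marks a value that the table does not report.
LOCAL = {
    "MMLU": 59.98,
    "GSM8K": 62.40,
    "ARC-Challenge": 31.74,
    "HellaSwag": 56.07,
    "WinoGrande": 50.04,
}

FRONTIER = {
    "Claude 3 Opus": {"MMLU": 86.80, "GSM8K": 95.00, "ARC-Challenge": 96.40,
                      "HellaSwag": 95.40, "WinoGrande": 88.50},
    "Claude 3 Sonnet": {"MMLU": 79.00, "GSM8K": 92.30, "ARC-Challenge": 93.20,
                        "HellaSwag": 89.00, "WinoGrande": 75.10},
    "GPT-4": {"MMLU": 86.40, "GSM8K": 92.00, "ARC-Challenge": 96.30,
              "HellaSwag": 95.30, "WinoGrande": 87.50},
    "Gemini 1.0 Ultra": {"MMLU": 83.70, "GSM8K": 94.40, "ARC-Challenge": None,
                         "HellaSwag": 87.80, "WinoGrande": None},
    "Gemini 1.5 Pro": {"MMLU": 81.90, "GSM8K": 91.70, "ARC-Challenge": None,
                       "HellaSwag": 92.50, "WinoGrande": None},
}


def gaps_in_points(local, reference):
    """Return (reference - local) per benchmark, in percentage points.

    Benchmarks with no reference score map to None instead of a number.
    """
    gaps = {}
    for task, local_score in local.items():
        ref_score = reference.get(task)
        if ref_score is None:
            gaps[task] = None
        else:
            gaps[task] = round(ref_score - local_score, 2)
    return gaps


if __name__ == "__main__":
    for model, scores in FRONTIER.items():
        print(model)
        for task, gap in gaps_in_points(LOCAL, scores).items():
            shown = "n/a" if gap is None else f"{gap:+.2f} pts"
            print(f"  {task}: {shown}")
```

Reporting the gap in percentage points rather than as a ratio keeps it directly readable against the table, and returning `None` for unreported cells avoids treating a missing score as zero.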