emberforge-3b-reasoner/benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md

# Emberforge 3B Benchmark Comparison (Public + Local)

Generated: 2026-02-24

## 1) Your Fine-Tuned Model (local lm-eval run)
Model: `strykes/emberforge-3b-reasoner`

| Task | Metric | Score |
|---|---:|---:|
| mmlu | acc,none | 59.98% |
| gsm8k | exact_match,flexible-extract | 62.40% |
| arc_challenge | acc_norm,none | 31.74% |
| hellaswag | acc_norm,none | 56.07% |
| winogrande | acc,none | 50.04% |
| piqa | acc_norm,none | 63.22% |
| boolq | acc,none | 74.37% |
| truthfulqa_mc2 | acc,none | 45.34% |
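
The table above can be regenerated from the lm-eval results file. The following is a minimal sketch, assuming the JSON follows the usual `lm-evaluation-harness` layout (a top-level `results` object keyed by task, with metric keys such as `acc,none`). The helper names `extract_scores` and `to_markdown` are hypothetical, and the sample payload is a placeholder, not data from the run.

```python
import json
from pathlib import Path

# (task, metric key) pairs in the order used by the table above.
REPORTED = [
    ("mmlu", "acc,none"),
    ("gsm8k", "exact_match,flexible-extract"),
    ("arc_challenge", "acc_norm,none"),
    ("hellaswag", "acc_norm,none"),
    ("winogrande", "acc,none"),
    ("piqa", "acc_norm,none"),
    ("boolq", "acc,none"),
    ("truthfulqa_mc2", "acc,none"),
]


def extract_scores(results, wanted=REPORTED):
    """Return (task, metric, 'xx.xx%') rows; missing entries become 'n/a'."""
    task_results = results.get("results", {})  # assumed top-level key
    rows = []
    for task, metric in wanted:
        value = task_results.get(task, {}).get(metric)
        if isinstance(value, (int, float)):
            score = f"{value * 100:.2f}%"  # fraction -> percentage
        else:
            score = "n/a"
        rows.append((task, metric, score))
    return rows


def to_markdown(rows):
    """Render rows in the same three-column layout as the table above."""
    lines = ["| Task | Metric | Score |", "|---|---:|---:|"]
    lines += [f"| {task} | {metric} | {score} |" for task, metric, score in rows]
    return "\n".join(lines)


def load_results(path):
    """Read a results JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder payload for illustration only (not values from the actual run).
sample = {"results": {"mmlu": {"acc,none": 0.25}}}
print(to_markdown(extract_scores(sample)))
```

To use it on the real artifact, pass the path listed under Sources to `load_results` and feed the result to `extract_scores`.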

## 2) Public Base Model (Nanbeige4.1-3B)
Model: `Nanbeige/Nanbeige4.1-3B` (author-reported benchmarks)

| Benchmark | Published Score |
|---|---:|
| Live-Code-Bench-V6 | 76.90% |
| AIME 2026 I | 87.40% |
| HMMT Nov | 77.92% |
| GPQA | 83.80% |
| HLE (Text-only) | 12.60% |
| Arena-Hard-v2 | 73.20% |
| BFCL-V4 | 56.50% |
| Tau2-Bench | 48.57% |

Note: Nanbeige's published benchmarks do not overlap with your lm-eval task set (`mmlu`, `gsm8k`, `arc_challenge`, `hellaswag`, `winogrande`, `piqa`, `boolq`, `truthfulqa_mc2`), so an apples-to-apples delta cannot be computed without rerunning the same tasks on the base model.
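
One way to get comparable numbers is to run the same task list against the base model. The sketch below only assembles an `lm_eval` command line; it does not run it. It assumes the `lm_eval` CLI from `lm-evaluation-harness` with its `hf` model backend. The helper `build_lm_eval_command` and the output directory are hypothetical, and the few-shot and batch settings of the original local run are not recorded in this report, so they would need to be matched by hand. The base model may also need extra `--model_args` not shown here.

```python
import shlex

# Task list taken from Section 1 of this report.
TASKS = [
    "mmlu", "gsm8k", "arc_challenge", "hellaswag",
    "winogrande", "piqa", "boolq", "truthfulqa_mc2",
]


def build_lm_eval_command(model_id, tasks=TASKS, output_path="evals/base_model",
                          batch_size="auto", num_fewshot=None):
    """Assemble an lm_eval argv list for a Hugging Face model.

    output_path is a hypothetical location; num_fewshot is left unset
    unless the caller knows the value used in the original run.
    """
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd


cmd = build_lm_eval_command("Nanbeige/Nanbeige4.1-3B")
print(shlex.join(cmd))  # copy-pasteable shell command
```

Building the argument list separately keeps the two runs easy to diff: the same function called with `strykes/emberforge-3b-reasoner` should differ only in the `pretrained=` value.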

## 3) Public Frontier Reference (Claude / GPT / Gemini) on overlapping classic tasks
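Source benchmark table: Anthropic Claude 3 model card (March 2024). The "Your model" column repeats the local lm-eval scores from Section 1. The few-shot counts in the row labels are those reported in the model card, and the local run may have used different settings.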

| Benchmark | Your model | Claude 3 Opus | Claude 3 Sonnet | GPT-4 | Gemini 1.0 Ultra | Gemini 1.5 Pro |
|---|---:|---:|---:|---:|---:|---:|
| MMLU (5-shot) | 59.98% | 86.80% | 79.00% | 86.40% | 83.70% | 81.90% |
| GSM8K | 62.40% | 95.00% | 92.30% | 92.00% | 94.40% | 91.70% |
| ARC-Challenge (25-shot) | 31.74% | 96.40% | 93.20% | 96.30% | — | — |
| HellaSwag (10-shot) | 56.07% | 95.40% | 89.00% | 95.30% | 87.80% | 92.50% |
| WinoGrande (5-shot) | 50.04% | 88.50% | 75.10% | 87.50% | — | — |
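
The gaps in this table can be read off as percentage-point differences. The snippet below is a small sketch of that arithmetic using the scores from the table; the helper `gap_points` is a hypothetical name, and only two reference columns are included to keep it short. Missing reference scores are represented as `None`.

```python
# Local scores from the "Your model" column (percent).
LOCAL = {
    "MMLU": 59.98,
    "GSM8K": 62.40,
    "ARC-Challenge": 31.74,
    "HellaSwag": 56.07,
    "WinoGrande": 50.04,
}

# Two reference columns from the table above; None marks a missing entry.
REFERENCE = {
    "Claude 3 Opus": {
        "MMLU": 86.80, "GSM8K": 95.00, "ARC-Challenge": 96.40,
        "HellaSwag": 95.40, "WinoGrande": 88.50,
    },
    "Gemini 1.0 Ultra": {
        "MMLU": 83.70, "GSM8K": 94.40, "ARC-Challenge": None,
        "HellaSwag": 87.80, "WinoGrande": None,
    },
}


def gap_points(local, reference):
    """Reference minus local, in percentage points (None if no reference)."""
    gaps = {}
    for benchmark, score in local.items():
        ref = reference.get(benchmark)
        gaps[benchmark] = None if ref is None else round(ref - score, 2)
    return gaps


for model, scores in REFERENCE.items():
    print(model, gap_points(LOCAL, scores))
```

These differences inherit every caveat of the underlying numbers: they mix a local harness run with vendor-reported results, so they indicate rough distance rather than a controlled comparison.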

## 4) Latest Frontier Snapshot (2025-2026, non-overlapping tasks)
Source benchmark table: Claude Opus 4.5 system card, Table 2.3.A.

| Benchmark | Claude Opus 4.5 | Claude Sonnet 4.5 | Claude Opus 4.1 | Gemini 3 Pro | GPT-5.1 |
|---|---:|---:|---:|---:|---:|
| SWE-bench Verified | 80.9% | 77.2% | 74.5% | 76.2% | 76.3% |
| Terminal-bench 2.0 | 59.3% | 50.0% | 46.5% | 54.2% | 47.6% |
| ARC-AGI-2 (Verified) | 37.6% | 13.6% | — | 31.1% | 17.6% |
| GPQA Diamond | 87.0% | 83.4% | 81.0% | 91.9% | 88.1% |
| MMMU (validation) | 80.7% | 77.8% | 77.1% | — | 85.4% |
| MMMLU | 90.8% | 89.1% | 89.5% | 91.8% | 91.0% |

Note: These references are newer, but none of these benchmarks appear in your current lm-eval task set, so the scores are not directly comparable.

## 5) Caveats
- Your run uses `lm-evaluation-harness` with specific settings; public model-card numbers may use different prompts, few-shot counts, decoding, or evaluation code.
- Frontier references in Section 3 predate the current (2025-2026) model generations, but they are official primary-source numbers on overlapping classic benchmarks.
- Frontier references in Section 4 are current (2025-2026) but mostly on different benchmarks.

## Sources
- Local run artifact: `/workspace/evals/main_results_v3.json/strykes__emberforge-3b-reasoner/results_2026-02-24T00-06-21.474293.json`
- Nanbeige model card: https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- Anthropic Claude 3 model card (benchmarks table): https://www-cdn.anthropic.com/c6a80a657af445f40e31afac050f3bf76d3b1404.pdf
- Anthropic model cards index: https://www.anthropic.com/system-cards
- Anthropic Claude Opus 4.5 system card: https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf