Emberforge 3B Benchmark Comparison (Public + Local)

Generated: 2026-02-24

1) Your Finetuned Model (local lm-eval run)

Model: strykes/emberforge-3b-reasoner

| Task | Metric | Score |
|---|---|---|
| mmlu | acc,none | 59.98% |
| gsm8k | exact_match,flexible-extract | 62.40% |
| arc_challenge | acc_norm,none | 31.74% |
| hellaswag | acc_norm,none | 56.07% |
| winogrande | acc,none | 50.04% |
| piqa | acc_norm,none | 63.22% |
| boolq | acc,none | 74.37% |
| truthfulqa_mc2 | acc,none | 45.34% |
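The Metric column uses lm-evaluation-harness result keys of the form `metric,filter` (for example `acc,none`). As a minimal sketch, assuming a results payload shaped like the harness's JSON output, the rows above could be derived as follows. `SAMPLE_RESULTS` is a hand-built stand-in populated from this table, not the actual output file, and `rows_from_results` is a hypothetical helper name.

```python
# Hand-built stand-in for the "results" section of an lm-eval JSON output.
# Scores are the values from the table above, written as fractions.
SAMPLE_RESULTS = {
    "results": {
        "mmlu": {"alias": "mmlu", "acc,none": 0.5998},
        "gsm8k": {"exact_match,flexible-extract": 0.6240},
        "arc_challenge": {"acc_norm,none": 0.3174},
        "hellaswag": {"acc_norm,none": 0.5607},
        "winogrande": {"acc,none": 0.5004},
        "piqa": {"acc_norm,none": 0.6322},
        "boolq": {"acc,none": 0.7437},
        "truthfulqa_mc2": {"acc,none": 0.4534},
    }
}


def rows_from_results(payload):
    """Return (task, metric, filter, score) tuples, one per numeric metric."""
    rows = []
    for task, metrics in payload["results"].items():
        for key, value in metrics.items():
            # Skip non-numeric entries such as the "alias" string.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            # Keys look like "acc,none": metric name, then filter name.
            metric, _, filter_name = key.partition(",")
            # Standard-error entries are reported separately; leave them out.
            if metric.endswith("_stderr"):
                continue
            rows.append((task, metric, filter_name, f"{value * 100:.2f}%"))
    return rows


if __name__ == "__main__":
    for task, metric, filter_name, score in rows_from_results(SAMPLE_RESULTS):
        print(f"{task} {metric},{filter_name} {score}")
```

A real output file can carry several metrics per task, so a report like this one would still need to pick which metric and filter to show for each task.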

2) Public Base Model (Nanbeige4.1-3B)

Model: Nanbeige/Nanbeige4.1-3B (author-reported benchmarks)

| Benchmark | Published Score |
|---|---|
| Live-Code-Bench-V6 | 76.90% |
| AIME 2026 I | 87.40% |
| HMMT Nov | 77.92% |
| GPQA | 83.80% |
| HLE (Text-only) | 12.60% |
| Arena-Hard-v2 | 73.20% |
| BFCL-V4 | 56.50% |
| Tau2-Bench | 48.57% |

Note: Nanbeige's published benchmarks do not overlap with your lm-eval task set (mmlu, gsm8k, arc_challenge, etc.), so no apples-to-apples delta against the base model can be computed without rerunning the same tasks on it.

3) Public Frontier Reference (Claude / GPT / Gemini) on overlapping classic tasks

Source benchmark table: Anthropic Claude 3 model card (March 2024).

| Benchmark | Your model | Claude 3 Opus | Claude 3 Sonnet | GPT-4 | Gemini 1.0 Ultra | Gemini 1.5 Pro |
|---|---|---|---|---|---|---|
| MMLU (5-shot) | 59.98% | 86.80% | 79.00% | 86.40% | 83.70% | 81.90% |
| GSM8K | 62.40% | 95.00% | 92.30% | 92.00% | 94.40% | 91.70% |
| ARC-Challenge (25-shot) | 31.74% | 96.40% | 93.20% | 96.30% | - | - |
| HellaSwag (10-shot) | 56.07% | 95.40% | 89.00% | 95.30% | 87.80% | 92.50% |
| WinoGrande (5-shot) | 50.04% | 88.50% | 75.10% | 87.50% | - | - |
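To make the gaps in this table concrete, the sketch below computes percentage-point differences between your model and two of the reference columns. The scores are copied from the table. The helper name `gap_in_points` is hypothetical, and the caveat about differing prompts and evaluation code still applies, so treat the output as indicative rather than as a like-for-like measurement.

```python
# Scores (in percent) copied from the Section 3 table.
YOUR_MODEL = {
    "MMLU": 59.98,
    "GSM8K": 62.40,
    "ARC-Challenge": 31.74,
    "HellaSwag": 56.07,
    "WinoGrande": 50.04,
}

REFERENCES = {
    "Claude 3 Opus": {
        "MMLU": 86.80,
        "GSM8K": 95.00,
        "ARC-Challenge": 96.40,
        "HellaSwag": 95.40,
        "WinoGrande": 88.50,
    },
    "Claude 3 Sonnet": {
        "MMLU": 79.00,
        "GSM8K": 92.30,
        "ARC-Challenge": 93.20,
        "HellaSwag": 89.00,
        "WinoGrande": 75.10,
    },
}


def gap_in_points(yours, reference):
    """Return reference minus yours, in percentage points, per shared benchmark."""
    gaps = {}
    for benchmark, own_score in yours.items():
        ref_score = reference.get(benchmark)
        # Skip benchmarks the reference model has no published score for.
        if ref_score is None:
            continue
        gaps[benchmark] = round(ref_score - own_score, 2)
    return gaps


if __name__ == "__main__":
    for model_name, scores in REFERENCES.items():
        print(model_name)
        for benchmark, gap in gap_in_points(YOUR_MODEL, scores).items():
            print(f"  {benchmark}: {gap:+.2f} points")
```

Reporting the difference in percentage points, rather than as a ratio, keeps the numbers directly readable against the table.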

4) Latest Frontier Snapshot (2025-2026, non-overlapping tasks)

Source benchmark table: Claude Opus 4.5 system card, Table 2.3.A.

| Benchmark | Claude Opus 4.5 | Claude Sonnet 4.5 | Claude Opus 4.1 | Gemini 3 Pro | GPT-5.1 |
|---|---|---|---|---|---|
| SWE-bench Verified | 80.9% | 77.2% | 74.5% | 76.2% | 76.3% |
| Terminal-bench 2.0 | 59.3% | 50.0% | 46.5% | 54.2% | 47.6% |
| ARC-AGI-2 (Verified) | 37.6% | 13.6% | - | 31.1% | 17.6% |
| GPQA Diamond | 87.0% | 83.4% | 81.0% | 91.9% | 88.1% |
| MMMU (validation) | 80.7% | 77.8% | 77.1% | - | 85.4% |
| MMMLU | 90.8% | 89.1% | 89.5% | 91.8% | 91.0% |

Note: These references are newer, but they cover different benchmarks from your current lm-eval task set, so they are still not directly comparable.

5) Caveats

  • Your run uses lm-evaluation-harness with its own specific settings; public model-card numbers may use different prompts, few-shot counts, decoding parameters, or evaluation code.
  • Frontier references in Section 3 predate the current 2026 model generation, but they are official primary-source numbers on overlapping classic benchmarks.
  • Frontier references in Section 4 are current (2025-2026), but mostly cover different benchmarks.
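One way to narrow the few-shot mismatch noted above is to rerun the overlapping tasks with the shot counts listed in the Section 3 table. The sketch below only builds the `lm_eval` command lines; it does not run them. The output directory `results` and the helper name `build_command` are assumptions, and this document does not state which few-shot settings the original local run used. Matching the shot count does not guarantee identical prompts or scoring.

```python
import shlex

MODEL_ID = "strykes/emberforge-3b-reasoner"

# Few-shot counts as listed in the Section 3 table headers.
# GSM8K is omitted because the table does not give a shot count for it.
FEWSHOT_BY_TASK = {
    "mmlu": 5,
    "arc_challenge": 25,
    "hellaswag": 10,
    "winogrande": 5,
}


def build_command(task, shots, output_dir="results"):
    """Build one lm_eval CLI invocation for a task at a given shot count."""
    args = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={MODEL_ID}",
        "--tasks", task,
        "--num_fewshot", str(shots),
        # Hypothetical output location; adjust to your own layout.
        "--output_path", f"{output_dir}/{task}_{shots}shot",
    ]
    return shlex.join(args)


COMMANDS = [build_command(task, shots) for task, shots in FEWSHOT_BY_TASK.items()]

if __name__ == "__main__":
    for command in COMMANDS:
        print(command)
```

Running each task as a separate invocation keeps one shot count per run and writes each result to its own path, which makes it easier to see which setting produced which score.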

Sources