Emberforge 3B Benchmark Comparison (Public + Local)

Generated: 2026-02-24

1) Your Fine-tuned Model (local lm-eval run)

Model: strykes/emberforge-3b-reasoner

Task	Metric	Score
mmlu	acc,none	59.98%
gsm8k	exact_match,flexible-extract	62.40%
arc_challenge	acc_norm,none	31.74%
hellaswag	acc_norm,none	56.07%
winogrande	acc,none	50.04%
piqa	acc_norm,none	63.22%
boolq	acc,none	74.37%
truthfulqa_mc2	acc,none	45.34%
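
The metric names in the table (for example `acc,none` and `exact_match,flexible-extract`) are the keys that lm-evaluation-harness writes under each task in its results JSON. The following is a minimal sketch of pulling the table out of such a file, assuming the usual top-level `results` object keyed by task name with scores stored as fractions. `TABLE_METRICS`, `extract_scores`, and the inlined `sample` dict are illustrative names and stand-in data, not part of the harness or the actual artifact.

```python
import json

# Task -> metric key, copied from the table above.
TABLE_METRICS = {
    "mmlu": "acc,none",
    "gsm8k": "exact_match,flexible-extract",
    "arc_challenge": "acc_norm,none",
    "hellaswag": "acc_norm,none",
    "winogrande": "acc,none",
    "piqa": "acc_norm,none",
    "boolq": "acc,none",
    "truthfulqa_mc2": "acc,none",
}


def extract_scores(results_json: dict, metrics: dict) -> dict:
    """Return {task: score in percent} for the chosen metric of each task."""
    scores = {}
    for task, metric in metrics.items():
        task_results = results_json.get("results", {}).get(task)
        if task_results is None or metric not in task_results:
            continue  # task or metric absent from this run; skip it
        scores[task] = round(task_results[metric] * 100, 2)
    return scores


# Stand-in for json.load(open(<results file>)): two rows of the table,
# written as fractions the way the assumed JSON layout stores them.
sample = json.loads(
    '{"results": {"mmlu": {"acc,none": 0.5998},'
    ' "gsm8k": {"exact_match,flexible-extract": 0.6240}}}'
)

scores = extract_scores(sample, TABLE_METRICS)
for task, pct in scores.items():
    print(f"{task}\t{TABLE_METRICS[task]}\t{pct:.2f}%")
```

Tasks missing from the file are skipped rather than raising, so the same helper can be pointed at a partial run.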

2) Public Base Model (Nanbeige4.1-3B)

Model: Nanbeige/Nanbeige4.1-3B (author-reported benchmarks)

Benchmark	Published Score
Live-Code-Bench-V6	76.90%
AIME 2026 I	87.40%
HMMT Nov	77.92%
GPQA	83.80%
HLE (Text-only)	12.60%
Arena-Hard-v2	73.20%
BFCL-V4	56.50%
Tau2-Bench	48.57%

Note: Nanbeige's published benchmarks do not overlap with your lm-eval task set (mmlu, gsm8k, arc_challenge, etc.), so an apples-to-apples delta against the base model cannot be computed without rerunning the same tasks on it.
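
One way to get that comparison is to run the base model through the same harness with the same task list. The sketch below only builds the two command lines; it assumes the `lm_eval` command-line entry point with its `--model`, `--model_args`, `--tasks`, and `--output_path` options. The output directories are hypothetical, and settings the report does not state (few-shot counts, batch size, dtype, any model-specific loading arguments) are deliberately left out and would need to match the original run.

```python
import shlex

# Task list taken from the local run in Section 1.
TASKS = [
    "mmlu", "gsm8k", "arc_challenge", "hellaswag",
    "winogrande", "piqa", "boolq", "truthfulqa_mc2",
]


def build_lm_eval_command(model_id: str, output_path: str) -> list:
    """Build an lm_eval invocation for one Hugging Face model id."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(TASKS),
        "--output_path", output_path,  # hypothetical location
    ]


# Identical arguments for both models, so only the weights differ.
base_cmd = build_lm_eval_command("Nanbeige/Nanbeige4.1-3B", "evals/base")
tuned_cmd = build_lm_eval_command("strykes/emberforge-3b-reasoner", "evals/tuned")

print(shlex.join(base_cmd))
print(shlex.join(tuned_cmd))
```

Generating both commands from one function keeps the task list and flags in a single place, which is the point of an identical-settings rerun.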

3) Public Frontier Reference (Claude / GPT / Gemini) on overlapping classic tasks

Source benchmark table: Anthropic Claude 3 model card (March 2024).

Benchmark	Your model	Claude 3 Opus	Claude 3 Sonnet	GPT-4	Gemini 1.0 Ultra	Gemini 1.5 Pro
MMLU (5-shot)	59.98%	86.80%	79.00%	86.40%	83.70%	81.90%
GSM8K	62.40%	95.00%	92.30%	92.00%	94.40%	91.70%
ARC-Challenge (25-shot)	31.74%	96.40%	93.20%	96.30%	—	—
HellaSwag (10-shot)	56.07%	95.40%	89.00%	95.30%	87.80%	92.50%
WinoGrande (5-shot)	50.04%	88.50%	75.10%	87.50%	—	—
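
The gaps implied by this table can be computed directly in percentage points. The sketch below copies three of the reference columns from the table and treats the missing cells as `None`; the names `YOUR_MODEL`, `REFERENCES`, and `gap_points` are illustrative. Because the local run and the model-card numbers may use different prompts and few-shot counts, the results are indicative only.

```python
from typing import Dict, Optional

# Scores in percent, copied from the table above.
YOUR_MODEL: Dict[str, float] = {
    "MMLU": 59.98, "GSM8K": 62.40, "ARC-Challenge": 31.74,
    "HellaSwag": 56.07, "WinoGrande": 50.04,
}

# None marks a cell the source table leaves empty.
REFERENCES: Dict[str, Dict[str, Optional[float]]] = {
    "Claude 3 Opus": {"MMLU": 86.80, "GSM8K": 95.00, "ARC-Challenge": 96.40,
                      "HellaSwag": 95.40, "WinoGrande": 88.50},
    "GPT-4": {"MMLU": 86.40, "GSM8K": 92.00, "ARC-Challenge": 96.30,
              "HellaSwag": 95.30, "WinoGrande": 87.50},
    "Gemini 1.0 Ultra": {"MMLU": 83.70, "GSM8K": 94.40, "ARC-Challenge": None,
                         "HellaSwag": 87.80, "WinoGrande": None},
}


def gap_points(yours: Dict[str, float],
               reference: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Reference minus your score, in percentage points; None if unreported."""
    return {
        bench: None if reference.get(bench) is None
        else round(reference[bench] - score, 2)
        for bench, score in yours.items()
    }


gaps = {name: gap_points(YOUR_MODEL, ref) for name, ref in REFERENCES.items()}

for name, per_bench in gaps.items():
    for bench, gap in per_bench.items():
        label = "n/a" if gap is None else f"{gap:+.2f} pts"
        print(f"{name}\t{bench}\t{label}")
```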

4) Latest Frontier Snapshot (2025-2026, non-overlapping tasks)

Source benchmark table: Claude Opus 4.5 system card, Table 2.3.A.

Benchmark	Claude Opus 4.5	Claude Sonnet 4.5	Claude Opus 4.1	Gemini 3 Pro	GPT-5.1
SWE-bench Verified	80.9%	77.2%	74.5%	76.2%	76.3%
Terminal-bench 2.0	59.3%	50.0%	46.5%	54.2%	47.6%
ARC-AGI-2 (Verified)	37.6%	13.6%	—	31.1%	17.6%
GPQA Diamond	87.0%	83.4%	81.0%	91.9%	88.1%
MMMU (validation)	80.7%	77.8%	77.1%	—	85.4%
MMMLU	90.8%	89.1%	89.5%	91.8%	91.0%

Note: These references are newer, but none of their benchmarks appears in your current lm-eval task set, so they are not directly comparable to your results.

5) Caveats

Your run uses lm-evaluation-harness with specific settings; public model-card numbers may use different prompts, few-shot counts, decoding, or evaluation code.
Frontier references in Section 3 predate the current 2026 model generation, but they are official primary-source numbers on the classic benchmarks that overlap with your task set.
Frontier references in Section 4 are current (2025-2026) but mostly on different benchmarks.

Sources

Local run artifact: /workspace/evals/main_results_v3.json/strykes__emberforge-3b-reasoner/results_2026-02-24T00-06-21.474293.json
Nanbeige model card: https://huggingface.co/Nanbeige/Nanbeige4.1-3B
Anthropic Claude 3 model card (benchmarks table): https://www-cdn.anthropic.com/c6a80a657af445f40e31afac050f3bf76d3b1404.pdf
Anthropic model cards index: https://www.anthropic.com/system-cards
Anthropic Claude Opus 4.5 system card: https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf
