Initialize project; model provided by the ModelHub XC community

Model: strykes/emberforge-3b-reasoner Source: Original Platform
2026-05-30 19:09:18 +08:00
commit 7c36fbd792
28 changed files with 5552 additions and 0 deletions
--- a/benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md
+++ b/benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md
@@ -0,0 +1,70 @@
+# Emberforge 3B Benchmark Comparison (Public + Local)
+
+Generated: 2026-02-24
+
+## 1) Your Fine-tuned Model (local lm-eval run)
+Model: `strykes/emberforge-3b-reasoner`
+
+| Task | Metric | Score |
+|---|---:|---:|
+| mmlu | acc,none | 59.98% |
+| gsm8k | exact_match,flexible-extract | 62.40% |
+| arc_challenge | acc_norm,none | 31.74% |
+| hellaswag | acc_norm,none | 56.07% |
+| winogrande | acc,none | 50.04% |
+| piqa | acc_norm,none | 63.22% |
+| boolq | acc,none | 74.37% |
+| truthfulqa_mc2 | acc,none | 45.34% |
+
+## 2) Public Base Model (Nanbeige4.1-3B)
+Model: `Nanbeige/Nanbeige4.1-3B` (author-reported benchmarks)
+
+| Benchmark | Published Score |
+|---|---:|
+| Live-Code-Bench-V6 | 76.90% |
+| AIME 2026 I | 87.40% |
+| HMMT Nov | 77.92% |
+| GPQA | 83.80% |
+| HLE (Text-only) | 12.60% |
+| Arena-Hard-v2 | 73.20% |
+| BFCL-V4 | 56.50% |
+| Tau2-Bench | 48.57% |
+
+Note: Nanbeige's published benchmarks do not overlap with your lm-eval task set (`mmlu`, `gsm8k`, `arc_challenge`, etc.), so no apples-to-apples delta against the base model can be computed without rerunning it on the same tasks.
+
+## 3) Public Frontier Reference (Claude / GPT / Gemini) on overlapping classic tasks
+Source benchmark table: Anthropic Claude 3 model card (March 2024).
+
+| Benchmark | Your model | Claude 3 Opus | Claude 3 Sonnet | GPT-4 | Gemini 1.0 Ultra | Gemini 1.5 Pro |
+|---|---:|---:|---:|---:|---:|---:|
+| MMLU (5-shot) | 59.98% | 86.80% | 79.00% | 86.40% | 83.70% | 81.90% |
+| GSM8K | 62.40% | 95.00% | 92.30% | 92.00% | 94.40% | 91.70% |
+| ARC-Challenge (25-shot) | 31.74% | 96.40% | 93.20% | 96.30% | — | — |
+| HellaSwag (10-shot) | 56.07% | 95.40% | 89.00% | 95.30% | 87.80% | 92.50% |
+| WinoGrande (5-shot) | 50.04% | 88.50% | 75.10% | 87.50% | — | — |
+
+## 4) Latest Frontier Snapshot (2025-2026, non-overlapping tasks)
+Source benchmark table: Claude Opus 4.5 system card, Table 2.3.A.
+
+| Benchmark | Claude Opus 4.5 | Claude Sonnet 4.5 | Claude Opus 4.1 | Gemini 3 Pro | GPT-5.1 |
+|---|---:|---:|---:|---:|---:|
+| SWE-bench Verified | 80.9% | 77.2% | 74.5% | 76.2% | 76.3% |
+| Terminal-bench 2.0 | 59.3% | 50.0% | 46.5% | 54.2% | 47.6% |
+| ARC-AGI-2 (Verified) | 37.6% | 13.6% | — | 31.1% | 17.6% |
+| GPQA Diamond | 87.0% | 83.4% | 81.0% | 91.9% | 88.1% |
+| MMMU (validation) | 80.7% | 77.8% | 77.1% | — | 85.4% |
+| MMMLU | 90.8% | 89.1% | 89.5% | 91.8% | 91.0% |
+
+Note: These are newer references but still not directly comparable to your current lm-eval task set.
+
+## 5) Caveats
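+- Your run uses `lm-evaluation-harness` with its own settings; public model-card numbers may use different prompts, few-shot counts, decoding, or evaluation code. The shot counts in the Section 3 row labels describe the model-card setup and may not match the local run.
+- Frontier references in Section 3 predate the current 2026 model generations, but they come from an official model card (Anthropic's Claude 3 card, which also reports the GPT-4 and Gemini figures) and cover overlapping classic benchmarks.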
+- Frontier references in Section 4 are current (2025-2026) but mostly on different benchmarks.
+
+## Sources
+- Local run artifact: `/workspace/evals/main_results_v3.json/strykes__emberforge-3b-reasoner/results_2026-02-24T00-06-21.474293.json`
+- Nanbeige model card: https://huggingface.co/Nanbeige/Nanbeige4.1-3B
+- Anthropic Claude 3 model card (benchmarks table): https://www-cdn.anthropic.com/c6a80a657af445f40e31afac050f3bf76d3b1404.pdf
+- Anthropic model cards index: https://www.anthropic.com/system-cards
+- Anthropic Claude Opus 4.5 system card: https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf
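The percentage-point gaps implied by the Section 3 table can be worked out with a short sketch like the one below. The scores are copied from that table; the names `SECTION3` and `gaps` are hypothetical and exist only for this illustration. Because the local run and the model-card numbers may use different prompts and few-shot counts (see Caveats), the result is plain arithmetic on the listed figures, not a like-for-like comparison.

```python
# Scores (percent) copied from the Section 3 table.
# None marks a cell the table leaves empty.
SECTION3 = {
    "MMLU": {"local": 59.98, "Claude 3 Opus": 86.80, "Claude 3 Sonnet": 79.00,
             "GPT-4": 86.40, "Gemini 1.0 Ultra": 83.70, "Gemini 1.5 Pro": 81.90},
    "GSM8K": {"local": 62.40, "Claude 3 Opus": 95.00, "Claude 3 Sonnet": 92.30,
              "GPT-4": 92.00, "Gemini 1.0 Ultra": 94.40, "Gemini 1.5 Pro": 91.70},
    "ARC-Challenge": {"local": 31.74, "Claude 3 Opus": 96.40, "Claude 3 Sonnet": 93.20,
                      "GPT-4": 96.30, "Gemini 1.0 Ultra": None, "Gemini 1.5 Pro": None},
    "HellaSwag": {"local": 56.07, "Claude 3 Opus": 95.40, "Claude 3 Sonnet": 89.00,
                  "GPT-4": 95.30, "Gemini 1.0 Ultra": 87.80, "Gemini 1.5 Pro": 92.50},
    "WinoGrande": {"local": 50.04, "Claude 3 Opus": 88.50, "Claude 3 Sonnet": 75.10,
                   "GPT-4": 87.50, "Gemini 1.0 Ultra": None, "Gemini 1.5 Pro": None},
}


def gaps(table, baseline="local"):
    """Return reference-minus-baseline gaps in percentage points per benchmark.

    Cells that are None (not listed in the table) are skipped.
    """
    result = {}
    for benchmark, scores in table.items():
        base = scores[baseline]
        result[benchmark] = {
            name: round(value - base, 2)
            for name, value in scores.items()
            if name != baseline and value is not None
        }
    return result


if __name__ == "__main__":
    for benchmark, deltas in gaps(SECTION3).items():
        for name, delta in deltas.items():
            print(f"{benchmark:14s} {name:17s} +{delta:.2f} pp")
```

Missing cells are skipped rather than treated as zero, so a benchmark with no Gemini figure simply reports fewer gaps.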
--- a/benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json
+++ b/benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json
--- a/benchmarks/lm-eval-2026-02-24/run_v3.log
+++ b/benchmarks/lm-eval-2026-02-24/run_v3.log
--- a/benchmarks/lm-eval-2026-02-24/summary_v3.tsv
+++ b/benchmarks/lm-eval-2026-02-24/summary_v3.tsv
@@ -0,0 +1,70 @@
+task	metric	value
+arc_challenge	acc_norm,none	0.3174061433447099
+boolq	acc,none	0.7437308868501529
+gsm8k	exact_match,flexible-extract	0.6239575435936315
+hellaswag	acc_norm,none	0.560744871539534
+mmlu	acc,none	0.5997721122347244
+mmlu_abstract_algebra	acc,none	0.43
+mmlu_anatomy	acc,none	0.6074074074074074
+mmlu_astronomy	acc,none	0.6973684210526315
+mmlu_business_ethics	acc,none	0.62
+mmlu_clinical_knowledge	acc,none	0.6415094339622641
+mmlu_college_biology	acc,none	0.8263888888888888
+mmlu_college_chemistry	acc,none	0.53
+mmlu_college_computer_science	acc,none	0.54
+mmlu_college_mathematics	acc,none	0.5
+mmlu_college_medicine	acc,none	0.5953757225433526
+mmlu_college_physics	acc,none	0.5
+mmlu_computer_security	acc,none	0.68
+mmlu_conceptual_physics	acc,none	0.5872340425531914
+mmlu_econometrics	acc,none	0.35964912280701755
+mmlu_electrical_engineering	acc,none	0.6413793103448275
+mmlu_elementary_mathematics	acc,none	0.5317460317460317
+mmlu_formal_logic	acc,none	0.5
+mmlu_global_facts	acc,none	0.33
+mmlu_high_school_biology	acc,none	0.7548387096774194
+mmlu_high_school_chemistry	acc,none	0.6009852216748769
+mmlu_high_school_computer_science	acc,none	0.69
+mmlu_high_school_european_history	acc,none	0.7696969696969697
+mmlu_high_school_geography	acc,none	0.7272727272727273
+mmlu_high_school_government_and_politics	acc,none	0.7461139896373057
+mmlu_high_school_macroeconomics	acc,none	0.6435897435897436
+mmlu_high_school_mathematics	acc,none	0.45555555555555555
+mmlu_high_school_microeconomics	acc,none	0.7773109243697479
+mmlu_high_school_physics	acc,none	0.5165562913907285
+mmlu_high_school_psychology	acc,none	0.8
+mmlu_high_school_statistics	acc,none	0.5694444444444444
+mmlu_high_school_us_history	acc,none	0.7156862745098039
+mmlu_high_school_world_history	acc,none	0.7974683544303798
+mmlu_human_aging	acc,none	0.600896860986547
+mmlu_human_sexuality	acc,none	0.6946564885496184
+mmlu_humanities	acc,none	0.5300743889479277
+mmlu_international_law	acc,none	0.7851239669421488
+mmlu_jurisprudence	acc,none	0.7222222222222222
+mmlu_logical_fallacies	acc,none	0.6932515337423313
+mmlu_machine_learning	acc,none	0.42857142857142855
+mmlu_management	acc,none	0.6893203883495146
+mmlu_marketing	acc,none	0.8034188034188035
+mmlu_medical_genetics	acc,none	0.69
+mmlu_miscellaneous	acc,none	0.6717752234993615
+mmlu_moral_disputes	acc,none	0.5953757225433526
+mmlu_moral_scenarios	acc,none	0.2446927374301676
+mmlu_nutrition	acc,none	0.6764705882352942
+mmlu_other	acc,none	0.6269713550048278
+mmlu_philosophy	acc,none	0.6559485530546624
+mmlu_prehistory	acc,none	0.6265432098765432
+mmlu_professional_accounting	acc,none	0.4397163120567376
+mmlu_professional_law	acc,none	0.4745762711864407
+mmlu_professional_medicine	acc,none	0.6838235294117647
+mmlu_professional_psychology	acc,none	0.5915032679738562
+mmlu_public_relations	acc,none	0.6
+mmlu_security_studies	acc,none	0.7020408163265306
+mmlu_social_sciences	acc,none	0.6906077348066298
+mmlu_sociology	acc,none	0.7711442786069652
+mmlu_stem	acc,none	0.5883285759594037
+mmlu_us_foreign_policy	acc,none	0.78
+mmlu_virology	acc,none	0.45180722891566266
+mmlu_world_religions	acc,none	0.7192982456140351
+piqa	acc_norm,none	0.6322089227421109
+truthfulqa_mc2	acc,none	0.45340473177307805
+winogrande	acc,none	0.500394632991318
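The percentages in Section 1 of the comparison report correspond to the raw fractions in `summary_v3.tsv`, rounded to two decimals. A minimal sketch of that conversion follows, assuming the file is tab-separated with the `task`, `metric`, `value` header shown above. The helper names `load_summary` and `as_percent` are hypothetical, and `SAMPLE` embeds a few rows from the TSV so the block runs without the file.

```python
import csv
import io

# A few rows copied from summary_v3.tsv (tab-separated), embedded so the
# example is self-contained.
SAMPLE = (
    "task\tmetric\tvalue\n"
    "arc_challenge\tacc_norm,none\t0.3174061433447099\n"
    "gsm8k\texact_match,flexible-extract\t0.6239575435936315\n"
    "mmlu\tacc,none\t0.5997721122347244\n"
    "winogrande\tacc,none\t0.500394632991318\n"
)


def load_summary(text):
    """Parse TSV text into {task: (metric, value)} with value as a float."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["task"]: (row["metric"], float(row["value"])) for row in reader}


def as_percent(value):
    """Format a fraction in [0, 1] as a percentage with two decimals."""
    return f"{value * 100:.2f}%"


if __name__ == "__main__":
    summary = load_summary(SAMPLE)
    print("| Task | Metric | Score |")
    print("|---|---:|---:|")
    for task, (metric, value) in sorted(summary.items()):
        print(f"| {task} | {metric} | {as_percent(value)} |")
```

To run it against the real file, read its text with `open(path, encoding="utf-8").read()` and pass that to `load_summary` in place of `SAMPLE`.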