Initialize project; model provided by the ModelHub XC community
Model: strykes/emberforge-3b-reasoner Source: Original Platform
@@ -0,0 +1,70 @@
# Emberforge 3B Benchmark Comparison (Public + Local)

Generated: 2026-02-24

## 1) Your Finetuned Model (local lm-eval run)
Model: `strykes/emberforge-3b-reasoner`

| Task | Metric | Score |
|---|---:|---:|
| mmlu | acc,none | 59.98% |
| gsm8k | exact_match,flexible-extract | 62.40% |
| arc_challenge | acc_norm,none | 31.74% |
| hellaswag | acc_norm,none | 56.07% |
| winogrande | acc,none | 50.04% |
| piqa | acc_norm,none | 63.22% |
| boolq | acc,none | 74.37% |
| truthfulqa_mc2 | acc,none | 45.34% |

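The percentages in the Section 1 table are the raw fractions from `summary_v3.tsv` multiplied by 100 and rounded to two decimals. The following is a minimal sketch of that conversion, not the script that produced this report. It assumes the summary file is tab-separated with a `task`, `metric`, `value` header; the function name `summary_to_markdown` and the `HEADLINE_TASKS` list are hypothetical.

```python
import csv
import io

# Hypothetical selection: the eight tasks shown in the Section 1 table.
HEADLINE_TASKS = [
    "mmlu", "gsm8k", "arc_challenge", "hellaswag",
    "winogrande", "piqa", "boolq", "truthfulqa_mc2",
]


def summary_to_markdown(tsv_text, tasks=HEADLINE_TASKS):
    """Render selected rows of a task/metric/value TSV as a Markdown table.

    Assumes tab-separated input with a header row; metric names such as
    "acc,none" contain commas, so a tab delimiter is required.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = {row["task"]: row for row in reader}

    lines = ["| Task | Metric | Score |", "|---|---:|---:|"]
    for task in tasks:
        row = rows[task]
        percent = float(row["value"]) * 100  # fraction -> percent
        lines.append(f"| {task} | {row['metric']} | {percent:.2f}% |")
    return "\n".join(lines)
```

Calling `summary_to_markdown(open("summary_v3.tsv", encoding="utf-8").read())` would print the same rows as the table above, provided the file layout matches the assumption.
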
## 2) Public Base Model (Nanbeige4.1-3B)
Model: `Nanbeige/Nanbeige4.1-3B` (author-reported benchmarks)

| Benchmark | Published Score |
|---|---:|
| Live-Code-Bench-V6 | 76.90% |
| AIME 2026 I | 87.40% |
| HMMT Nov | 77.92% |
| GPQA | 83.80% |
| HLE (Text-only) | 12.60% |
| Arena-Hard-v2 | 73.20% |
| BFCL-V4 | 56.50% |
| Tau2-Bench | 48.57% |

Note: Nanbeige's published benchmarks do not overlap with your lm-eval task set (`mmlu`, `gsm8k`, `arc_challenge`, etc.), so no like-for-like delta can be computed without rerunning the same tasks on the base model.

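If the base model is later evaluated on the same lm-eval tasks and summarized in the same `task`/`metric`/`value` layout, the delta could be computed along these lines. This is a sketch under that assumption: no base-model summary exists in this commit, and the function names `load_scores` and `score_deltas` are hypothetical.

```python
import csv


def load_scores(path):
    """Map (task, metric) -> float score from a tab-separated summary file.

    Assumes a header row of: task, metric, value.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return {(row["task"], row["metric"]): float(row["value"]) for row in reader}


def score_deltas(finetuned, base):
    """Percentage-point deltas (finetuned minus base).

    Only (task, metric) pairs present in both runs are compared, so a
    task that was evaluated on one model only is silently skipped.
    """
    shared = sorted(finetuned.keys() & base.keys())
    return {key: round((finetuned[key] - base[key]) * 100, 2) for key in shared}
```

The delta is only meaningful when both runs use the same harness version, prompts, few-shot counts, and decoding settings.
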
## 3) Public Frontier Reference (Claude / GPT / Gemini) on overlapping classic tasks
Source benchmark table: Anthropic Claude 3 model card (March 2024).

| Benchmark | Your model | Claude 3 Opus | Claude 3 Sonnet | GPT-4 | Gemini 1.0 Ultra | Gemini 1.5 Pro |
|---|---:|---:|---:|---:|---:|---:|
| MMLU (5-shot) | 59.98% | 86.80% | 79.00% | 86.40% | 83.70% | 81.90% |
| GSM8K | 62.40% | 95.00% | 92.30% | 92.00% | 94.40% | 91.70% |
| ARC-Challenge (25-shot) | 31.74% | 96.40% | 93.20% | 96.30% | — | — |
| HellaSwag (10-shot) | 56.07% | 95.40% | 89.00% | 95.30% | 87.80% | 92.50% |
| WinoGrande (5-shot) | 50.04% | 88.50% | 75.10% | 87.50% | — | — |

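To read the Section 3 table as gaps rather than absolute scores, the rows can be turned into percentage-point differences against a chosen reference column. The sketch below copies the values from the table as-is; the names `SCORES` and `gaps_to` are hypothetical, and `None` stands for cells the table leaves empty. Because the published numbers may use different prompts and few-shot counts from the local run, the result is a rough indication only.

```python
# Scores in percent, copied from the Section 3 table. None = not reported.
SCORES = {
    "MMLU (5-shot)": {
        "Your model": 59.98, "Claude 3 Opus": 86.80, "Claude 3 Sonnet": 79.00,
        "GPT-4": 86.40, "Gemini 1.0 Ultra": 83.70, "Gemini 1.5 Pro": 81.90,
    },
    "GSM8K": {
        "Your model": 62.40, "Claude 3 Opus": 95.00, "Claude 3 Sonnet": 92.30,
        "GPT-4": 92.00, "Gemini 1.0 Ultra": 94.40, "Gemini 1.5 Pro": 91.70,
    },
    "ARC-Challenge (25-shot)": {
        "Your model": 31.74, "Claude 3 Opus": 96.40, "Claude 3 Sonnet": 93.20,
        "GPT-4": 96.30, "Gemini 1.0 Ultra": None, "Gemini 1.5 Pro": None,
    },
    "HellaSwag (10-shot)": {
        "Your model": 56.07, "Claude 3 Opus": 95.40, "Claude 3 Sonnet": 89.00,
        "GPT-4": 95.30, "Gemini 1.0 Ultra": 87.80, "Gemini 1.5 Pro": 92.50,
    },
    "WinoGrande (5-shot)": {
        "Your model": 50.04, "Claude 3 Opus": 88.50, "Claude 3 Sonnet": 75.10,
        "GPT-4": 87.50, "Gemini 1.0 Ultra": None, "Gemini 1.5 Pro": None,
    },
}


def gaps_to(reference, scores=SCORES, own="Your model"):
    """Percentage-point gap (reference minus own) per benchmark.

    Benchmarks where the reference model has no reported score are omitted.
    """
    gaps = {}
    for benchmark, row in scores.items():
        ref_score = row.get(reference)
        if ref_score is not None:
            gaps[benchmark] = round(ref_score - row[own], 2)
    return gaps
```

For example, `gaps_to("Claude 3 Opus")` covers all five rows, while `gaps_to("Gemini 1.5 Pro")` skips ARC-Challenge and WinoGrande.
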
## 4) Latest Frontier Snapshot (2025-2026, non-overlapping tasks)
Source benchmark table: Claude Opus 4.5 system card, Table 2.3.A.

| Benchmark | Claude Opus 4.5 | Claude Sonnet 4.5 | Claude Opus 4.1 | Gemini 3 Pro | GPT-5.1 |
|---|---:|---:|---:|---:|---:|
| SWE-bench Verified | 80.9% | 77.2% | 74.5% | 76.2% | 76.3% |
| Terminal-bench 2.0 | 59.3% | 50.0% | 46.5% | 54.2% | 47.6% |
| ARC-AGI-2 (Verified) | 37.6% | 13.6% | — | 31.1% | 17.6% |
| GPQA Diamond | 87.0% | 83.4% | 81.0% | 91.9% | 88.1% |
| MMMU (validation) | 80.7% | 77.8% | 77.1% | — | 85.4% |
| MMMLU | 90.8% | 89.1% | 89.5% | 91.8% | 91.0% |

Note: These are newer references but still not directly comparable to your current lm-eval task set.

## 5) Caveats
- Your run uses `lm-evaluation-harness` with specific settings; public model-card numbers may use different prompts, few-shot counts, decoding, or evaluation code.
- Frontier references in Section 3 are older than current 2026 generations but are official primary-source numbers on overlapping classic benchmarks.
- Frontier references in Section 4 are current (2025-2026) but mostly on different benchmarks.

## Sources
- Local run artifact: `/workspace/evals/main_results_v3.json/strykes__emberforge-3b-reasoner/results_2026-02-24T00-06-21.474293.json`
- Nanbeige model card: https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- Anthropic Claude 3 model card (benchmarks table): https://www-cdn.anthropic.com/c6a80a657af445f40e31afac050f3bf76d3b1404.pdf
- Anthropic model cards index: https://www.anthropic.com/system-cards
- Anthropic Claude Opus 4.5 system card: https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf
File diff suppressed because one or more lines are too long
426
benchmarks/lm-eval-2026-02-24/run_v3.log
Normal file
File diff suppressed because one or more lines are too long
70
benchmarks/lm-eval-2026-02-24/summary_v3.tsv
Normal file
@@ -0,0 +1,70 @@
task	metric	value
arc_challenge	acc_norm,none	0.3174061433447099
boolq	acc,none	0.7437308868501529
gsm8k	exact_match,flexible-extract	0.6239575435936315
hellaswag	acc_norm,none	0.560744871539534
mmlu	acc,none	0.5997721122347244
mmlu_abstract_algebra	acc,none	0.43
mmlu_anatomy	acc,none	0.6074074074074074
mmlu_astronomy	acc,none	0.6973684210526315
mmlu_business_ethics	acc,none	0.62
mmlu_clinical_knowledge	acc,none	0.6415094339622641
mmlu_college_biology	acc,none	0.8263888888888888
mmlu_college_chemistry	acc,none	0.53
mmlu_college_computer_science	acc,none	0.54
mmlu_college_mathematics	acc,none	0.5
mmlu_college_medicine	acc,none	0.5953757225433526
mmlu_college_physics	acc,none	0.5
mmlu_computer_security	acc,none	0.68
mmlu_conceptual_physics	acc,none	0.5872340425531914
mmlu_econometrics	acc,none	0.35964912280701755
mmlu_electrical_engineering	acc,none	0.6413793103448275
mmlu_elementary_mathematics	acc,none	0.5317460317460317
mmlu_formal_logic	acc,none	0.5
mmlu_global_facts	acc,none	0.33
mmlu_high_school_biology	acc,none	0.7548387096774194
mmlu_high_school_chemistry	acc,none	0.6009852216748769
mmlu_high_school_computer_science	acc,none	0.69
mmlu_high_school_european_history	acc,none	0.7696969696969697
mmlu_high_school_geography	acc,none	0.7272727272727273
mmlu_high_school_government_and_politics	acc,none	0.7461139896373057
mmlu_high_school_macroeconomics	acc,none	0.6435897435897436
mmlu_high_school_mathematics	acc,none	0.45555555555555555
mmlu_high_school_microeconomics	acc,none	0.7773109243697479
mmlu_high_school_physics	acc,none	0.5165562913907285
mmlu_high_school_psychology	acc,none	0.8
mmlu_high_school_statistics	acc,none	0.5694444444444444
mmlu_high_school_us_history	acc,none	0.7156862745098039
mmlu_high_school_world_history	acc,none	0.7974683544303798
mmlu_human_aging	acc,none	0.600896860986547
mmlu_human_sexuality	acc,none	0.6946564885496184
mmlu_humanities	acc,none	0.5300743889479277
mmlu_international_law	acc,none	0.7851239669421488
mmlu_jurisprudence	acc,none	0.7222222222222222
mmlu_logical_fallacies	acc,none	0.6932515337423313
mmlu_machine_learning	acc,none	0.42857142857142855
mmlu_management	acc,none	0.6893203883495146
mmlu_marketing	acc,none	0.8034188034188035
mmlu_medical_genetics	acc,none	0.69
mmlu_miscellaneous	acc,none	0.6717752234993615
mmlu_moral_disputes	acc,none	0.5953757225433526
mmlu_moral_scenarios	acc,none	0.2446927374301676
mmlu_nutrition	acc,none	0.6764705882352942
mmlu_other	acc,none	0.6269713550048278
mmlu_philosophy	acc,none	0.6559485530546624
mmlu_prehistory	acc,none	0.6265432098765432
mmlu_professional_accounting	acc,none	0.4397163120567376
mmlu_professional_law	acc,none	0.4745762711864407
mmlu_professional_medicine	acc,none	0.6838235294117647
mmlu_professional_psychology	acc,none	0.5915032679738562
mmlu_public_relations	acc,none	0.6
mmlu_security_studies	acc,none	0.7020408163265306
mmlu_social_sciences	acc,none	0.6906077348066298
mmlu_sociology	acc,none	0.7711442786069652
mmlu_stem	acc,none	0.5883285759594037
mmlu_us_foreign_policy	acc,none	0.78
mmlu_virology	acc,none	0.45180722891566266
mmlu_world_religions	acc,none	0.7192982456140351
piqa	acc_norm,none	0.6322089227421109
truthfulqa_mc2	acc,none	0.45340473177307805
winogrande	acc,none	0.500394632991318
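
The per-subject MMLU rows in `summary_v3.tsv` can be ranked to see where the model is weakest. This is a minimal sketch, assuming the file is tab-separated with a `task`, `metric`, `value` header. The name `weakest_mmlu_subjects` is hypothetical, and the four entries in `MMLU_GROUPS` are treated as category aggregates rather than individual subjects.

```python
import csv
import io

# Category-level aggregates present in the summary; excluded from the ranking.
MMLU_GROUPS = {"mmlu_humanities", "mmlu_other", "mmlu_social_sciences", "mmlu_stem"}


def weakest_mmlu_subjects(tsv_text, n=5):
    """Return the n lowest-scoring MMLU subjects as (task, value) pairs.

    Skips the overall "mmlu" row, the category aggregates, and every
    non-MMLU task.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    subjects = [
        (row["task"], float(row["value"]))
        for row in reader
        if row["task"].startswith("mmlu_") and row["task"] not in MMLU_GROUPS
    ]
    return sorted(subjects, key=lambda item: item[1])[:n]
```

On the rows above, the three lowest subjects are `mmlu_moral_scenarios`, `mmlu_global_facts`, and `mmlu_econometrics`.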