license: apache-2.0
tags: transformers, safetensors, gguf, peft, qlora, reasoning
library_name: transformers
pipeline_tag: text-generation
EmberForge-3B-Reasoner
A private reasoning fine-tune of Nanbeige4.1-3B, released by strykes.
Included Artifacts
Merged full model (Safetensors) at the repo root, for benchmarking with Hugging Face tooling
LoRA adapter in adapter/
GGUF files in gguf/:
  Nanbeige4.1-3B-Q5_K_M.gguf
  Nanbeige4.1-3B-Q4_K_M.gguf
  Nanbeige4.1-3B-f16.gguf
Optional archive in archives/
Training Snapshot
Base: Nanbeige/Nanbeige4.1-3B
Method: QLoRA fine-tuning with Unsloth, then the adapter merged into the base weights
Data: ~3.5k synthetic reasoning samples
Epochs: 2
Sequence length: 4096
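For a rough sense of scale, the figures above put a ceiling on the number of tokens seen during training. The sketch below is only an illustration: it assumes "~3.5k" means exactly 3,500 samples and that every sample fills the full 4096-token window, so the result is an upper bound rather than a measured count. The function name is hypothetical.

```python
# Upper bound on training tokens implied by the training snapshot.
# Assumption: "~3.5k" is taken as exactly 3,500 samples.
NUM_SAMPLES = 3_500
EPOCHS = 2
MAX_SEQ_LEN = 4096


def max_training_tokens(num_samples: int, epochs: int, max_seq_len: int) -> int:
    """Return the token count if every sample filled the whole context window."""
    return num_samples * epochs * max_seq_len


if __name__ == "__main__":
    bound = max_training_tokens(NUM_SAMPLES, EPOCHS, MAX_SEQ_LEN)
    print(f"At most {bound:,} tokens across {EPOCHS} epochs")
```

Real samples are usually shorter than the maximum sequence length, so the actual token count is likely well below this bound.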
Notes
Intended for research and benchmarking.
Validate outputs before critical use.
Benchmarks (2026-02-24)
Local lm-eval results (this finetune)
| Task | Metric | Score |
| --- | --- | --- |
| mmlu | acc,none | 59.98% |
| gsm8k | exact_match,flexible-extract | 62.40% |
| arc_challenge | acc_norm,none | 31.74% |
| hellaswag | acc_norm,none | 56.07% |
| winogrande | acc,none | 50.04% |
| piqa | acc_norm,none | 63.22% |
| boolq | acc,none | 74.37% |
| truthfulqa_mc2 | acc,none | 45.34% |
Public references
Author-published benchmarks for the base model (Nanbeige/Nanbeige4.1-3B) are listed in:
benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md
Frontier references (Claude/GPT/Gemini) are included in the same comparison report.
Reproducibility artifacts
benchmarks/lm-eval-2026-02-24/summary_v3.tsv
benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json
benchmarks/lm-eval-2026-02-24/run_v3.log
benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md
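The scores reported in this card can be cross-checked against the results JSON listed above. The following is a minimal sketch, not the script used to produce the table: it assumes the file follows the usual lm-evaluation-harness layout, with a top-level "results" mapping from task name to metric values stored as fractions. The function names are hypothetical; the task and metric keys are the ones shown in the results table.

```python
import json
from pathlib import Path

# Task -> metric key, as reported in this card's results table.
METRICS = {
    "mmlu": "acc,none",
    "gsm8k": "exact_match,flexible-extract",
    "arc_challenge": "acc_norm,none",
    "hellaswag": "acc_norm,none",
    "winogrande": "acc,none",
    "piqa": "acc_norm,none",
    "boolq": "acc,none",
    "truthfulqa_mc2": "acc,none",
}


def extract_scores(report: dict, metrics: dict = METRICS) -> dict:
    """Return {task: percent} for every task/metric pair found in the report.

    Assumes report["results"][task][metric] holds a fraction in [0, 1].
    Tasks or metrics missing from the report are skipped silently.
    """
    results = report.get("results", {})
    scores = {}
    for task, metric in metrics.items():
        value = results.get(task, {}).get(metric)
        if value is not None:
            scores[task] = round(100 * value, 2)
    return scores


def load_scores(path: str) -> dict:
    """Read a results JSON file from disk and extract the scores."""
    with Path(path).open(encoding="utf-8") as fh:
        return extract_scores(json.load(fh))
```

Calling `load_scores("benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json")` from the repo root should then return one percentage per task, provided the file matches the assumed layout.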
Caveat
Scores published in model cards are not always directly comparable with local lm-evaluation-harness results: prompting, few-shot count, decoding, and benchmark versions can differ.
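One way to make the caveat above concrete is to record the settings of each run and list where two runs differ before comparing their scores. The sketch below is purely illustrative: `EvalSettings`, its field names, and `setting_differences` are hypothetical and are not part of lm-evaluation-harness or of this repo.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalSettings:
    """Hypothetical record of the settings named in the caveat."""
    prompt_format: str
    num_fewshot: int
    decoding: str
    benchmark_version: str


def setting_differences(a: EvalSettings, b: EvalSettings) -> list[str]:
    """Return the names of the fields whose values differ between two runs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


if __name__ == "__main__":
    # Placeholder values only; they do not describe any real run.
    local_run = EvalSettings("plain", 0, "greedy", "v1")
    published = EvalSettings("chat", 5, "greedy", "v1")
    print(setting_differences(local_run, published))
```

An empty list suggests two scores were produced under matching settings; any listed field is a reason to treat the comparison with caution.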