A 3B-parameter instruction-tuned language model optimized for reasoning, math, and code generation tasks, powered by our new ADS (Adaptive Dual-Search Distillation) technique.
Model Details
| Model | Kai-3B-Instruct |
|---|---|
| Architecture | SmolLM3ForCausalLM |
| Parameters | 3B |
| Hidden size | 2048 |
| Intermediate size | 11008 |
| Layers | 36 |
| Attention heads | 16 (4 KV heads, GQA) |
| Context length | 65536 |
| Precision | bfloat16 |
| Vocab size | 128,256 |
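To make the effect of grouped-query attention (GQA) concrete, the sketch below estimates the key/value cache size implied by the figures above. It assumes the per-head dimension equals hidden size divided by the number of attention heads, which the card does not state explicitly, and it ignores any framework overhead.

```python
# Values taken from the model details above.
HIDDEN_SIZE = 2048
NUM_LAYERS = 36
NUM_ATTENTION_HEADS = 16
NUM_KV_HEADS = 4
CONTEXT_LENGTH = 65536
BYTES_PER_VALUE = 2  # bfloat16

# Assumption (not stated in the card): head_dim = hidden_size / attention heads.
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS


def kv_cache_bytes(num_tokens: int, kv_heads: int = NUM_KV_HEADS) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


gqa = kv_cache_bytes(CONTEXT_LENGTH)
mha = kv_cache_bytes(CONTEXT_LENGTH, kv_heads=NUM_ATTENTION_HEADS)
print(f"head_dim: {HEAD_DIM}")
print(f"KV cache at full context with 4 KV heads:  {gqa / 2**30:.2f} GiB")
print(f"KV cache at full context with 16 KV heads: {mha / 2**30:.2f} GiB")
```

Under these assumptions, using 4 KV heads instead of 16 cuts the cache for a full 65536-token sequence to a quarter of what standard multi-head attention would need.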
What is ADS?
Adaptive Dual-Search Distillation (ADS) treats model fine-tuning as a constrained optimization problem, an approach inspired by Operations Research. The core mechanism is a dynamic loss function with a stateful dual penalty factor that adapts to embedding-space entropy. This pushes the model toward high-confidence predictions at difficult reasoning points without modifying the model architecture.
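The card does not give the ADS loss itself, so the following is only an illustrative sketch of the general idea of a loss with a stateful dual penalty, not the actual ADS implementation. It assumes a Lagrangian-style formulation with a dual-ascent update, and it uses the entropy of the output distribution as a stand-in for the embedding-space entropy mentioned above. The class name `DualPenaltyLoss` and the parameters `target_entropy` and `dual_lr` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class DualPenaltyLoss:
    """Hypothetical loss: cross-entropy + lam * (entropy - target), with lam >= 0."""

    def __init__(self, target_entropy=0.5, dual_lr=0.1):
        self.target_entropy = target_entropy  # assumed constraint level
        self.dual_lr = dual_lr                # assumed step size for the dual update
        self.lam = 0.0                        # stateful dual penalty factor

    def __call__(self, logits, label):
        probs = softmax(logits)
        cross_entropy = -math.log(probs[label])
        violation = entropy(probs) - self.target_entropy
        loss = cross_entropy + self.lam * violation
        # Dual ascent: grow the penalty while predictions are too uncertain,
        # shrink it (never below zero) once the constraint is satisfied.
        self.lam = max(0.0, self.lam + self.dual_lr * violation)
        return loss


loss_fn = DualPenaltyLoss()
for _ in range(3):
    loss_fn([0.1, 0.0, -0.1], label=0)  # near-uniform, high-entropy prediction
print(f"penalty after uncertain steps: {loss_fn.lam:.3f}")
loss_fn([10.0, 0.0, 0.0], label=0)      # confident, low-entropy prediction
print(f"penalty after a confident step: {loss_fn.lam:.3f}")
```

The penalty factor carries state across calls: it rises while predictions stay uncertain and relaxes once they become confident, which is the behavior the description above attributes to the dual penalty.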
Benchmark Results
General (5-shot, log-likelihood)
| Model | Params | MMLU | ARC-c (acc_norm) | HellaSwag (acc_norm) | PIQA (acc_norm) |
|---|---|---|---|---|---|
| TinyLlama | 1.1B | ~26.0% | ~33.0% | ~60.0% | ~71.0% |
| SmolLM2 | 1.7B | ~35.0% | ~38.0% | ~65.0% | ~74.0% |
| Llama-2-7B | 7B | 45.3% | 46.2% | 77.2% | 79.8% |
| Gemma-2-2B | 2.6B | ~52.0% | ~53.0% | 75.0% | ~78.0% |
| Kai-3B-Instruct | 3B | 53.62% | 51.88% | 69.53% | 77.53% |
| Qwen2.5-3B | 3B | ~63.0% | ~55.0% | ~73.0% | ~80.0% |
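For readers unfamiliar with log-likelihood scoring, the sketch below shows how a multiple-choice answer is picked and how `acc_norm` can differ from plain accuracy. It assumes the common convention (used, for example, by EleutherAI's lm-evaluation-harness) of dividing each choice's log-likelihood by its byte length; the card does not name the evaluation harness, and the choices and log-likelihood values here are made up for illustration.

```python
def pick_answer(choices, loglikelihoods, normalize=False):
    """Return the index of the highest-scoring choice.

    With normalize=True, each log-likelihood is divided by the byte length
    of the choice, so longer answers are not penalized for their length.
    """
    scores = []
    for text, ll in zip(choices, loglikelihoods):
        length = len(text.encode("utf-8")) if normalize else 1
        scores.append(ll / length)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical model log-likelihoods for two candidate completions.
choices = ["melts", "turns into a liquid as it absorbs heat"]
loglikelihoods = [-4.0, -12.0]

print("raw choice:       ", pick_answer(choices, loglikelihoods))
print("normalized choice:", pick_answer(choices, loglikelihoods, normalize=True))
```

In this toy case the raw score favors the short answer, while the length-normalized score favors the longer one, which is why the two metrics can disagree on the same model outputs.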
Code Generation — HumanEval (Pass@1, 0-shot)
| Model | Params | HumanEval (Pass@1) | Notes |
|---|---|---|---|
| Llama-2-7B | 7B | ~12.8% | Kai-3B scores about 3x higher with fewer than half the parameters |
| SmolLM2-1.7B | 1.7B | ~25.0% | Kai-3B with ADS scores +14pp higher |
| Gemma-2-2B | 2.6B | ~30.0% | Kai-3B surpasses Google's heavily distilled Gemma-2-2B by ~9pp |
| Kai-3B-Instruct | 3B | 39.02% | ADS topological pruning, full pipeline |
| GPT-3.5 (Legacy) | 175B | ~48.0% | Kai-3B trails the original GPT-3.5 by only ~9pp |
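As background on the metric, the sketch below implements the standard unbiased pass@k estimator introduced with HumanEval and shows how a Pass@1 figure is averaged over problems. The per-problem results are hypothetical: the card does not say how many samples were drawn per problem, so one sample per problem is assumed. HumanEval has 164 problems, and 64 solved problems would give the 39.02% reported above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical outcome: one sample per problem, 64 of 164 problems solved.
correct_counts = [1] * 64 + [0] * 100
per_problem = [pass_at_k(n=1, c=c, k=1) for c in correct_counts]
score = 100 * sum(per_problem) / len(per_problem)
print(f"Pass@1: {score:.2f}%")
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose generated solution passes all unit tests.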
Math — GSM8K (0-shot)
| Model | Params | GSM8K (exact_match) |
|---|---|---|
| Kai-3B-Instruct | 3B | 39.27% |
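The `exact_match` metric compares the final number in the model's answer with the reference answer. The sketch below is a simplified, hypothetical scorer written for illustration; the exact extraction rules used for the result above are not given in the card, and real harnesses apply their own regular expressions and normalization.

```python
import re

# Matches integers and decimals, allowing thousands separators such as 1,250.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in the text with commas removed, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def exact_match(prediction, reference):
    """True when both texts contain a number and their final numbers are identical strings."""
    predicted = extract_final_number(prediction)
    return predicted is not None and predicted == extract_final_number(reference)


print(exact_match("25 * 4 = 100. The answer is 100", "#### 100"))
print(exact_match("She pays 1,250 dollars in total.", "#### 1250"))
print(exact_match("I am not sure.", "#### 7"))
```

This string comparison is deliberately strict: an answer of `100.0` would not match a reference of `100`, which is one reason scores can vary between evaluation setups.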
Key Observations
Surpasses Llama-2-7B on knowledge and reasoning: Kai-3B outperforms Llama-2-7B on MMLU (+8.3pp) and ARC-Challenge (+5.7pp) with fewer than half the parameters, although Llama-2-7B remains ahead on HellaSwag (77.2% vs. 69.53%) and PIQA (79.8% vs. 77.53%).
Competitive with Gemma-2-2B: Exceeds Google's Gemma-2-2B on MMLU (+1.6pp) and comes within ~0.5pp on PIQA (77.53% vs. ~78.0%), despite Gemma being trained with significantly more compute.
HellaSwag: At 69.53%, Kai-3B leads both sub-2B models (SmolLM2 by ~4.5pp, TinyLlama by ~9.5pp) and trails the compute-heavy Qwen2.5-3B by only ~3.5pp.
PIQA: At 77.53%, Kai-3B nearly matches Gemma-2-2B (~78.0%) and approaches the 3B-class ceiling set by Qwen2.5-3B (~80.0%).
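The percentage-point gaps quoted in these observations can be recomputed from the benchmark table. The short script below does so, taking the approximate ("~") baseline values at face value; the helper name `delta_pp` is just for illustration.

```python
# Scores copied from the general benchmark table; "~" values are used as given.
scores = {
    "Kai-3B-Instruct": {"MMLU": 53.62, "ARC-c": 51.88, "HellaSwag": 69.53, "PIQA": 77.53},
    "Llama-2-7B": {"MMLU": 45.3, "ARC-c": 46.2, "HellaSwag": 77.2, "PIQA": 79.8},
    "Gemma-2-2B": {"MMLU": 52.0, "ARC-c": 53.0, "HellaSwag": 75.0, "PIQA": 78.0},
    "Qwen2.5-3B": {"MMLU": 63.0, "ARC-c": 55.0, "HellaSwag": 73.0, "PIQA": 80.0},
}


def delta_pp(baseline, benchmark, model="Kai-3B-Instruct"):
    """Percentage-point difference of `model` over `baseline`, rounded to 0.1."""
    return round(scores[model][benchmark] - scores[baseline][benchmark], 1)


for baseline in ("Llama-2-7B", "Gemma-2-2B", "Qwen2.5-3B"):
    for benchmark in ("MMLU", "ARC-c", "HellaSwag", "PIQA"):
        print(f"Kai-3B vs {baseline} on {benchmark}: {delta_pp(baseline, benchmark):+.1f}pp")
```

Positive values mean Kai-3B scores higher than the baseline, negative values mean it scores lower.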
Usage
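    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    # Load the model weights in bfloat16, the precision listed in the model details.
    model = AutoModelForCausalLM.from_pretrained(
        "NoesisLab/Kai-3B-Instruct",
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Kai-3B-Instruct")

    messages = [{"role": "user", "content": "What is 25 * 4?"}]
    # add_generation_prompt=True appends the assistant turn marker so the
    # model replies to the user instead of continuing the user's message.
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    )
    output = model.generate(input_ids, max_new_tokens=256)
    print(tokenizer.decode(output[0], skip_special_tokens=True))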