---
license: apache-2.0
language:
- en
base_model: HuggingFaceTB/SmolLM2-1.7B-Instruct
library_name: transformers
tags:
- agentic
- router
- smollm
- escalation
- bf16
---

# NPC Fast 1.7B

**Fast agentic router**: a 1.7B-parameter model that decides whether to handle a request itself or forward it to a larger partner model (`NPC Fin 32B`).

Trained on top of `HuggingFaceTB/SmolLM2-1.7B-Instruct` by [Bottensor](https://bottensor.ai) (a Falcon Hash company).

## What it is

A small, fast router + agentic model. For every user request it emits:

```json
{"route": "self" | "npc_fin", "reason": ""}
```

- `self`: it handles the task directly (lookup, format conversion, short code, tool calls with obvious args, identity, translation, chit-chat)
- `npc_fin`: it forwards to a 32B finance-specialist model (deep multi-step financial reasoning, valuation, derivatives math, long-document synthesis)

## Training recipe

1. **Full-weight continual pre-training** on top of SmolLM2-1.7B-Instruct
   - 2,825 global steps, bf16, flash-attention-2, gradient checkpointing
   - 5-stage curriculum planned (4K → 16K → 32K → 64K → 64K); actual training stopped after stage 2 (16K)
   - Data: ~60K examples (agentic traces, function calling, tool use, reasoning)
   - Liger fused kernels (fused linear CE + fused RMSNorm + fused SwiGLU + RoPE)
   - YaRN RoPE scaling configured for 128K (factor 16), but **not validated past 16K** (see limitations below)
2. **Router LoRA fine-tune** (rank-32, 3 epochs, 189 steps, loss 0.001)
   - 500 router pairs (300 self + 200 npc_fin)
   - Merged back into the base weights; this repo is the merged bf16 checkpoint

## Evaluation

Run against the merged checkpoint at 16K context:

| Benchmark | Metric | Result |
|---|---|---|
| BFCL (tool calling, n=20) | JSON / name / args accuracy | **100% / 100% / 100%** |
| IFEval (n=200, 18 checkable) | instruction pass rate | **77.8%** |
| Agentic tool selection (n=100) | JSON valid / tool accuracy | **100% / 57%** |
| Router, in-distribution (n=200) | accuracy | **100%** (see note) |
| Router, out-of-distribution (n=60) | accuracy | **98.3%** |
| Router, OOD escalation | recall / precision | **100% / 100%** |
| Needle-in-Haystack @ 16K | pass (1 of 5 depths) | 20% |
| Needle-in-Haystack @ 32K+ | pass | **0%** (see limitations) |

*In-distribution router eval uses the same seed query pool as the training set, so the 100% number measures format fidelity, not generalization. The OOD eval uses 60 genuinely novel queries; that 98.3% is the honest router number. The single OOD error was a JSON formatting glitch; the routing decision was correct.*

## Intended use

- Agentic routing: deciding between self-handling and escalation
- Light tool-calling and function-calling tasks
- Short-context (≤16K) instruction following
- Drop-in replacement for SmolLM2 in systems that want a router-fine-tuned head

## Limitations and honest disclosures

- **Context is 16K in practice.** The config advertises 128K via YaRN scaling, but training stopped after the 16K curriculum stage. Needle-in-haystack at 32K/64K/128K produces degenerate output (repetitive tokens). Use at your own risk past 16K.
- **Router trained on a small synthetic dataset** (500 pairs). OOD eval is strong, but data diversity is limited. Expect edge cases on requests that do not fall cleanly into either finance or general tasks.
- **No RLHF / DPO.** This is pure continual pre-training + supervised fine-tuning. Refusal behavior is inherited from the base SmolLM2-Instruct.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ramankrishna10/npc-fast-1.7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("ramankrishna10/npc-fast-1.7b")

SYSTEM = (
    "You are NPC Fast, a capable 1.7B model. Handle most requests yourself. "
    "Only forward to the larger NPC Fin 32B model when a task truly requires "
    "deep multi-step financial analysis that you cannot do well alone.\n\n"
    "Default: route=self.\n"
    "Escalate to npc_fin ONLY if ALL of these are true:\n"
    "  - the task is about finance, markets, banking, derivatives, or valuation\n"
    "  - it requires multi-step quantitative reasoning or deep domain knowledge\n"
    "  - a short answer would be wrong or superficial\n\n"
    "Output exactly one JSON object with fields route and reason."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Build a DCF for TSLA with 3 scenarios."},
]

# return_dict=True yields input_ids and attention_mask together,
# so generate() receives an explicit attention mask.
enc = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**enc, max_new_tokens=60, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
prompt_len = enc["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
# → {"route": "npc_fin", "reason": "multi-step finance model"}
```

### Runtime 4-bit quantization (bitsandbytes)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 weights with bf16 compute; requires the bitsandbytes package.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "ramankrishna10/npc-fast-1.7b",
    quantization_config=bnb,
    device_map="auto",
)
```

### GGUF / llama.cpp

See companion repo: [`ramankrishna10/npc-fast-1.7b-gguf`](https://huggingface.co/ramankrishna10/npc-fast-1.7b-gguf)

## Credits

- Built by **Bottensor** (a **Falcon Hash** company), creator: **dude.npc**
- Base model: [`HuggingFaceTB/SmolLM2-1.7B-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct)
- Training framework: custom trainer wrapping HF Trainer + Liger-Kernel + FlashAttention-2 + YaRN RoPE scaling

## Citation

If you use this model or build on its training recipe, please cite the accompanying preprint:

> Bachu, R. K. (2026). *NPC Fast 1.7B: Building a Usable Small Model on
> a Single H100.* Zenodo. https://doi.org/10.5281/zenodo.19771040

```bibtex
@misc{bachu2026npcfast,
  title     = {NPC Fast 1.7B: Building a Usable Small Model on a Single H100},
  author    = {Bachu, Rama Krishna},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19771040},
  url       = {https://doi.org/10.5281/zenodo.19771040},
  note      = {Preprint},
}
```
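
## Parsing the router output (sketch)

The model is described as emitting one JSON object with `route` and `reason`, and the evaluation notes mention that the single OOD error was a JSON formatting glitch. A caller may therefore want to parse the output defensively. The helper below is a minimal sketch, not part of this repo: the name `parse_route`, the extra `parsed` flag, and the choice to fall back to `self` on malformed output are all assumptions made for illustration (the fallback mirrors the "Default: route=self" line in the system prompt).

```python
import json

# The two route values named in this model card.
VALID_ROUTES = {"self", "npc_fin"}


def parse_route(raw: str, default: str = "self") -> dict:
    """Hypothetical helper: pull the first JSON object out of raw model text.

    Returns a dict with keys route, reason, and parsed. When the text cannot
    be parsed or names an unknown route, falls back to `default` (an
    assumption, not documented model behavior).
    """
    fallback = {"route": default, "reason": "unparseable output", "parsed": False}

    start = raw.find("{")
    if start == -1:
        return fallback

    try:
        # raw_decode tolerates trailing text after the JSON object.
        obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return fallback

    if not isinstance(obj, dict) or obj.get("route") not in VALID_ROUTES:
        return fallback

    return {
        "route": obj["route"],
        "reason": str(obj.get("reason") or ""),
        "parsed": True,
    }


if __name__ == "__main__":
    decision = parse_route('{"route": "npc_fin", "reason": "multi-step finance model"}')
    if decision["route"] == "npc_fin":
        print("forward to the larger model:", decision["reason"])
    else:
        print("handle locally:", decision["reason"])
```

Checking the `parsed` flag lets a caller log malformed outputs separately from genuine `self` decisions, which is useful when tracking how often formatting glitches occur in practice.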