---
language:
- en
license: apache-2.0
tags:
- text-generation
- causal-lm
- llama
- gqa
- rope
- swiglu
- from-scratch
- pretraining
- instruction-tuned
- chatml
datasets:
- HuggingFaceFW/fineweb-edu
- teknium/OpenHermes-2.5
metrics:
- perplexity
pipeline_tag: text-generation
---

# LLM-350M-Instruct-V2

This is V2 of a 350M-parameter language model I trained from scratch as a personal learning project. V2 improves on [V1](https://huggingface.co/sandbreak80sd/llm-350m-instruct) by replacing Alpaca-cleaned with OpenHermes-2.5: 200K GPT-4-generated examples in ChatML format instead of 52K GPT-3.5 examples in Alpaca format. Same pretrained base, better finetuning data.

I'm not a researcher. I don't work at a big lab. I just wanted to understand how LLMs actually work by building one. The whole thing ran on a single rented GPU for about $310 total across both versions.

**[V1 model](https://huggingface.co/sandbreak80sd/llm-350m-instruct)** | **[Training code](https://github.com/sandbreak80/llm-350m)** | **[W&B logs](https://wandb.ai/bstoner-riffyx/llm-350m-finetune-v2)**

---

## ⚠️ Important: Prompt Format

**This model requires ChatML format.** If you send raw text without the template, the model won't recognize it as an instruction and will produce poor output. This is the single most common issue people hit; configure your inference tool before you test.

The format looks like this:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```

Setup instructions for each tool are in the Quick Start section below.

---

## Benchmarks

| Benchmark | V1 (Alpaca) | V2 (OpenHermes) | Δ |
|---|---|---|---|
| HellaSwag | 38.40% | 37.60% | -0.80% |
| LAMBADA | 34.00% | 35.30% | +1.30% |
| ARC-Easy | 58.20% | 58.40% | +0.20% |
| ARC-Challenge | 27.76% | 25.42% | -2.34% |
| WinoGrande | 52.80% | 52.40% | -0.40% |

Val loss: 1.3704 vs V1's 1.7189 (20.3% lower).

Benchmark deltas are small and mixed; that's expected at 350M scale, and they're sensitive to the prompt-format change (Alpaca → ChatML affects loglikelihood scoring). Val loss is the more reliable signal here.

---

## What Changed from V1

| | V1 | V2 |
|---|---|---|
| Finetune data | yahma/alpaca-cleaned (52K, GPT-3.5) | teknium/OpenHermes-2.5 (200K, GPT-4) |
| Prompt format | Alpaca (`### Instruction:`) | **ChatML** (`<\|im_start\|>`) |
| Learning rate | 2e-5 | 1e-5 |
| Finetune iters | 1,500 | 4,000 |
| Anti-forgetting blend | 2,500 FineWeb samples | 10,000 FineWeb samples |
| Val loss | 1.7189 | **1.3704** |

---

## Architecture

Same as V1: a modern LLaMA-style architecture at 350M parameters, with RoPE positional encoding, RMSNorm (pre-norm), SwiGLU activations, Grouped Query Attention (4 KV / 16 query heads), a 2,048-token context, Flash Attention 2, and tied embeddings. Full architecture and pretraining details are in the [V1 model card](https://huggingface.co/sandbreak80sd/llm-350m-instruct).

---

## Quick Start

### Prompt Format (ChatML)

All requests must be wrapped in ChatML tags. Raw questions without the template will not work correctly.

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```
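If you're scripting against the model directly, a small helper keeps the tags consistent. This is a minimal sketch (the `build_chatml_prompt` name is mine, not part of the released code); it just assembles the template shown above:

```python
# Illustrative sketch: build a ChatML prompt string from a list of
# {"role": ..., "content": ...} messages.
def build_chatml_prompt(messages):
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # The trailing open assistant tag cues the model to start its reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sun?"},
])
```

This produces exactly the prompt string used in the Transformers example below. If the repo's tokenizer config ships a ChatML chat template (not verified here), `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should yield the same string.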
---

### LM Studio

**This is the most important setup step: skip it and the model will produce bad output.**

1. Download `llm-350m-instruct-v2-q4_k_m.gguf` from the Files tab (~197MB)
2. Load it in LM Studio
3. Click the **prompt format dropdown** (top of the chat window) → select **ChatML**
   - If ChatML isn't in the list, click "Edit" and set the fields manually:
     - User prefix: `<|im_start|>user\n`
     - User suffix: `<|im_end|>\n`
     - Assistant prefix: `<|im_start|>assistant\n`
     - Assistant suffix: `<|im_end|>\n`
     - System prefix: `<|im_start|>system\n`
     - System suffix: `<|im_end|>\n`
4. In **Model Parameters**, set `repeat_penalty` to `1.1` (prevents repetition loops)
5. Type your question normally; LM Studio handles the ChatML wrapping automatically

---

### Ollama

The included `Modelfile` configures ChatML automatically:

```bash
# Option 1: Run directly
ollama run sandbreak80sd/llm-350m-instruct-v2

# Option 2: Build from Modelfile (for customization)
# Download Modelfile from the Files tab, then:
ollama create llm-350m-v2 -f Modelfile
ollama run llm-350m-v2 "What is the sun?"
```

Ollama handles the ChatML template automatically when using the published model.

---

### llama.cpp (CLI)

```bash
# Download llm-350m-instruct-v2-q4_k_m.gguf, then:
./llama-cli -m llm-350m-instruct-v2-q4_k_m.gguf \
  --prompt "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the sun?<|im_end|>\n<|im_start|>assistant\n" \
  -n 256 --temp 0.7 --top-p 0.9 --repeat-penalty 1.1 \
  --stop "<|im_end|>"
```

The `--stop "<|im_end|>"` flag is required; without it, the model won't know when to stop generating.

---

### Python (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("sandbreak80sd/llm-350m-instruct-v2")
model = AutoModelForCausalLM.from_pretrained(
    "sandbreak80sd/llm-350m-instruct-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Always use ChatML format; do not send raw text
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is the sun?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

---

## Limitations

Same fundamental limitations as V1 (350M parameters trained on a hobbyist budget):

- **Math**: Unreliable beyond simple arithmetic. Do not trust numerical outputs.
- **Code**: May be structurally plausible but semantically wrong. Always verify.
- **Repetition**: Without `repeat_penalty=1.1`, the model can loop. Always set this.
- **Prompt format sensitivity**: Must use ChatML. Raw prompts produce degraded output.
- **No safety alignment**: SFT only, no RLHF or DPO. Not for production use.
- **Knowledge cutoff**: Limited to pretraining data; no real-time information.
- **Context**: 2,048 tokens maximum.

---

## Training Cost

~$12 for V2 finetuning on g6e.xlarge (L40S GPU). Full project cost including pretraining and V1: ~$310.

---

## Citation

```bibtex
@misc{llm-350m-instruct-v2,
  author = {Stoner, Brad},
  title = {LLM-350M-Instruct-V2: A 350M LLM trained from scratch with OpenHermes-2.5},
  year = {2026},
  url = {https://huggingface.co/sandbreak80sd/llm-350m-instruct-v2},
  note = {Training code: https://github.com/sandbreak80/llm-350m}
}
```