---
language:
- en
license: apache-2.0
tags:
- text-generation
- causal-lm
- llama
- gqa
- rope
- swiglu
- from-scratch
- pretraining
- instruction-tuned
- chatml
datasets:
- HuggingFaceFW/fineweb-edu
- teknium/OpenHermes-2.5
metrics:
- perplexity
pipeline_tag: text-generation
---

# LLM-350M-Instruct-V2

This is V2 of a 350M-parameter language model I trained from scratch as a personal learning project. V2 improves on [V1](https://huggingface.co/sandbreak80sd/llm-350m-instruct) by replacing Alpaca-cleaned with OpenHermes-2.5: 200K GPT-4-generated examples in ChatML format instead of 52K GPT-3.5 examples in Alpaca format. Same pretrained base, better finetuning data.

I'm not a researcher. I don't work at a big lab. I just wanted to understand how LLMs actually work by building one. The whole thing ran on a single rented GPU for roughly $310 total across both versions (see Training Cost below).

**[V1 model](https://huggingface.co/sandbreak80sd/llm-350m-instruct)** | **[Training code](https://github.com/sandbreak80/llm-350m)** | **[W&B logs](https://wandb.ai/bstoner-riffyx/llm-350m-finetune-v2)**

---

## ⚠️ Important: Prompt Format

**This model requires ChatML format.** If you send raw text without the template, the model won't recognize it as an instruction and will produce poor output. This is the single most common issue people hit, so configure your inference tool before you test.

The format looks like this:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```

Setup instructions for each tool are in the Quick Start section below.

---

## Benchmarks

| Benchmark | V1 (Alpaca) | V2 (OpenHermes) | Δ |
|---|---|---|---|
| HellaSwag | 38.40% | 37.60% | -0.80% |
| LAMBADA | 34.00% | 35.30% | +1.30% |
| ARC-Easy | 58.20% | 58.40% | +0.20% |
| ARC-Challenge | 27.76% | 25.42% | -2.34% |
| WinoGrande | 52.80% | 52.40% | -0.40% |

Val loss: 1.3704 vs V1's 1.7189 (20.3% lower)

Benchmark deltas are small and mixed, which is expected at 350M scale, and they are sensitive to the prompt-format change (Alpaca → ChatML affects loglikelihood scoring). Val loss is the more reliable signal here.
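
Since the card lists perplexity as its metric, the val losses above translate directly (assuming, as is standard, that they are mean per-token cross-entropy in nats): perplexity is just `exp(loss)`.

```python
import math

# Validation losses reported above (assumed: mean per-token cross-entropy in nats)
v1_loss = 1.7189
v2_loss = 1.3704

# Perplexity is exp(cross-entropy)
v1_ppl = math.exp(v1_loss)  # ≈ 5.58
v2_ppl = math.exp(v2_loss)  # ≈ 3.94

print(f"V1 perplexity: {v1_ppl:.2f}")
print(f"V2 perplexity: {v2_ppl:.2f}")
print(f"Relative loss reduction: {(v1_loss - v2_loss) / v1_loss:.1%}")  # 20.3%
```

So the 20.3% loss drop corresponds to validation perplexity falling from roughly 5.6 to 3.9.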

---

## What Changed from V1

| | V1 | V2 |
|---|---|---|
| Finetune data | yahma/alpaca-cleaned (52K, GPT-3.5) | teknium/OpenHermes-2.5 (200K, GPT-4) |
| Prompt format | Alpaca (`### Instruction:`) | **ChatML** (`<\|im_start\|>`) |
| Learning rate | 2e-5 | 1e-5 |
| Finetune iters | 1,500 | 4,000 |
| Anti-forgetting blend | 2,500 FineWeb samples | 10,000 FineWeb samples |
| Val loss | 1.7189 | **1.3704** |

---

## Architecture

Same as V1: a modern LLaMA-style architecture at 350M parameters. RoPE positional encoding, RMSNorm (pre-norm), SwiGLU activations, Grouped Query Attention (4 KV / 16 query heads), 2,048-token context, Flash Attention 2, tied embeddings.

Full architecture and pretraining details are in the [V1 model card](https://huggingface.co/sandbreak80sd/llm-350m-instruct).
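
For intuition on the GQA setting, here is a back-of-envelope KV-cache comparison. The 16-query / 4-KV head split and 2,048-token context come from this card; `head_dim`, `n_layers`, and bf16 storage are illustrative assumptions, not published figures.

```python
# KV-cache sizing sketch for Grouped Query Attention (GQA).
# Head split and context length are from the model card;
# head_dim, n_layers, and dtype size are assumed for illustration.
n_query_heads = 16
n_kv_heads = 4
head_dim = 64        # assumed
n_layers = 24        # assumed
bytes_per_elem = 2   # bf16
context = 2048       # from the card

def kv_cache_bytes(n_heads):
    # Two tensors (K and V) per layer, each [context, n_heads, head_dim]
    return 2 * n_layers * context * n_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(n_query_heads)  # full multi-head attention baseline
gqa = kv_cache_bytes(n_kv_heads)     # with 4 shared KV heads

print(f"MHA KV cache: {mha / 2**20:.0f} MiB")  # 192 MiB
print(f"GQA KV cache: {gqa / 2**20:.0f} MiB")  # 48 MiB
print(f"Reduction: {mha // gqa}x")             # 4x
```

Whatever the real hidden dimensions are, the cache shrinks by exactly the head ratio: 16/4 = 4×.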

---

## Quick Start

### Prompt Format (ChatML)

All requests must be wrapped in ChatML tags. Raw questions without the template will not work correctly.

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```
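
If you are scripting against the raw model, a tiny helper (my own sketch, not part of the released code) keeps the wrapping consistent:

```python
def chatml_prompt(user_msg, system_msg="You are a helpful assistant."):
    """Wrap a question in the ChatML template this model expects."""
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_prompt("What is the sun?"))
```

The trailing `<|im_start|>assistant\n` is deliberate: it leaves the prompt open so the model generates the assistant turn.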

---

### LM Studio

**This is the most important setup step: skip it and the model will produce bad output.**

1. Download `llm-350m-instruct-v2-q4_k_m.gguf` from the Files tab (~197MB)
2. Load it in LM Studio
3. Click the **prompt format dropdown** (top of the chat window) → select **ChatML**
   - If ChatML isn't in the list, click "Edit" and set manually:
     - User prefix: `<|im_start|>user\n`
     - User suffix: `<|im_end|>\n`
     - Assistant prefix: `<|im_start|>assistant\n`
     - Assistant suffix: `<|im_end|>\n`
     - System prefix: `<|im_start|>system\n`
     - System suffix: `<|im_end|>\n`
4. In **Model Parameters**, set `repeat_penalty` to `1.1` (prevents repetition loops)
5. Now type your question normally; LM Studio handles the wrapping automatically

---

### Ollama

The included `Modelfile` configures ChatML automatically:

```bash
# Option 1: Run directly
ollama run sandbreak80sd/llm-350m-instruct-v2

# Option 2: Build from Modelfile (for customization)
# Download Modelfile from the Files tab, then:
ollama create llm-350m-v2 -f Modelfile
ollama run llm-350m-v2 "What is the sun?"
```

Ollama handles the ChatML template automatically when using the published model.

---

### llama.cpp (CLI)

```bash
# Download llm-350m-instruct-v2-q4_k_m.gguf, then:
./llama-cli -m llm-350m-instruct-v2-q4_k_m.gguf \
  --prompt "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the sun?<|im_end|>\n<|im_start|>assistant\n" \
  -n 256 --temp 0.7 --top-p 0.9 --repeat-penalty 1.1 \
  --stop "<|im_end|>"
```

The `--stop "<|im_end|>"` flag is required; without it the model won't know when to stop generating.

---

### Python (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("sandbreak80sd/llm-350m-instruct-v2")
model = AutoModelForCausalLM.from_pretrained(
    "sandbreak80sd/llm-350m-instruct-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Always use ChatML format; do not send raw text
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the sun?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

---

## Limitations

Same fundamental limitations as V1 (350M parameters trained on a hobbyist budget):

- **Math**: Unreliable beyond simple arithmetic. Do not trust numerical outputs.
- **Code**: May be structurally plausible but semantically wrong. Always verify.
- **Repetition**: Without `repeat_penalty=1.1`, the model can loop. Always set this.
- **Prompt format sensitivity**: Must use ChatML. Raw prompts produce degraded output.
- **No safety alignment**: SFT only, no RLHF or DPO. Not for production use.
- **Knowledge cutoff**: Limited to pretraining data; no real-time information.
- **Context**: 2,048 tokens maximum.

---

## Training Cost

~$12 for V2 finetuning on a g6e.xlarge (L40S GPU). Full project cost including pretraining and V1: ~$310.

---

## Citation

```bibtex
@misc{llm-350m-instruct-v2,
  author = {Stoner, Brad},
  title  = {LLM-350M-Instruct-V2: A 350M LLM trained from scratch with OpenHermes-2.5},
  year   = {2026},
  url    = {https://huggingface.co/sandbreak80sd/llm-350m-instruct-v2},
  note   = {Training code: https://github.com/sandbreak80/llm-350m}
}
```