Initialize the project; model provided by the ModelHub XC community
Model: sandbreak80sd/llm-350m-instruct-v2 | Source: Original Platform
38
.gitattributes
vendored
Normal file
@@ -0,0 +1,38 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
llm-350m-instruct-v2-f16.gguf filter=lfs diff=lfs merge=lfs -text
llm-350m-instruct-v2-q4_k_m.gguf filter=lfs diff=lfs merge=lfs -text
llm-350m-instruct-v2-q8_0.gguf filter=lfs diff=lfs merge=lfs -text
18
Modelfile
Normal file
@@ -0,0 +1,18 @@
FROM ./llm-350m-instruct-v2-q4_k_m.gguf

# ChatML template (v2, trained on OpenHermes-2.5)
TEMPLATE """<|im_start|>system
{{ if .System }}{{ .System }}{{ else }}You are a helpful assistant.{{ end }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

SYSTEM "You are a helpful assistant."

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 50
PARAMETER num_ctx 2048
214
README.md
Normal file
@@ -0,0 +1,214 @@
---
language:
- en
license: apache-2.0
tags:
- text-generation
- causal-lm
- llama
- gqa
- rope
- swiglu
- from-scratch
- pretraining
- instruction-tuned
- chatml
datasets:
- HuggingFaceFW/fineweb-edu
- teknium/OpenHermes-2.5
metrics:
- perplexity
pipeline_tag: text-generation
---

# LLM-350M-Instruct-V2

This is V2 of a 350M-parameter language model I trained from scratch as a personal learning project. V2 improves on [V1](https://huggingface.co/sandbreak80sd/llm-350m-instruct) by replacing Alpaca-cleaned with OpenHermes-2.5: 200K GPT-4-generated examples in ChatML format instead of 52K GPT-3.5 examples in Alpaca format. Same pretrained base, better finetuning data.

I'm not a researcher. I don't work at a big lab. I just wanted to understand how LLMs actually work by building one. The whole thing ran on a single rented GPU for under $500 total across both versions.

**[V1 model](https://huggingface.co/sandbreak80sd/llm-350m-instruct)** | **[Training code](https://github.com/sandbreak80/llm-350m)** | **[W&B logs](https://wandb.ai/bstoner-riffyx/llm-350m-finetune-v2)**

---

## ⚠️ Important: Prompt Format

**This model requires ChatML format.** If you send raw text without the template, the model won't recognize it as an instruction and will produce poor output. This is the single most common issue people hit, so configure your inference tool before you test.

The format looks like this:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```

Setup instructions for each tool are in the Quick Start section below.
---

## Benchmarks

| Benchmark | V1 (Alpaca) | V2 (OpenHermes) | Δ (pp) |
|---|---|---|---|
| HellaSwag | 38.40% | 37.60% | -0.80 |
| LAMBADA | 34.00% | 35.30% | +1.30 |
| ARC-Easy | 58.20% | 58.40% | +0.20 |
| ARC-Challenge | 27.76% | 25.42% | -2.34 |
| WinoGrande | 52.80% | 52.40% | -0.40 |

Val loss: 1.3704 vs V1's 1.7189 (20.3% lower)

Benchmark deltas are small and mixed. That is expected at 350M scale, and they are sensitive to the prompt-format change (Alpaca → ChatML affects loglikelihood scoring). Val loss is the more reliable signal here.
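Since val loss is the headline number, it may help to restate it as perplexity. This quick conversion is my own addition; the two losses are the ones reported above:

```python
import math

# Validation losses reported above (cross-entropy, nats/token)
loss_v1, loss_v2 = 1.7189, 1.3704

# Perplexity is exp(loss); often easier to interpret than raw loss
ppl_v1 = math.exp(loss_v1)
ppl_v2 = math.exp(loss_v2)

# Relative loss reduction, the "20.3% lower" figure
rel_drop = (loss_v1 - loss_v2) / loss_v1

print(f"V1 ppl={ppl_v1:.2f}  V2 ppl={ppl_v2:.2f}  loss drop={rel_drop:.1%}")
```

So V2's perplexity of roughly 3.9 against V1's roughly 5.6 is what that 20% loss gap buys.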
---

## What Changed from V1

| | V1 | V2 |
|---|---|---|
| Finetune data | yahma/alpaca-cleaned (52K, GPT-3.5) | teknium/OpenHermes-2.5 (200K, GPT-4) |
| Prompt format | Alpaca (`### Instruction:`) | **ChatML** (`<\|im_start\|>`) |
| Learning rate | 2e-5 | 1e-5 |
| Finetune iters | 1,500 | 4,000 |
| Anti-forgetting blend | 2,500 FineWeb samples | 10,000 FineWeb samples |
| Val loss | 1.7189 | **1.3704** |
---

## Architecture

Same as V1, a modern LLaMA-style architecture at 350M parameters:
RoPE positional encoding, RMSNorm (pre-norm), SwiGLU activations, Grouped Query Attention (4 KV / 16 query heads), 2048-token context, Flash Attention 2, tied embeddings.

Full architecture and pretraining details are in the [V1 model card](https://huggingface.co/sandbreak80sd/llm-350m-instruct).
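As a sanity check on the "350M" label, the shipped `config.json` is enough to tally parameters by hand. The per-matrix breakdown below assumes a standard LLaMA-style block layout (q/k/v/o projections, SwiGLU MLP, two RMSNorms per block); it is my sketch, not something the card states:

```python
# Back-of-the-envelope parameter count from the shipped config.json
vocab, d, layers, d_ff = 50304, 1024, 24, 2816
n_heads, n_kv = 16, 4

head_dim = d // n_heads                  # 64
kv_dim = n_kv * head_dim                 # 256 (GQA: 4 KV heads)

attn = 2 * d * d + 2 * d * kv_dim        # q and o are d x d; k and v are d x kv_dim
mlp = 3 * d * d_ff                       # gate, up, down projections (SwiGLU)
norms = 2 * d                            # pre-attention and pre-MLP RMSNorm
per_layer = attn + mlp + norms

# Tied embeddings counted once, plus the final norm
total = vocab * d + layers * per_layer + d
print(f"~{total / 1e6:.0f}M parameters")
```

This lands around 322M, in the ballpark of the rounded 350M name.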
---

## Quick Start

### Prompt Format (ChatML)

All requests must be wrapped in ChatML tags. Raw questions without the template will not work correctly.

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```
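If you are building the prompt string yourself (for a raw completion API, say), a tiny helper keeps the tags consistent. The function is my own sketch, not part of the released code:

```python
def chatml_prompt(user: str, system: str = "You are a helpful assistant.") -> str:
    """Wrap a user message in the ChatML template shown above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_prompt("What is the sun?"))
```

The trailing `<|im_start|>assistant\n` is deliberate: it leaves the model positioned to generate the assistant turn.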
---

### LM Studio

**This is the most important setup step; skip it and the model will produce bad output.**

1. Download `llm-350m-instruct-v2-q4_k_m.gguf` from the Files tab (~197MB)
2. Load it in LM Studio
3. Click the **prompt format dropdown** (top of the chat window) → select **ChatML**
   - If ChatML isn't in the list, click "Edit" and set manually:
     - User prefix: `<|im_start|>user\n`
     - User suffix: `<|im_end|>\n`
     - Assistant prefix: `<|im_start|>assistant\n`
     - Assistant suffix: `<|im_end|>\n`
     - System prefix: `<|im_start|>system\n`
     - System suffix: `<|im_end|>\n`
4. In **Model Parameters**, set `repeat_penalty` to `1.1` (prevents repetition loops)
5. Type your question normally; LM Studio handles the wrapping automatically
---

### Ollama

The included `Modelfile` configures ChatML automatically:

```bash
# Option 1: Run directly
ollama run sandbreak80sd/llm-350m-instruct-v2

# Option 2: Build from Modelfile (for customization)
# Download Modelfile from the Files tab, then:
ollama create llm-350m-v2 -f Modelfile
ollama run llm-350m-v2 "What is the sun?"
```

Ollama handles the ChatML template automatically when using the published model.
---

### llama.cpp (CLI)

```bash
# Download llm-350m-instruct-v2-q4_k_m.gguf, then:
./llama-cli -m llm-350m-instruct-v2-q4_k_m.gguf \
  --prompt "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the sun?<|im_end|>\n<|im_start|>assistant\n" \
  -n 256 --temp 0.7 --top-p 0.9 --repeat-penalty 1.1 \
  --stop "<|im_end|>"
```

The `--stop "<|im_end|>"` flag is required; without it the model won't know when to stop generating.
---

### Python (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("sandbreak80sd/llm-350m-instruct-v2")
model = AutoModelForCausalLM.from_pretrained(
    "sandbreak80sd/llm-350m-instruct-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Always use ChatML format; do not send raw text
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the sun?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
---

## Limitations

Same fundamental limitations as V1, since this is still 350M parameters trained on a hobbyist budget:

- **Math**: Unreliable beyond simple arithmetic. Do not trust numerical outputs.
- **Code**: May be structurally plausible but semantically wrong. Always verify.
- **Repetition**: Without `repeat_penalty=1.1`, the model can loop. Always set this.
- **Prompt format sensitivity**: Must use ChatML. Raw prompts produce degraded output.
- **No safety alignment**: SFT only, no RLHF or DPO. Not for production use.
- **Knowledge cutoff**: Limited to pretraining data; no real-time information.
- **Context**: 2,048 tokens maximum.
---

## Training Cost

~$12 for V2 finetuning on g6e.xlarge (L40S GPU). Full project cost including pretraining and V1: ~$310.
---

## Citation

```bibtex
@misc{llm-350m-instruct-v2,
  author = {Stoner, Brad},
  title = {LLM-350M-Instruct-V2: A 350M LLM trained from scratch with OpenHermes-2.5},
  year = {2026},
  url = {https://huggingface.co/sandbreak80sd/llm-350m-instruct-v2},
  note = {Training code: https://github.com/sandbreak80/llm-350m}
}
```
23
config.json
Normal file
@@ -0,0 +1,23 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "model_type": "llama",
  "hidden_size": 1024,
  "intermediate_size": 2816,
  "num_hidden_layers": 24,
  "num_attention_heads": 16,
  "num_key_value_heads": 4,
  "hidden_act": "silu",
  "max_position_embeddings": 2048,
  "initializer_range": 0.02,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0",
  "vocab_size": 50304,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "pad_token_id": 50256
}
3
llm-350m-instruct-v2-f16.gguf
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fd70a60a8cfe9b05ee0056da96035861dd123167efb7942fc963351e97f64383
size 646069824
3
llm-350m-instruct-v2-q4_k_m.gguf
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6dd168b9f23ac7752f3492c2b688e0000fb65a0f60fefe542a10cee2100e8a26
size 206144064
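These three-line stubs are Git LFS pointer files; the actual weights live in LFS storage. A minimal parser (my own sketch, not part of the repo) shows what they encode, using the q4_k_m pointer above:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into version, hash algorithm, oid, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:6dd168b9f23ac7752f3492c2b688e0000fb65a0f60fefe542a10cee2100e8a26
size 206144064
"""
info = parse_lfs_pointer(pointer)
print(f"{info['algo']} {info['oid'][:12]}...  {info['size'] / 1e6:.0f} MB")
```

Note that 206,144,064 bytes is about 206 MB, or 196.6 MiB, which matches the "~197MB" quoted in the README.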
3
llm-350m-instruct-v2-q8_0.gguf
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a09cb693ae2e1827c6d38a7bd1b42c3ba47af8c042a5ba3b57e5259144bbfdcc
size 344153664
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1b5f50b6cc60ed540bc16e7962d19da57217aae7dfaec306a75250badb763bb2
size 1288401000
250306
tokenizer.json
Normal file
File diff suppressed because it is too large
12
tokenizer_config.json
Normal file
@@ -0,0 +1,12 @@
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "is_local": false,
  "model_max_length": 1024,
  "pad_token": null,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}