Initialize project; model provided by the ModelHub XC community

Model: StentorLabs/Stentor-30M Source: Original Platform
2026-05-29 17:12:20 +08:00
commit c9fae039cc
10 changed files with 276577 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,37 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+training_loss.png filter=lfs diff=lfs merge=lfs -text
+training_perplexity.png filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,655 @@
+---
+language:
+- en
+license: apache-2.0
+library_name: transformers
+tags:
+- text-generation
+- llama
+- small-language-model
+- efficient
+- edge-deployment
+- speculative-decoding
+- tiny-model
+- 30m-parameters
+- kaggle-trained
+- educational
+- research
+- low-resource
+- cpu-inference
+- mobile-deployment
+- synthetic-data
+- fineweb
+- cosmopedia
+pipeline_tag: text-generation
+datasets:
+- HuggingFaceFW/fineweb-edu
+- HuggingFaceTB/smollm-corpus
+widget:
+- text: "Once upon a time"
+  example_title: "Story Generation"
+- text: "Explain neural networks in simple terms."
+  example_title: "Toy Explanation (Often Wrong)"
+- text: "def fibonacci(n):"
+  example_title: "Code Continuation"
+- text: "[INST]What is machine learning?[/INST]"
+  example_title: "Instruction-Style Prompt (Not Tuned)"
+model_card_authors:
+- StentorLabs
+model-index:
+- name: Stentor-30M
+  results:
+  - task:
+      type: text-generation
+    dataset:
+      name: FineWeb-Edu + Cosmopedia v2 (validation split)
+      type: mixed
+    metrics:
+    - name: Validation Loss
+      type: loss
+      value: 3.4971
+    - name: Perplexity
+      type: perplexity
+      value: 33.02
+---
+
+# Stentor-30M
+
+![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)
+![Model Size](https://img.shields.io/badge/parameters-30M-green.svg)
+![Training Time](https://img.shields.io/badge/training-7.88h-orange.svg)
+![Hardware](https://img.shields.io/badge/hardware-1x%20Tesla%20T4-red.svg)
+![Context Length](https://img.shields.io/badge/context-512%20tokens-purple.svg)
+[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow.svg)](https://huggingface.co/StentorLabs/Stentor-30M)
+[![GGUF](https://img.shields.io/badge/GGUF-mradermacher-blue.svg)](https://huggingface.co/mradermacher/Stentor-30M-GGUF)
+
+Stentor-30M is a compact language model built on the Llama architecture and designed for speed in low-resource environments. This ~30.4M-parameter checkpoint was trained with fp16 mixed precision and is best treated as a **base next-token predictor** (not a chat assistant). It does not "understand" text in a human sense and is not trained to reliably follow instructions. While the tokenizer may include special tokens/templates that resemble instruction or tool formats, the model itself is **not instruction-tuned** and will often generate **plausible but off-topic** text. It serves as an accessible entry point for researching attention mechanisms and testing training pipelines on consumer hardware.
+
+> ⚠️ **Important Limitations**
+> 
+> - **Context Window:** Maximum 512 tokens (very short)
+> - **Not Instruction-Tuned:** May ignore prompts or respond off-topic
+> - **Stopping / EOS:** The model rarely emits EOS on its own; always set `max_new_tokens`
+> - **Tokenizer ≠ Capability:** "tool/function" tokens do not imply real tool use
+> - **No Safety Tuning:** Base model without RLHF or safety alignment
+> - **Limited Knowledge:** 30M parameters = limited world knowledge
+> - **Proof-of-Concept:** Not suitable for production without fine-tuning
+> - **Educational Focus:** Trained on educational web text and synthetic textbooks, not diverse real-world data
+
+Recommended generation settings (based on manual testing):
+
+- **Max new tokens:** 10-60
+- **Temperature:** 1.1-1.4
+- **Top-p:** 0.35-0.75
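To make the temperature and top-p settings above concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration only, not the Transformers implementation; the function names and the four-token `toy_logits` are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=1.2, top_p=0.55):
    """Return {token_index: probability} for the nucleus kept by top-p sampling."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=1.2, top_p=0.55, rng=random):
    """Draw one token index from the filtered distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
print(top_p_filter(toy_logits, temperature=1.2, top_p=0.7))
print(sample(toy_logits))
```

A higher temperature flattens the distribution before filtering, while a lower top-p shrinks the set of candidate tokens, which is why the two settings are usually tuned together.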
+
+Real interactions (sampling is non-deterministic; your outputs may vary):
+
+```text
+Max New Tokens: 30
+Temp: 1.2
+Top p: 0.55
+User:
+The story of my life is
+Generated text:
+The story of my life is a tale of the story of the man who has been born in Germany. He was the first to learn about his family, and his story of the
+```
+
+```text
+Max New Tokens: 30
+Temp: 1.2
+Top p: 0.7
+User:
+Biology is the understanding of
+Generated text:
+Biology is the understanding of nature and animals, not only as a model for biological research but also as a tool for understanding human behavior and conservation. Biological research is about understanding
+```
+
+```text
+Max New Tokens: 30
+Temp: 1.2
+Top p: 0.7
+User:
+Everyone is dead
+Generated text:
+Everyone is dead: 50 percent of our people will be killed in the coming days of our nation. 60 percent of us will live and go in
+```
+
+---
+
+## 🚀 Quick Start
+
+Get up and running in 3 simple steps:
+
+### 1. Install
+```bash
+pip install transformers torch
+```
+
+### 2. Load & Generate
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")
+tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M")
+
+prompt = "The future of AI is"
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=50,  # always set this; the model may not stop on its own
+    do_sample=True,
+    temperature=1.1,
+    top_p=0.55,
+)
+
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+### 3. Explore!
+- Try different prompts
+- Adjust `max_new_tokens`, `temperature`, and `top_p`
+
+---
+
+## 📦 Quantized Versions
+
+Pre-quantized versions of Stentor-30M are available for use with llama.cpp, LM Studio, Ollama, and other compatible runtimes — no conversion needed.
+
+| Format | Provider | Link |
+|--------|----------|------|
+| GGUF (multiple quants) | mradermacher | [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) |
+
+Just download your preferred quantization (e.g. `Q4_K_M` for a good size/quality balance) and run it directly with llama.cpp or load it in LM Studio.
+
+---
+
+## Model Details
+
+### Model Description
+
+Stentor-30M is a lightweight LlamaForCausalLM model designed to bring the architectural benefits of Llama to a fraction of the size. With a hidden size of 256 and a compact parameter budget, this model is optimized for rapid inference and edge-deployment scenarios where memory is at a premium.
+
+The tokenizer configuration may include control tokens commonly used in instruction/tool-call formatting (for experimentation), but **these tokens do not make the base model instruction-following or tool-using**. If you need reliable instruction following or structured tool calls, you will need additional fine-tuning / alignment.
+
+- **Developed by:** Kai Izumoto (StentorLabs)
+- **Funded by:** Self-funded
+- **Shared by:** StentorLabs
+- **Model type:** LlamaForCausalLM (Auto-regressive Language Model)
+- **Language(s):** English
+- **License:** Apache-2.0
+- **Finetuned from model:** None (Base model trained from scratch)
+
+## Uses
+
+### Direct Use
+
+- **Low-Latency Text Generation:** Due to its compact size (approx. 30.4M parameters), Stentor-30M is suitable for real-time applications on CPU or mobile devices.
+- **Instruction-Style Prompting (Limited):** You can *format* prompts using tags like `[INST]`, but the model is **not** instruction-tuned and will often fail to follow the request.
+- **Tool-Call Formatting Tokens (Limited):** The tokenizer may include tool-related tokens, but the model is **not** trained to reliably emit valid tool calls/JSON or to "use tools".
+- **Edge Deployment:** Ideal for resource-constrained environments including mobile devices, IoT, and embedded systems.
+
+### Downstream Use
+
+- **Speculative Decoding (Experimental):** Stentor-30M can be used as a fast draft model for larger Llama-based models, but speedups depend on how often the larger model accepts the draft tokens (quality limits may reduce gains).
+- **Educational/Research:** A perfect "petri dish" model for studying attention mechanics (4 attention heads) and training dynamics without requiring massive compute.
+- **Prototyping:** Quick, low-cost experiments focused on latency, sampling behavior, and failure modes before scaling up.
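As a rough guide to how acceptance rate limits speculative-decoding gains, the sketch below evaluates the expected-tokens formula from Leviathan et al. (2023), which is listed under Research Papers. It assumes each draft token is accepted independently with the same probability; the acceptance rates are illustrative values, not measurements of Stentor-30M, and the function name is hypothetical.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_len):
    """Expected tokens produced per target-model forward pass.

    Uses (1 - a^(k+1)) / (1 - a), where a is the per-token acceptance
    rate and k is the number of draft tokens proposed per step.
    """
    a = acceptance_rate
    if a >= 1.0:
        # Every draft token is accepted, plus one token from the target.
        return float(draft_len + 1)
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)


for a in (0.2, 0.5, 0.8):  # illustrative acceptance rates
    print(a, round(expected_tokens_per_target_pass(a, draft_len=4), 2))
```

With a low acceptance rate the value stays close to 1, meaning the draft model adds overhead without saving target-model passes.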
+
+### Out-of-Scope Use
+
+- **Complex Reasoning:** As a 30M parameter model, users should not expect high-level reasoning or deep knowledge retrieval comparable to multi-billion parameter models.
+- **Instruction-Following Chatbots:** This is a base model and is not reliably conversational or on-task.
+- **Long Context:** The model is optimized for short-context tasks with a maximum position embedding of 512 tokens.
+- **Production-Critical Applications:** This is a research/proof-of-concept model and should not be used for mission-critical applications without thorough testing.
+
+## Bias, Risks, and Limitations
+
+- **Context Window:** The model has a hard limit of 512 tokens for context length.
+- **Prompt Relevance:** Outputs are often generic or unrelated to the prompt, even when they sound fluent.
+- **Knowledge Base:** Limited parameter count restricts the amount of world knowledge the model can store.
+- **Training Data Bias:** The model inherits any biases present in the FineWeb-Edu and Cosmopedia v2 datasets.
+- **Hallucinations:** Like all language models, Stentor-30M may generate plausible-sounding but factually incorrect information.
+- **No Safety Tuning:** This is a base model without safety alignment or RLHF.
+
+### Recommendations
+
+Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. This model is best used for specific, narrow tasks or as a component in a larger system (e.g., speculative decoding) rather than a general-purpose assistant.
+
+## How to Get Started with the Model
+
+### Basic Usage
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "StentorLabs/Stentor-30M"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id)
+
+# The repo may provide a chat template, but this is still a base model.
+# Do not expect reliable instruction following just because you use chat formatting.
+messages = [
+    {"role": "user", "content": "Hello, what are you?"}
+]
+
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    return_dict=True,  # return input_ids and attention_mask together
+    return_tensors="pt",
+)
+outputs = model.generate(**inputs, max_new_tokens=50)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+### Advanced Usage with Tool-Call Formatting (Educational)
+
+```python
+# The tokenizer may include tokens that resemble tool/function calling formats.
+# The base model is not trained to reliably emit valid tool calls or structured JSON.
+messages = [
+    {"role": "system", "content": "You are a tiny base language model. You do not have tool access."},
+    {"role": "user", "content": "What's the weather like?"}
+]
+
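+# Reuses `tokenizer` and `model` from Basic Usage.
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt",
+)
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))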
+```
+
+## Detailed Use Cases
+
+### 1. Speculative Decoding with Llama 3
+
+Potentially speed up larger model inference by using Stentor-30M as a draft model (results vary):
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Load draft model (Stentor-30M)
+draft_model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")
+draft_tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M")
+
+# Load target model
+target_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
+target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
+
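+# Stentor-30M (32,768-token vocabulary) and Llama 3.2 use different tokenizers,
+# so both tokenizers are passed alongside `assistant_model`
+# (requires a recent Transformers version with universal assisted decoding).
+prompt = "Explain machine learning"
+inputs = target_tokenizer(prompt, return_tensors="pt")
+
+outputs = target_model.generate(
+    **inputs,
+    assistant_model=draft_model,  # Stentor-30M as draft
+    tokenizer=target_tokenizer,
+    assistant_tokenizer=draft_tokenizer,
+    do_sample=True,
+    max_new_tokens=100
+)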
+
+print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+### 2. Run with llama.cpp / LM Studio / Ollama (GGUF)
+
+Pre-quantized GGUF files are available at [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) — no conversion required.
+
+```bash
+# Download a quantized GGUF (e.g. Q4_K_M) from the link above, then run with llama.cpp:
+./llama-cli -m stentor-30m-Q4_K_M.gguf -p "Hello world" -n 50
+```
+
+Or simply load the `.gguf` file directly in **LM Studio** or **Ollama** for a GUI/API experience.
+
+### 3. Edge Deployment with ONNX
+
+Convert to ONNX for mobile/edge deployment:
+
+```bash
+# Install dependencies
+pip install "optimum[exporters,onnxruntime]"
+
+# Export to ONNX
+optimum-cli export onnx \
+  --model StentorLabs/Stentor-30M \
+  --task text-generation-with-past \
+  stentor-30m-onnx/
+```
+
+```python
+# Use with ONNX Runtime
+from optimum.onnxruntime import ORTModelForCausalLM
+from transformers import AutoTokenizer
+
+model = ORTModelForCausalLM.from_pretrained("stentor-30m-onnx")
+tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M")
+
+inputs = tokenizer("Hello world", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=20)
+print(tokenizer.decode(outputs[0]))
+```
+
+### 4. Rapid Prototyping
+
+Quick experimentation before scaling:
+
+```python
+# These "tasks" are intentionally broad: this tiny base model will often fail.
+# The point is to observe latency, failure modes, and sampling behavior.
+from transformers import pipeline
+
+generator = pipeline("text-generation", model="StentorLabs/Stentor-30M")
+
+test_prompts = [
+    "Summarize this: [long text]",
+    "Translate to French: Hello",
+    "Answer: What is 2+2?"
+]
+
+for prompt in test_prompts:
+    result = generator(prompt, max_new_tokens=30)[0]['generated_text']
+    print(f"Prompt: {prompt}\nResult: {result}\n")
+```
+
+## Quantize It Yourself
+
+If you want to produce your own quantized versions rather than using the pre-built GGUFs:
+
+### 8-bit Quantization (bitsandbytes)
+
+```python
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "StentorLabs/Stentor-30M",
+    quantization_config=quantization_config,
+    device_map="auto"
+)
+# Memory: ~30 MB (~50% reduction from fp16 weights)
+```
+
+### 4-bit Quantization (bitsandbytes)
+
+```python
+quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "StentorLabs/Stentor-30M",
+    quantization_config=quantization_config,
+    device_map="auto"
+)
+# Memory: ~15 MB (~75% reduction from fp16 weights)
+```
+
+**Note:** Requires `bitsandbytes` library: `pip install bitsandbytes`
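The memory figures quoted in the comments above follow from the parameter count and the bits stored per weight. The sketch below reproduces that arithmetic; the helper name is hypothetical. It assumes every weight is stored at the stated width, which is optimistic: bitsandbytes typically leaves embeddings and norm weights unquantized and stores extra scaling metadata, so real usage will be higher.

```python
TOTAL_PARAMS = 30_419_712  # "Total Parameters" from the model card

BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "4-bit": 4}


def weight_megabytes(num_params, bits):
    """Raw weight storage in MB (10^6 bytes), ignoring quantization metadata."""
    return num_params * bits / 8 / 1e6


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: {weight_megabytes(TOTAL_PARAMS, bits):6.1f} MB")
```

The fp32 estimate (about 121.7 MB) is consistent with the 121,699,864-byte `model.safetensors` file in this commit.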
+
+### Convert to GGUF Manually
+
+```bash
+# Clone llama.cpp
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Download model
+huggingface-cli download StentorLabs/Stentor-30M --local-dir stentor-30m
+
+# Convert to GGUF
+python convert_hf_to_gguf.py stentor-30m/ \
+  --outfile stentor-30m.gguf \
+  --outtype f16
+
+# Quantize (optional)
+./llama-quantize stentor-30m.gguf stentor-30m-q4_0.gguf q4_0
+```
+
+### Convert to TensorFlow Lite (Mobile)
+
+```bash
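+# Install dependencies
+pip install tensorflow onnx2tf
+
+# First convert to ONNX (see above).
+# tf2onnx converts TensorFlow/TFLite models *to* ONNX, so the reverse direction
+# needs a different tool. onnx2tf is one option (not verified with this model);
+# it writes a SavedModel plus .tflite files into the output folder.
+onnx2tf -i stentor-30m-onnx/model.onnx -o stentor-30m-tflite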
+```
+
+**Format summary:**
+- **GGUF:** C++ applications, llama.cpp, LM Studio, Ollama — [pre-built available](https://huggingface.co/mradermacher/Stentor-30M-GGUF)
+- **ONNX:** Cross-platform (Windows/Linux/Mac/Web)
+- **TFLite:** Android/iOS mobile apps
+
+---
+
+## Training Details
+
+### Training Data
+
+The model was trained on a high-quality mixed dataset focused on educational content and synthetic textbook data:
+
+- **FineWeb-Edu** ([HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)): A dataset filtered for educational quality.
+- **Cosmopedia v2** ([HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)): A corpus of synthetic textbooks and stories.
+
+**Total tokens processed:** 600,000,512 tokens
+
+### Training Procedure
+
+The model was trained using a custom script in a Kaggle Jupyter environment, demonstrating the accessibility of training efficient models on free-tier compute.
+
+#### Preprocessing
+
+The training pipeline utilized lightweight but effective preprocessing steps:
+
+- **Cleaning:** Unicode normalization (NFKC) and whitespace stripping/normalization.
+- **Formatting:** Optional wrapping for chat formats or `<think>` tokens.
+- **Packing:** Sequence packing into fixed block_size chunks to maximize training efficiency.
+- **Tokenization:** Standard Llama tokenization with EOS tokens appended.
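The preprocessing steps above can be sketched in plain Python. The training script itself is not included in this commit, so this is an approximation under stated assumptions: `toy_tokenize` is a hypothetical stand-in for the real tokenizer, `BLOCK_SIZE` is shrunk from 512 to 8 for readability, and dropping the incomplete final block is an assumed policy.

```python
import unicodedata

EOS_ID = 2       # eos_token_id in config.json
BLOCK_SIZE = 8   # tiny value for illustration; training used 512


def clean(text):
    """NFKC-normalize and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def toy_tokenize(text):
    """Hypothetical stand-in for the real tokenizer: one id per word."""
    return [3 + (sum(map(ord, word)) % 1000) for word in text.split()]


def pack(documents, block_size=BLOCK_SIZE):
    """Concatenate tokenized documents (each ending in EOS) into full blocks."""
    stream = []
    for doc in documents:
        stream.extend(toy_tokenize(clean(doc)) + [EOS_ID])
    usable = len(stream) - len(stream) % block_size  # drop the ragged tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


docs = ["Ｈｅｌｌｏ   world", "Packing  keeps every\tblock the same length ."]
blocks = pack(docs)
print(clean(docs[0]))
print(blocks)
```

Packing avoids padding tokens entirely: every position in every block carries a real token, and the EOS id marks where one document ends and the next begins.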
+
+#### Training Hyperparameters
+
+<details>
+<summary><b>Click to view full training configuration</b></summary>
+
+| Hyperparameter | Value |
+|----------------|-------|
+| Precision | fp16 mixed precision |
+| Optimizer | AdamW |
+| Scheduler | Cosine |
+| Learning Rate | 0.0008 |
+| Weight Decay | 0.01 |
+| Warmup Ratio | 0.02 |
+| Stable Ratio | 0.8 |
+| Total Batch Size | 256 |
+| Max Train Steps | 4,578 |
+| Evaluation Steps | 100 |
+| Gradient Accumulation | 64 |
+
+</details>
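The batch-related values in the table can be cross-checked with a few lines of arithmetic. This sketch assumes "Total Batch Size" counts sequences per optimizer step on a single process, and it uses the 512-token sequence length and the total token count reported elsewhere in this model card.

```python
import math

TOTAL_BATCH_SIZE = 256       # sequences per optimizer step (assumed meaning)
GRAD_ACCUM_STEPS = 64
SEQ_LEN = 512
TOKEN_BUDGET = 600_000_512   # "Total tokens processed" in the model card

# Sequences per forward/backward pass before gradients are accumulated.
micro_batch = TOTAL_BATCH_SIZE // GRAD_ACCUM_STEPS
tokens_per_step = TOTAL_BATCH_SIZE * SEQ_LEN
steps_needed = math.ceil(TOKEN_BUDGET / tokens_per_step)

print(micro_batch, tokens_per_step, steps_needed)  # 4 131072 4578
```

Under these assumptions the result matches the 4,578 "Max Train Steps" in the table, with a micro-batch of 4 sequences small enough to fit on a single T4.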
+
+#### Speeds, Sizes, Times
+
+- **Training Time:** 28,367.5 seconds (~7.88 hours)
+- **Hardware:** 1x Tesla T4 (`num_processes: 1`)
+- **Vocab Size:** 32,768 (padded to multiple of 128)
+- **Sequence Length:** 512 tokens
+- **Tokens per Second (avg):** ~21,137 TPS
+- **Total Parameters:** 30,419,712
+- **Embedding Parameters:** 8,388,608 (27.6% of total)
+
+> **Note:** A significant portion of parameters are allocated to embeddings due to the 32K vocabulary size. For future iterations, a smaller vocabulary (8K-16K) could free up capacity for additional model layers.
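The parameter totals above can be reproduced from the architecture values in `config.json`. The sketch assumes the standard Llama layer layout (bias-free q/k/v/o projections, a gated MLP with three matrices, two RMSNorm weights per layer, one final RMSNorm) and tied input/output embeddings.

```python
HIDDEN = 256
INTERMEDIATE = 1024
LAYERS = 21
VOCAB = 32_768

embedding = VOCAB * HIDDEN          # shared with the output head (tied)
attention = 4 * HIDDEN * HIDDEN     # q, k, v, o projections, no biases
mlp = 3 * HIDDEN * INTERMEDIATE     # gate, up, down projections
norms = 2 * HIDDEN                  # two RMSNorm weights per layer
per_layer = attention + mlp + norms

total = embedding + LAYERS * per_layer + HIDDEN  # + final RMSNorm
print(total, f"{embedding / total:.1%}")  # 30419712 27.6%
```

Each layer costs about 1.05M parameters, so halving the vocabulary to 16K would free roughly 4.2M parameters, enough for about four more layers at this width.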
+
+---
+
+## Evaluation
+
+### Testing Data, Factors & Metrics
+
+#### Testing Data
+
+Evaluation was performed on a held-out validation split of the mixed FineWeb-Edu and Cosmopedia dataset.
+
+#### Metrics
+
+- **Validation Loss:** Measures how well the model predicts the next token (lower is better).
+- **Perplexity (PPL):** The exponential of the loss, indicating how "surprised" the model is by new text (lower is better).
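The relationship between the two metrics can be checked directly against the reported numbers. This sketch assumes the loss is the mean per-token cross-entropy in nats (natural logarithm), which is the usual convention in PyTorch training loops.

```python
import math


def perplexity(mean_nll):
    """Perplexity for a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)


print(round(perplexity(3.4971), 2))  # best eval loss  -> 33.02
print(round(perplexity(3.4975), 2))  # final eval loss -> 33.03
```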
+
+### Results
+
+![Training Loss Curve](training_loss.png)
+![Training Perplexity Curve](training_perplexity.png)
+
+| Metric | Value |
+|--------|-------|
+| **Validation Loss** | 3.4971 (best @ step 4500) |
+| **Perplexity** | 33.02 |
+
+#### Training Progress
+
+The model showed steady improvement throughout training:
+- Initial train loss (step 25): 9.4245
+- Mid-training train loss (step 2300): 3.7579
+- Final train loss (step 4575): 3.2368
+- Best eval loss: 3.4971 (step 4500)
+- Final eval loss / PPL: 3.4975 / 33.03
+
+> **Note:** As a 30M parameter base model, this checkpoint should be treated as a functional proof-of-concept baseline. It has not been evaluated on external benchmarks such as MMLU or GSM8K.
+
+---
+
+## Technical Specifications
+
+### Model Architecture and Objective
+
+<details>
+<summary><b>Click to view full architecture specifications</b></summary>
+
+Stentor-30M utilizes the Llama architecture with the following specific configuration:
+
+| Component | Value |
+|-----------|-------|
+| Hidden Size | 256 |
+| Intermediate Size | 1024 |
+| Num Hidden Layers | 21 |
+| Attention Heads | 4 |
+| Key/Value Heads | 4 |
+| Hidden Activation | SiLU |
+| RoPE Theta | 10000.0 |
+| Max Position Embeddings | 512 |
+| Vocab Size | 32,768 |
+| Tie Word Embeddings | True |
+
+> **Architecture Note:** This configuration is set to 21 layers to keep total parameters in the 30M-31M target range with a 32,768-token vocabulary.
+
+</details>
+
+### Compute Infrastructure
+
+The model was trained using standard cloud infrastructure available to researchers and students.
+
+#### Hardware
+
+- **GPUs:** 1x NVIDIA Tesla T4 (16GB)
+- **Platform:** Kaggle Notebooks (free tier)
+- **Compute Type:** Cloud-based
+
+#### Software
+
+- **Transformers Version:** 5.2.0
+- **PyTorch Version:** Latest stable
+- **Torch Compile:** False (disabled for notebook stability)
+- **Accelerate:** Enabled for training
+
+---
+
+## Environmental Impact
+
+- **Hardware Type:** 1x NVIDIA Tesla T4
+- **Hours used:** ~7.88 hours
+- **Cloud Provider:** Kaggle
+- **Compute Region:** US West
+- **Carbon Emitted:** ~160 gCO2e (estimated)
+
+Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.
+
+---
+
+## Related Resources
+
+### Official Resources
+- 📊 Best model artifact: `results/best_model` (config + tokenizer + weights + metadata)
+- 🎓 [Model Card Methodology](https://arxiv.org/abs/1810.03993) - Mitchell et al., 2018
+
+### Quantized Versions
+- 🗜️ [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) - GGUF quantizations for llama.cpp, LM Studio, Ollama
+
+### Related Models
+- [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) - Larger alternative (1.1B params)
+- [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) - Similar size category
+- [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) - Target model for speculative decoding
+
+### Research Papers
+- [Speculative Decoding](https://arxiv.org/abs/2211.17192) - Leviathan et al., 2023
+- [Small Language Models Survey](https://arxiv.org/abs/2402.14848) - Survey on efficient LLMs
+
+---
+
+## Citation
+
+```bibtex
+@misc{izumoto2026stentor30m,
+      title={Stentor-30M: A Compact Llama-based Language Model}, 
+      author={Kai Izumoto},
+      year={2026},
+      publisher={StentorLabs},
+      howpublished={\url{https://huggingface.co/StentorLabs/Stentor-30M}}
+}
+```
+
+---
+
+## Glossary
+
+- **NLP (Natural Language Processing):** The field of AI focused on the interaction between computers and human language.
+- **PPL (Perplexity):** A measurement of how well a probability model predicts a sample. Lower is generally better.
+- **Speculative Decoding:** A technique where a small "draft" model (like Stentor-30M) quickly generates tokens that are then verified by a larger model, speeding up the overall process.
+- **SLM (Small Language Model):** Language models with typically fewer than 1B parameters, designed for efficiency and specific tasks.
+- **RoPE (Rotary Position Embedding):** A method for encoding position information in transformer models.
+- **Edge Deployment:** Running models on resource-constrained devices like mobile phones or IoT devices.
+- **GGUF:** A file format used by llama.cpp and compatible runtimes for efficient local inference.
+
+---
+
+## Model Card Contact
+
+For questions, please contact [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open a discussion on the [model repository](https://huggingface.co/StentorLabs/Stentor-30M/discussions).
+
+---
+
+## Acknowledgments
+
+Special thanks to:
+- Hugging Face for the transformers library and dataset hosting
+- The creators of FineWeb-Edu and Cosmopedia v2 datasets
+- Kaggle for providing free GPU compute resources
+- [mradermacher](https://huggingface.co/mradermacher) for providing GGUF quantizations
+- The open-source community for making accessible AI research possible
+
+---
+
+## Connect & Community
+
+### Stay Updated
+- 📧 [Email](mailto:StentorLabs@gmail.com) - Direct contact
+- 💬 [HuggingFace Discussions](https://huggingface.co/StentorLabs/Stentor-30M/discussions) - Questions and community chat
+
+### More from StentorLabs
+- 🔬 [All Models](https://huggingface.co/StentorLabs) - Browse our model collection
+
+---
+
+<p align="center">
+  Made with ❤️ by <a href="https://huggingface.co/StentorLabs">StentorLabs</a>
+  <br>
+  <i>Democratizing AI through accessible, efficient models</i>
+</p>
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,87 @@
+{%- if messages[0]["role"] == "system" %}
+    {%- set system_message = messages[0]["content"] %}
+    {%- set loop_messages = messages[1:] %}
+{%- else %}
+    {%- set loop_messages = messages %}
+{%- endif %}
+{%- if not tools is defined %}
+    {%- set tools = none %}
+{%- endif %}
+{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}
+
+{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}
+{%- set ns = namespace() %}
+{%- set ns.index = 0 %}
+{%- for message in loop_messages %}
+    {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}
+        {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}
+            {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}
+        {%- endif %}
+        {%- set ns.index = ns.index + 1 %}
+    {%- endif %}
+{%- endfor %}
+
+{{- bos_token }}
+{%- for message in loop_messages %}
+    {%- if message["role"] == "user" %}
+        {%- if tools is not none and (message == user_messages[-1]) %}
+            {{- "[AVAILABLE_TOOLS] [" }}
+            {%- for tool in tools %}
+                {%- set tool = tool.function %}
+                {{- '{"type": "function", "function": {' }}
+                {%- for key, val in tool.items() if key != "return" %}
+                    {%- if val is string %}
+                        {{- '"' + key + '": "' + val + '"' }}
+                    {%- else %}
+                        {{- '"' + key + '": ' + val|tojson }}
+                    {%- endif %}
+                    {%- if not loop.last %}
+                        {{- ", " }}
+                    {%- endif %}
+                {%- endfor %}
+                {{- "}}" }}
+                {%- if not loop.last %}
+                    {{- ", " }}
+                {%- else %}
+                    {{- "]" }}
+                {%- endif %}
+            {%- endfor %}
+            {{- "[/AVAILABLE_TOOLS]" }}
+            {%- endif %}
+        {%- if loop.last and system_message is defined %}
+            {{- "[INST] " + system_message + "\n\n" + message["content"] + "[/INST]" }}
+        {%- else %}
+            {{- "[INST] " + message["content"] + "[/INST]" }}
+        {%- endif %}
+    {%- elif message.tool_calls is defined and message.tool_calls is not none %}
+        {{- "[TOOL_CALLS] [" }}
+        {%- for tool_call in message.tool_calls %}
+            {%- set out = tool_call.function|tojson %}
+            {{- out[:-1] }}
+            {%- if not tool_call.id is defined or tool_call.id|length != 9 %}
+                {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
+            {%- endif %}
+            {{- ', "id": "' + tool_call.id + '"}' }}
+            {%- if not loop.last %}
+                {{- ", " }}
+            {%- else %}
+                {{- "]" + eos_token }}
+            {%- endif %}
+        {%- endfor %}
+    {%- elif message["role"] == "assistant" %}
+        {{- " " + message["content"]|trim + eos_token}}
+    {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}
+        {%- if message.content is defined and message.content.content is defined %}
+            {%- set content = message.content.content %}
+        {%- else %}
+            {%- set content = message.content %}
+        {%- endif %}
+        {{- '[TOOL_RESULTS] {"content": ' + content|string + ", " }}
+        {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}
+            {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
+        {%- endif %}
+        {{- '"call_id": "' + message.tool_call_id + '"}[/TOOL_RESULTS]' }}
+    {%- else %}
+        {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
+    {%- endif %}
+{%- endfor %}
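For readers who want to see what prompt string this template produces without running Jinja, the following is a plain-Python approximation of its non-tool branches only (system, user, and assistant messages). The `render` function is hypothetical and omits the tool-call and tool-result handling; the BOS/EOS strings are taken from `tokenizer_config.json`.

```python
BOS, EOS = "<s>", "</s>"  # from tokenizer_config.json


def render(messages):
    """Plain-Python approximation of the template's non-tool branches."""
    system = None
    if messages and messages[0]["role"] == "system":
        system, messages = messages[0]["content"], messages[1:]

    out = BOS
    for i, msg in enumerate(messages):
        # After the optional system message, roles must alternate.
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError("roles must alternate user/assistant/...")
        if msg["role"] == "user":
            # The system message is folded into the final user turn only.
            is_last = i == len(messages) - 1
            prefix = f"{system}\n\n" if (system is not None and is_last) else ""
            out += f"[INST] {prefix}{msg['content']}[/INST]"
        else:
            out += f" {msg['content'].strip()}{EOS}"
    return out


print(render([{"role": "user", "content": "Hello, what are you?"}]))
# <s>[INST] Hello, what are you?[/INST]
```

Note that the system message is prepended only when the last message in the conversation is a user turn, mirroring the `loop.last` check in the template.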
--- a/config.json
+++ b/config.json
@@ -0,0 +1,32 @@
+{
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 256,
+  "initializer_range": 0.02,
+  "intermediate_size": 1024,
+  "max_position_embeddings": 512,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 4,
+  "num_hidden_layers": 21,
+  "num_key_value_heads": 4,
+  "pad_token_id": 2,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "rope_theta": 10000.0,
+    "rope_type": "default"
+  },
+  "tie_word_embeddings": true,
+  "transformers_version": "5.2.0",
+  "use_cache": true,
+  "vocab_size": 32768
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,10 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "output_attentions": false,
+  "output_hidden_states": false,
+  "pad_token_id": 2,
+  "transformers_version": "5.2.0",
+  "use_cache": true
+}
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7cd2171cbc7fc7882408d0658847d3c4093b3a7e73e184214b2881d06165d893
+size 121699864
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,14 @@
+{
+  "add_prefix_space": true,
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "legacy": false,
+  "model_max_length": 512,
+  "pad_token": "</s>",
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
+}
--- a/training_loss.png
+++ b/training_loss.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7d0a6448211629c1131dd0ef52ea122a63e4ac083d296327880e58140e332332
+size 142350
--- a/training_perplexity.png
+++ b/training_perplexity.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b59b7c5cc8e1f438debc8b437df1aec603addb229984150582679ebee60b7c73
+size 168440