---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation
- causal-lm
- llama
- gguf-compatible
- small-language-model
- local-ai
- english
- veyra
language:
- en
datasets:
- HuggingFaceTB/smollm-corpus
- codeparrot/github-code-clean
- HuggingFaceH4/ultrachat_200k
metrics:
- blimp
---

# Veyra2 15M Base 1B Tokens

Veyra2 15M Base 1B Tokens is a compact English causal language model trained from scratch for fast local inference, experimentation, and downstream fine-tuning.

This release is the first production checkpoint in the Veyra2 15M line. It uses a Llama-compatible architecture, which makes it easier to convert to GGUF, run locally, and build downstream instruct or tool-use variants.

The model is a base language model, not an instruction-tuned assistant. It is best used for text completion, continued pretraining, evaluation, and as a starting point for fine-tuning.

## Training loss over 1B tokens

[Figure: training loss curve over 1B tokens]

## Model details

| Field | Value |
|---|---:|
| Parameters | 14,685,888 |
| Model family name | Veyra2 15M |
| Architecture | Llama-compatible causal LM |
| Layers | 6 |
| Hidden size | 448 |
| Intermediate size | 1024 |
| Attention heads | 7 |
| KV heads | 1 |
| Context length | 1024 tokens |
| Vocabulary size | 8192 |
| Training tokens | ~1,000,000,000 |
| Precision during training | bfloat16 model weights with fp32 optimizer state |
| License | Apache 2.0 |
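
As a sanity check, the parameter count can be reproduced from the other fields in the table. The sketch below assumes a standard Llama layout that this card does not spell out: bias-free linear layers, a head dimension of hidden size divided by attention heads (448 / 7 = 64), two RMSNorm weights per layer plus a final one, and input and output embeddings that share a single matrix. The function name is illustrative.

```python
def llama_param_count(vocab, hidden, intermediate, layers, heads, kv_heads):
    """Count parameters for a Llama-style decoder (assumes tied embeddings, no biases)."""
    head_dim = hidden // heads
    embed = vocab * hidden                    # shared input/output embedding (assumed tied)
    attn = (
        hidden * heads * head_dim             # q_proj
        + 2 * hidden * kv_heads * head_dim    # k_proj and v_proj
        + heads * head_dim * hidden           # o_proj
    )
    mlp = 3 * hidden * intermediate           # gate, up, and down projections
    norms = 2 * hidden                        # two RMSNorm weights per layer
    final_norm = hidden
    return embed + layers * (attn + mlp + norms) + final_norm


total = llama_param_count(
    vocab=8192, hidden=448, intermediate=1024, layers=6, heads=7, kv_heads=1
)
print(f"{total:,}")  # 14,685,888
```

Under these assumptions the result matches the reported 14,685,888, which is consistent with tied embeddings. Treat that as an inference from the arithmetic and confirm it against the released `config.json`.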

## Training data

Veyra2 15M Base 1B was trained on an English-heavy mixture designed for small-model local utility:

- 80% Cosmopedia v2 style educational and synthetic textbook data
- 10% Python/code-oriented data
- 10% chat/instruction-style data

The chat portion uses ChatML-style formatting, so the base model may sometimes continue ChatML conversations or emit ChatML special tokens. This is expected behavior for the base checkpoint and is useful for later instruction tuning, but this model should not be treated as a polished chat assistant.
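
To make the mixture concrete, the short sketch below turns the listed shares into approximate token budgets. It assumes the percentages are measured in tokens rather than documents, which this card does not state, and the dictionary keys are illustrative labels.

```python
TOTAL_TOKENS = 1_000_000_000  # approximate training budget from the model details

# Mixture shares in percent, as listed above (assumed to be token shares).
MIXTURE = {"educational": 80, "code": 10, "chat": 10}


def token_budget(total, mixture):
    """Split a total token budget according to integer percentage shares."""
    if sum(mixture.values()) != 100:
        raise ValueError("mixture shares must sum to 100")
    return {name: total * pct // 100 for name, pct in mixture.items()}


budget = token_budget(TOTAL_TOKENS, MIXTURE)
for name, tokens in budget.items():
    print(f"{name}: ~{tokens:,} tokens")
```

On this reading, roughly 100M tokens of ChatML-formatted text were seen during pretraining, which helps explain the continuation behavior described above.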

## Training setup

The model was trained for approximately 1B tokens with a 1024-token sequence length. The optimizer recipe used CosineGatedAdam for matrix parameters and AdamW for auxiliary parameters such as embeddings and normalization weights.
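
One way to picture the split between optimizers is a rule that routes each parameter by shape and name. The sketch below is an assumption about how such a partition could look, not the training code: it uses Hugging Face Llama parameter names with shapes derived from the model details, treats two-dimensional non-embedding weights as matrix parameters, and sends everything else to the auxiliary group. It only builds the two name lists and does not construct either optimizer.

```python
# Illustrative subset of parameter names and shapes for a Llama-style model
# with hidden size 448, intermediate size 1024, and one 64-dim KV head.
PARAMS = {
    "model.embed_tokens.weight": (8192, 448),
    "model.layers.0.self_attn.q_proj.weight": (448, 448),
    "model.layers.0.self_attn.k_proj.weight": (64, 448),
    "model.layers.0.mlp.gate_proj.weight": (1024, 448),
    "model.layers.0.input_layernorm.weight": (448,),
    "model.norm.weight": (448,),
}


def split_param_groups(params):
    """Return (matrix, auxiliary) name lists under the assumed routing rule."""
    matrix, auxiliary = [], []
    for name, shape in params.items():
        is_embedding = "embed" in name
        if len(shape) == 2 and not is_embedding:
            matrix.append(name)      # would go to the matrix optimizer
        else:
            auxiliary.append(name)   # embeddings and norm weights go to AdamW
    return matrix, auxiliary


matrix, auxiliary = split_param_groups(PARAMS)
print("matrix:", matrix)
print("auxiliary:", auxiliary)
```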

Training logs near the end of the run reported approximately:

| Metric | Value |
|---|---:|
| Average training speed | ~315k tokens/sec |
| Peak VRAM during training | ~55 GB |

These numbers come from the training stream and are not a replacement for downstream task evaluation.

## Evaluation

### Quick streamed eval

A quick streamed sanity eval was run on Cosmopedia-style data for 262,144 tokens.

| Metric | Value |
|---|---:|
| Eval tokens | 262,144 |
| Eval loss | 2.82 |
| Eval perplexity | 16.82 |
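
Perplexity is the exponential of the mean per-token cross-entropy, so the two rows can be checked against each other. The sketch below assumes the loss is reported in nats (natural logarithm), which is the usual convention for PyTorch cross-entropy but is not stated in this card.

```python
import math

reported_loss = 2.82          # mean cross-entropy per token, assumed to be in nats
reported_perplexity = 16.82

# Convert in both directions to see how far rounding moves the numbers.
perplexity_from_loss = math.exp(reported_loss)
loss_from_perplexity = math.log(reported_perplexity)

print(f"exp({reported_loss}) = {perplexity_from_loss:.2f}")
print(f"ln({reported_perplexity}) = {loss_from_perplexity:.4f}")
```

The small gap between `exp(2.82)` and the reported 16.82 is what you would expect if the loss was rounded to two decimals after the perplexity was computed.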

### BLiMP

BLiMP was evaluated using the official `nyu-mll/blimp` dataset with mean token log-likelihood scoring.

| Metric | Value |
|---|---:|
| Total examples | 67,000 |
| Correct | 38,750 |
| Overall accuracy | 57.84% |

BLiMP measures targeted grammatical minimal-pair sensitivity. It should not be interpreted as a general capability benchmark.
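
The scoring rule can be sketched without loading the model. In the example below, the function names are hypothetical and the per-token log-probabilities are made-up values standing in for model output; a pair counts as correct when the grammatical sentence has the higher mean token log-likelihood. The last lines recompute the overall accuracy from the table.

```python
def mean_token_logprob(token_logprobs):
    """Average log-probability per token for one sentence."""
    return sum(token_logprobs) / len(token_logprobs)


def prefers_grammatical(good_logprobs, bad_logprobs):
    """True if the grammatical sentence scores higher than the ungrammatical one."""
    return mean_token_logprob(good_logprobs) > mean_token_logprob(bad_logprobs)


# Made-up per-token log-probabilities for a single minimal pair.
good = [-2.1, -0.7, -1.3, -0.9]
bad = [-2.1, -0.7, -3.8, -0.9, -1.0]
print(prefers_grammatical(good, bad))

# Overall accuracy from the reported counts.
correct, total = 38_750, 67_000
accuracy = 100 * correct / total
print(f"{accuracy:.2f}%")
```

Averaging over tokens rather than summing keeps a sentence from being penalized only because it tokenizes into more pieces. Since each pair has two candidates, chance level is 50%.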

## Intended use

Veyra2 15M Base 1B is intended for:

- local text completion experiments
- lightweight CPU-friendly language modeling
- downstream instruction tuning
- small-model research
- grammar, tokenizer, quantization, and local-inference experiments
- building small ChatML, tool-use, Python, or function-calling variants

For direct assistant-style use, wait for an instruction-tuned Veyra2 model or fine-tune this base checkpoint yourself.

## What to expect

This is a very small base model. You should expect coherent short completions, recognizable educational prose, some code-like continuations, and occasional ChatML continuation behavior.

You should not expect high factual reliability, robust reasoning, strong instruction following, safety alignment, or long-context consistency. The model may hallucinate, repeat itself, produce incorrect facts, or continue prompts in unexpected formats.

## Example usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "veyra-ai/veyra2-15m-base-1b-tokens"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package to be installed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The purpose of a small language model is"

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(model.device)

# Fall back to the EOS token if the tokenizer defines no padding token
pad_token_id = tokenizer.pad_token_id
if pad_token_id is None:
    pad_token_id = tokenizer.eos_token_id

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        temperature=0.7,
        top_p=0.92,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=pad_token_id,
    )

# Decode only the newly generated tokens, not the prompt
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

## Experimental ChatML continuation

The base model has seen ChatML-style data during pretraining. You can experiment with prompts like:

```text
<|im_start|>user
Explain what a stack is in simple terms.
<|im_end|>
<|im_start|>assistant
```

This is completion behavior, not instruction tuning. The model may continue the conversation format but should not be treated as a reliable assistant.
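
If you want to generate such prompts programmatically, a small helper is enough. The sketch below mirrors the line layout of the prompt shown above; the function name is hypothetical, and whether the tokenizer maps `<|im_start|>` and `<|im_end|>` to single special tokens is not stated in this card, so inspect the tokenized output before relying on it.

```python
def build_chatml_prompt(user_message):
    """Format one user turn and open an assistant turn for the model to continue."""
    return (
        "<|im_start|>user\n"
        f"{user_message}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt("Explain what a stack is in simple terms.")
print(prompt)
```

The resulting string can be passed as `prompt` in the Example usage code. Stopping generation once `<|im_end|>` appears in the output is a reasonable way to cut off runaway multi-turn continuations.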

## GGUF and quantization

> Coming soon.

## Limitations

- English-focused
- Not instruction tuned
- Not safety aligned
- May hallucinate facts
- May produce repetitive or malformed text
- Limited context length of 1024 tokens
- Small parameter count limits reasoning and world knowledge
- ChatML behavior is learned as text continuation, not as a robust assistant policy

## License

This model is released under the Apache 2.0 license unless otherwise noted. Please retain attribution to Veyra AI when redistributing models or releasing derivative work.

## Citation / attribution

If you use this model, please refer to it as `Veyra2 15M Base 1B Tokens` by Veyra AI.