---
language:
- it
- en
license: apache-2.0
tags:
- small-language-model
- slm
- edge-ai
- italian
- bilingual
- base-model
- pretrained
- llama
- nanotron
- from-scratch
model_type: llama
pipeline_tag: text-generation
library_name: transformers
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finepdfs
- bigcode/starcoderdata
---
# Zagreus-0.4B-ita
**Zagreus-0.4B-ita** is a bilingual English/Italian foundational Small Language Model (SLM) trained **from scratch** by the [mii-llm](https://mii-llm.ai) community (*Made in Italy Large Language Model*) on the [Seeweb](https://www.seeweb.it) HPC infrastructure.
This is a **base (pre-trained) model** — it is not instruction-tuned and is intended for researchers, developers, and practitioners who want to fine-tune or build upon a high-quality bilingual English/Italian foundation. It serves as the base for the entire [Nesso model family](https://huggingface.co/mii-llm).
The Zagreus family is one of the few openly released, high-performing small language model families dedicated to European Romance languages, trained entirely from scratch with a fully transparent pipeline.
---
## Model Details
| Property | Value |
|---|---|
| **Architecture** | Modified Llama-3.2 (fully dense) |
| **Parameters** | ~400M |
| **Hidden size** | 960 |
| **Intermediate size** | 2560 |
| **Layers** | 32 |
| **Attention heads** | 15 (KV heads: 5) |
| **Activation** | SiLU |
| **Context length** | 4096 tokens |
| **Tokenizer** | Llama-3.2 (`vocab_size`: 128,256) |
| **Positional encoding** | RoPE (`theta`: 10000.0) |
| **Tied embeddings** | Yes |
| **Precision** | BF16 |
| **Languages** | English (~400B tokens), Italian (~400B tokens) |
| **Training tokens** | ~1 trillion |
| **Training framework** | [Nanotron (mii-llm fork)](https://github.com/mii-llm/nanotron) |
| **Infrastructure** | 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs), Seeweb HPC |
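
As a sanity check on the ~400M figure, the parameter count can be estimated from the values in the table above. The sketch below assumes a standard bias-free Llama layout (separate q/k/v/o projections, a gated MLP with three matrices, two RMSNorm weights per layer, one final norm). The card describes the architecture as "modified", so the real count may differ slightly. The function name `llama_param_count` is illustrative and not part of any library.

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads,
                      tie_embeddings=True):
    """Estimate parameters of a bias-free Llama-style decoder (assumed layout)."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim              # width of the K and V projections (GQA)

    embed = vocab_size * hidden_size              # token embedding matrix
    attn = 2 * hidden_size * hidden_size          # q_proj and o_proj
    attn += 2 * hidden_size * kv_dim              # k_proj and v_proj
    mlp = 3 * hidden_size * intermediate_size     # gate, up and down projections
    norms = 2 * hidden_size                       # two RMSNorm weights per layer

    total = embed + num_layers * (attn + mlp + norms) + hidden_size  # + final norm
    if not tie_embeddings:
        total += vocab_size * hidden_size         # separate output head
    return total


total_params = llama_param_count(
    vocab_size=128_256, hidden_size=960, intermediate_size=2560,
    num_layers=32, num_heads=15, num_kv_heads=5, tie_embeddings=True,
)
print(f"{total_params:,} parameters")                   # 437,760,960 parameters
print(f"{total_params * 2 / 2**30:.2f} GiB in BF16")    # 2 bytes per parameter
```

Under these assumptions the estimate is about 438M parameters, of which roughly 123M sit in the tied embedding matrix.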
---
## Training Data
All datasets used are fully open source and available on the Hugging Face Hub:
| Dataset | Tokens | Description |
|---|---|---|
| [FineWeb (350BT sample)](https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-350BT) | ~350B | High-quality English web text |
| [FineWeb-2 (ita_Latn)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2/viewer/ita_Latn) | — | Italian web text |
| [FinePDFs (ita_Latn)](https://huggingface.co/datasets/HuggingFaceFW/finepdfs/viewer/ita_Latn) | — | Italian PDF documents |
| [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata) | ~250B | Multilingual code |
**Token distribution**: ~400B English + ~400B Italian + ~200B Code ≈ **1 trillion tokens total**
### Tokenization
Raw datasets were tokenized using the **Llama-3.2 tokenizer** (`meta-llama/Llama-3.2-1B`) via the [datatrove](https://github.com/huggingface/datatrove) library. The process ran for over **three weeks of continuous computation** on CPU nodes via Slurm, generating approximately 35 TB of tokenized data shards.
---
## Architecture Choice
We adopted a **modified Llama-3.2 fully dense architecture**. The choice of a dense model over Mixture-of-Experts (MoE) in the small-parameter regime (~500M) was deliberate: in tightly constrained capacity settings, the routing overhead and expert under-utilization typical of MoE architectures may offset their theoretical efficiency advantages. At this scale, dense models tend to provide better compute utilization and more stable training dynamics.
---
## Pre-training Configuration
Full Nanotron YAML configuration used for training:
```yaml
checkpoints:
  checkpoint_interval: 5000
  checkpoints_path: checkpoints_zagreus_ita_v2
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: /training/pretraining/nanotron/checkpoints_zagreus_ita_v2/630000
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder:
      - /training/pretraining/fineweb-ita/tokenized
      - /training/pretraining/fineweb-edu-350BT/000_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/011_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/012_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/013_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/014_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/015_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/016_tokenized_output
      - /training/pretraining/finepdf-ita/000_tokenized_output
      - /training/pretraining/starcoder_tokenized/000_tokenized_output
    num_loading_workers: 0
    seed: 8
  name: stable phase
  start_training_step: 1
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: zagreus
  run: zagreus-350M
  seed: 8
  step: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 100
  dtype: bfloat16
  init_method:
    std: 0.03227
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 960
    initializer_range: 0.02
    intermediate_size: 2560
    is_llama_config: true
    max_position_embeddings: 4096
    num_attention_heads: 15
    num_hidden_layers: 32
    num_key_value_heads: 5
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 10000.0
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.003
    lr_decay_starting_step: 750000
    lr_decay_steps: 50000
    lr_decay_style: linear
    lr_warmup_steps: 4000
    lr_warmup_style: linear
    min_decay_lr: 1.0e-7
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 64
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  recompute_layer: false
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 4
  sequence_length: 4096
  train_steps: 2000000
  val_check_interval: 5000
```
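
The `learning_rate_scheduler` fields describe a warmup, plateau, and decay schedule. The sketch below is one reading of those fields (linear warmup from 0, constant peak, linear decay to `min_decay_lr`, then flat). It is not Nanotron's code, and the exact behaviour at the boundaries may differ. The function name `lr_at` is hypothetical.

```python
def lr_at(step, peak=3e-3, warmup_steps=4_000,
          decay_start=750_000, decay_steps=50_000, min_lr=1e-7):
    """Piecewise-linear learning rate implied by the config fields (assumed shape)."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak * step / warmup_steps
    if step < decay_start:
        # Constant plateau at the peak learning rate.
        return peak
    if step < decay_start + decay_steps:
        # Linear decay from the peak to min_lr over decay_steps.
        frac = (step - decay_start) / decay_steps
        return peak + (min_lr - peak) * frac
    # After the decay window the rate stays at min_lr.
    return min_lr


for s in (0, 2_000, 4_000, 630_000, 750_000, 775_000, 800_000):
    print(f"step {s:>7,}: lr = {lr_at(s):.3e}")
```

Under this reading the decay window ends at step 800,000, well before the configured `train_steps` of 2,000,000.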
### Slurm Launch Script
```bash
#!/bin/bash
#SBATCH --job-name=350_it
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --partition=PARTITION
#SBATCH --nodes=8
#SBATCH --gres=gpu:8 # 8 A100 per node = 64 total
#SBATCH --cpus-per-task=32
#SBATCH --time=4-00:00:00
#SBATCH --output=slurm-%j.out
################ 0. Environment ################
module purge
module load profile/global
module load python/3.11 cuda/12.2 cudnn nccl gcc
source /path/to/venv/nanotron/bin/activate
export HF_HOME=/path/to/hf_home
export TRANSFORMERS_OFFLINE=1
export HF_HUB_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME="ib0,eno,eth"
export WANDB_MODE=disabled
################ 1. Distributed vars ############
GPUS_PER_NODE=8   # must match --gres=gpu:8 so that 8 nodes x 8 GPUs = dp 64
NNODES=$SLURM_JOB_NUM_NODES
NODE_RANK=$SLURM_NODEID
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
MASTER_PORT=29400
RDZV_ID=$SLURM_JOB_ID
################ 2. Launch ######################
srun torchrun \
  --nnodes $NNODES \
  --nproc_per_node $GPUS_PER_NODE \
  --rdzv_id $RDZV_ID \
  --rdzv_backend c10d \
  --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
  /path/to/nanotron/run_train.py \
  --config-file smollm2/zagreus_350M_ita.yaml
```
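
The launch settings have to agree with the `parallelism` section of the config: the number of launched processes (nodes × processes per node) should equal `dp × tp × pp`. The sketch below checks that arithmetic with the values stated in this card and derives the tokens consumed per optimizer step. Reading a checkpoint label such as `v2-775k` as step 775,000 is an assumption.

```python
# Values copied from the Nanotron config and the Slurm header in this card.
dp, tp, pp = 64, 1, 1
nodes = 8
micro_batch_size = 4
batch_accumulation_per_replica = 1
sequence_length = 4096

# One process per GPU: the world size must cover every parallel rank.
world_size = dp * tp * pp
procs_per_node, remainder = divmod(world_size, nodes)
assert remainder == 0, "world size must divide evenly across nodes"

# Each data-parallel replica sees micro_batch_size sequences per accumulation step.
tokens_per_step = (dp * micro_batch_size
                   * batch_accumulation_per_replica * sequence_length)


def tokens_seen(step):
    """Tokens consumed after `step` optimizer steps at a constant batch size."""
    return step * tokens_per_step


print(f"world size: {world_size}, processes per node: {procs_per_node}")
print(f"tokens per step: {tokens_per_step:,}")
for step in (95_000, 630_000, 775_000):  # assumed meaning of the v2-* labels
    print(f"step {step:,}: {tokens_seen(step) / 1e9:.1f}B tokens")
```

With these values each step consumes 1,048,576 tokens, so 64 processes spread over 8 nodes means 8 processes per node.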
### Checkpoint Conversion to Hugging Face Format
```bash
torchrun --nproc_per_node=1 -m examples.llama.convert_nanotron_to_hf \
  --checkpoint_path=checkpoints/544000 \
  --save_path=hf_checkpoints/544000 \
  --tokenizer_name meta-llama/Llama-3.2-1B
```
---
## Evaluation
### Evaluation Commands
```bash
lm-eval --model hf --model_args pretrained=<checkpoint> \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=<checkpoint> \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1
```
### Checkpoint Progression
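The table below tracks benchmark scores across training checkpoints. Scores fluctuate from one checkpoint to the next rather than rising monotonically: HellaSwag IT shows the clearest gain (0.3366 at v2-95k, 0.3750 at v2-775k), while MMLU IT stays close to 0.25 throughout. The v2-205k row reports MMLU IT only, so its average covers a single task.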
| Checkpoint | MMLU IT ↑ | HellaSwag IT ↑ | ARC IT ↑ | **Average** |
|---|---|---|---|---|
| v2-95k | 0.2529 | 0.3366 | 0.2652 | 0.2849 |
| v2-205k | 0.2628 | — | — | 0.2628 |
| v2-290k | 0.2428 | 0.3492 | 0.2335 | 0.2752 |
| v2-305k | 0.2598 | 0.3562 | 0.2652 | 0.2937 |
| v2-365k | 0.2566 | 0.3664 | 0.2712 | **0.2981** |
| v2-390k | 0.2556 | 0.3438 | 0.2498 | 0.2831 |
| v2-460k | 0.2540 | 0.3778 | 0.2549 | 0.2956 |
| v2-520k | 0.2540 | 0.3778 | 0.2549 | 0.2956 |
| v2-590k | 0.2547 | 0.3651 | 0.2455 | 0.2884 |
| v2-630k | 0.2562 | 0.3632 | 0.2643 | 0.2946 |
| v2-680k | 0.2538 | 0.3740 | 0.2592 | 0.2957 |
| v2-775k | 0.2535 | 0.3750 | 0.2583 | 0.2956 |
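
The **Average** column is the unweighted mean of the scores available in each row. The short sketch below recomputes it from the table values and picks the best checkpoint among rows with all three tasks. The variable and function names are illustrative.

```python
# (MMLU IT, HellaSwag IT, ARC IT) per checkpoint, copied from the table above.
# None marks a score that the table does not report.
scores = {
    "v2-95k":  (0.2529, 0.3366, 0.2652),
    "v2-205k": (0.2628, None, None),
    "v2-290k": (0.2428, 0.3492, 0.2335),
    "v2-305k": (0.2598, 0.3562, 0.2652),
    "v2-365k": (0.2566, 0.3664, 0.2712),
    "v2-390k": (0.2556, 0.3438, 0.2498),
    "v2-460k": (0.2540, 0.3778, 0.2549),
    "v2-520k": (0.2540, 0.3778, 0.2549),
    "v2-590k": (0.2547, 0.3651, 0.2455),
    "v2-630k": (0.2562, 0.3632, 0.2643),
    "v2-680k": (0.2538, 0.3740, 0.2592),
    "v2-775k": (0.2535, 0.3750, 0.2583),
}


def mean_available(values):
    """Unweighted mean over the scores that are present, rounded to 4 decimals."""
    present = [v for v in values if v is not None]
    return round(sum(present) / len(present), 4)


averages = {name: mean_available(vals) for name, vals in scores.items()}

# Only rows with all three tasks are comparable with each other.
complete = {name: avg for name, avg in averages.items()
            if None not in scores[name]}
best = max(complete, key=complete.get)

for name, avg in averages.items():
    print(f"{name:8s} {avg:.4f}")
print("best complete checkpoint:", best)
```

Restricting the comparison to complete rows keeps the single-task v2-205k average from being ranked against three-task averages.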
---
### Evalita Benchmark
Evalita is a comprehensive Italian NLP evaluation suite benchmarking models across a wide range of linguistic tasks, from classification and extraction to generation and semantic understanding. Evaluation was conducted using the `evalita-mp` task suite from [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
#### Evaluation Command
```bash
lm_eval --model hf \
  --model_args pretrained=mii-llm/zagreus-0.4B-ita \
  --tasks evalita-mp \
  --device cuda:0 \
  --batch_size 1
```
#### Results
| Task | Metric | Score |
|---|---|---|
| **Evalita-LLM (Overall)** | acc | **0.3226** |
| Admission Test | acc | 0.2137 |
| FAQ | acc | 0.2681 |
| Hate Speech Detection | f1 | 0.6056 |
| Lexical Substitution | f1 | 0.0000 |
| NER | f1 | 0.1611 |
| Relation Extraction | f1 | 0.1244 |
| Sentiment Analysis | f1 | 0.3660 |
| Summarization (Fanpage) | rouge1 | 0.1947 |
| Text Entailment | acc | 0.5133 |
| Word in Context | f1 | 0.4697 |
> Evalita results serve as a zero-shot baseline for the base model. For the comparison between the base model and the SFT variant, see the [Open-Zagreus-0.4B](https://huggingface.co/mii-llm/open-zagreus-0.4B) model card.
---
## Usage
This is a **base model** — it performs causal language modelling (text completion) and is not instruction-tuned. It is best suited as a starting point for fine-tuning.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/zagreus-0.4B-ita"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base model: text completion, not instruction following
prompt = "L'intelligenza artificiale è una disciplina che"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    do_sample=True,
    repetition_penalty=1.1,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
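
For deployment planning, the memory used by the key/value cache during generation can be estimated from the model configuration. The sketch below is back-of-the-envelope arithmetic, assuming BF16 cache entries (2 bytes each) and one K and one V tensor per layer. The function name `kv_cache_bytes` is hypothetical, and real runtimes add their own overhead.

```python
def kv_cache_bytes(seq_len, batch_size=1, num_layers=32, hidden_size=960,
                   num_heads=15, num_kv_heads=5, bytes_per_value=2):
    """Estimate KV-cache size for a grouped-query attention decoder."""
    head_dim = hidden_size // num_heads            # 960 / 15 = 64
    # K and V each store num_kv_heads * head_dim values per token per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


gqa = kv_cache_bytes(seq_len=4096)                   # 5 KV heads, as in this model
mha = kv_cache_bytes(seq_len=4096, num_kv_heads=15)  # hypothetical full multi-head variant

print(f"GQA cache at 4096 tokens: {gqa / 2**20:.0f} MiB")
print(f"MHA cache at 4096 tokens: {mha / 2**20:.0f} MiB")
```

With 5 KV heads shared by 15 query heads, the cache at the full 4096-token context is one third of what a full multi-head layout of the same width would need.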
For instruction-following, use the post-trained variants:
- 🗣️ [Nesso-0.4B-instruct](https://huggingface.co/mii-llm/nesso-0.4B-instruct) — conversational and instruction following
- 🤖 [Nesso-0.4B-agentic](https://huggingface.co/mii-llm/nesso-0.4B-agentic) — function calling and agentic tasks
- 🔓 [Open-Zagreus-0.4B](https://huggingface.co/mii-llm/open-zagreus-0.4B) — fully open-source SFT variant
---
## Full Model Family
### Base Models (Zagreus)
| Model | Languages | HuggingFace |
|---|---|---|
| **Zagreus-0.4B-ita** *(this model)* | English + Italian | [🤗 Link](https://huggingface.co/mii-llm/zagreus-0.4B-ita) |
| [Zagreus-0.4B-spa](https://huggingface.co/mii-llm/zagreus-0.4B-spa) | English + Spanish | [🤗 Link](https://huggingface.co/mii-llm/zagreus-0.4B-spa) |
| [Zagreus-0.4B-por](https://huggingface.co/mii-llm/zagreus-0.4B-por) | English + Portuguese | [🤗 Link](https://huggingface.co/mii-llm/zagreus-0.4B-por) |
| [Zagreus-0.4B-fra](https://huggingface.co/mii-llm/zagreus-0.4B-fra) | English + French | [🤗 Link](https://huggingface.co/mii-llm/zagreus-0.4B-fra) |
### Post-trained Models (Nesso)
| Model | Use Case | HuggingFace |
|---|---|---|
| [Nesso-0.4B-instruct](https://huggingface.co/mii-llm/nesso-0.4B-instruct) | Conversational / Instruction following | [🤗 Link](https://huggingface.co/mii-llm/nesso-0.4B-instruct) |
| [Nesso-0.4B-agentic](https://huggingface.co/mii-llm/nesso-0.4B-agentic) | Function calling / Agentic | [🤗 Link](https://huggingface.co/mii-llm/nesso-0.4B-agentic) |
| [Open-Zagreus-0.4B](https://huggingface.co/mii-llm/open-zagreus-0.4B) | Fully open source | [🤗 Link](https://huggingface.co/mii-llm/open-zagreus-0.4B) |
---
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{zagreus2025,
title = {The Joy and Pain of Training an LLM from Scratch:
A Technical Report on the Zagreus and Nesso Model Families},
author = {mii-llm community},
year = {2025},
howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}
```
---
## Acknowledgements
- **Antonio Baldassarra** (CEO, Seeweb) and **Marco Cristofanilli** (Head of AI, Seeweb) for commissioning and sponsoring the infrastructure
- The **Hugging Face** team for Nanotron, datatrove, FineWeb, FineWeb-2, and FinePDFs
- The **mii-llm** open-source community for contributions to multilingual evaluation harnesses and the Nanotron fork
---
## License
Released under the **Apache 2.0** license.
> Made with ❤️ in Italy by [mii-llm](https://mii-llm.ai)