---
language:
- it
- en
license: apache-2.0
tags:
- small-language-model
- slm
- edge-ai
- italian
- bilingual
- base-model
- pretrained
- llama
- nanotron
- from-scratch
model_type: llama
pipeline_tag: text-generation
library_name: transformers
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finepdfs
- bigcode/starcoderdata
---

# Zagreus-0.4B-ita

**Zagreus-0.4B-ita** is a bilingual English/Italian foundational Small Language Model (SLM) trained **from scratch** by the [mii-llm](https://mii-llm.ai) community (*Made in Italy – Large Language Model*) on the [Seeweb](https://www.seeweb.it) HPC infrastructure.

This is a **base (pre-trained) model**: it is not instruction-tuned and is intended for researchers, developers, and practitioners who want to fine-tune or build upon a high-quality bilingual English/Italian foundation. It serves as the base for the entire [Nesso model family](https://huggingface.co/mii-llm).

The Zagreus family represents one of the few openly released, high-performing small language models dedicated to European Romance languages, trained entirely from first principles with a fully transparent pipeline.


---

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Modified Llama-3.2 (fully dense) |
| **Parameters** | ~400M |
| **Hidden size** | 960 |
| **Intermediate size** | 2560 |
| **Layers** | 32 |
| **Attention heads** | 15 (KV heads: 5) |
| **Activation** | SiLU |
| **Context length** | 4096 tokens |
| **Tokenizer** | Llama-3.2 (`vocab_size`: 128,256) |
| **Positional encoding** | RoPE (`theta`: 10000.0) |
| **Tied embeddings** | Yes |
| **Precision** | BF16 |
| **Languages** | English (~400B tokens), Italian (~400B tokens) |
| **Training tokens** | ~1 trillion |
| **Training framework** | [Nanotron (mii-llm fork)](https://github.com/mii-llm/nanotron) |
| **Infrastructure** | 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs), Seeweb HPC |

---

## Training Data

All datasets used are fully open source and released by Hugging Face:

| Dataset | Tokens | Description |
|---|---|---|
| [FineWeb (350BT sample)](https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-350BT) | ~350B | High-quality English web text |
| [FineWeb-2 (ita_Latn)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2/viewer/ita_Latn) | — | Italian web text |
| [FinePDFs (ita_Latn)](https://huggingface.co/datasets/HuggingFaceFW/finepdfs/viewer/ita_Latn) | — | Italian PDF documents |
| [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata) | ~250B | Multilingual code |

**Token distribution**: ~400B English + ~400B Italian + ~200B Code ≈ **1 trillion tokens total**

### Tokenization

Raw datasets were tokenized using the **Llama-3.2 tokenizer** (`meta-llama/Llama-3.2-1B`) via the [datatrove](https://github.com/huggingface/datatrove) library. The process ran for over **three weeks of continuous computation** on CPU nodes via Slurm, generating approximately 3–5 TB of tokenized data shards.

---

## Architecture Choice

We adopted a **modified Llama-3.2 fully dense architecture**.
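
As a rough cross-check of the ~400M figure, the parameter count can be estimated from the dimensions listed in Model Details. The sketch below assumes the standard Llama layer layout (bias-free q/k/v/o projections, a gated MLP with three weight matrices, and two RMSNorm weight vectors per layer). The function name `llama_param_count` is illustrative, and because the architecture is described as modified, the true count may differ.

```python
def llama_param_count(hidden, intermediate, layers, heads, kv_heads, vocab, tied=True):
    """Estimate parameters of a Llama-style decoder (assumed layout, no biases)."""
    head_dim = hidden // heads                      # 960 // 15 = 64
    q_and_o = 2 * hidden * hidden                   # q_proj and o_proj
    k_and_v = 2 * hidden * kv_heads * head_dim      # grouped-query k_proj and v_proj
    mlp = 3 * hidden * intermediate                 # gate_proj, up_proj, down_proj
    norms = 2 * hidden                              # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = vocab * hidden                     # token embedding matrix
    lm_head = 0 if tied else vocab * hidden         # shared with embeddings when tied
    final_norm = hidden
    return layers * per_layer + embeddings + lm_head + final_norm


# Values taken from the Model Details table.
total = llama_param_count(
    hidden=960, intermediate=2560, layers=32,
    heads=15, kv_heads=5, vocab=128256, tied=True,
)
print(f"{total:,} parameters")
```

Under these assumptions the estimate comes to about 438M parameters, of which roughly 123M belong to the tied embedding matrix.
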
The choice of a dense model over Mixture-of-Experts (MoE) in the small-parameter regime (~500M) was deliberate: in tightly constrained capacity settings, routing overhead and expert under-utilization typical of MoE architectures may offset their theoretical efficiency advantages. Dense models provide better compute utilization and more stable training dynamics at this scale.

---

## Pre-training Configuration

Full Nanotron YAML configuration used for training:

```yaml
checkpoints:
  checkpoint_interval: 5000
  checkpoints_path: checkpoints_zagreus_ita_v2
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: /training/pretraining/nanotron/checkpoints_zagreus_ita_v2/630000
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder:
      - /training/pretraining/fineweb-ita/tokenized
      - /training/pretraining/fineweb-edu-350BT/000_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/011_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/012_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/013_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/014_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/015_tokenized_output
      - /training/pretraining/fineweb-edu-350BT/016_tokenized_output
      - /training/pretraining/finepdf-ita/000_tokenized_output
      - /training/pretraining/starcoder_tokenized/000_tokenized_output
    num_loading_workers: 0
    seed: 8
  name: stable phase
  start_training_step: 1
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: zagreus
  run: zagreus-350M
  seed: 8
  step: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 100
  dtype: bfloat16
  init_method:
    std: 0.03227
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 960
    initializer_range: 0.02
    intermediate_size: 2560
    is_llama_config: true
    max_position_embeddings: 4096
    num_attention_heads: 15
    num_hidden_layers: 32
    num_key_value_heads: 5
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 10000.0
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.003
    lr_decay_starting_step: 750000
    lr_decay_steps: 50000
    lr_decay_style: linear
    lr_warmup_steps: 4000
    lr_warmup_style: linear
    min_decay_lr: 1.0e-7
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 64
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  recompute_layer: false
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 4
  sequence_length: 4096
  train_steps: 2000000
  val_check_interval: 5000
```

### Slurm Launch Script

```bash
#!/bin/bash
#SBATCH --job-name=350_it
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --partition=PARTITION
#SBATCH --nodes=8
#SBATCH --gres=gpu:8          # 8 A100 per node = 64 total
#SBATCH --cpus-per-task=32
#SBATCH --time=4-00:00:00
#SBATCH --output=slurm-%j.out

################ 0. Environment ################
module purge
module load profile/global
module load python/3.11 cuda/12.2 cudnn nccl gcc
source /path/to/venv/nanotron/bin/activate

export HF_HOME=/path/to/hf_home
export TRANSFORMERS_OFFLINE=1
export HF_HUB_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME="ib0,eno,eth"
export WANDB_MODE=disabled

################ 1. Distributed vars ############
# Must match --gres=gpu:8 so that 8 nodes x 8 GPUs = 64 ranks (dp: 64 in the config)
GPUS_PER_NODE=8
NNODES=$SLURM_JOB_NUM_NODES
NODE_RANK=$SLURM_NODEID
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
MASTER_PORT=29400
RDZV_ID=$SLURM_JOB_ID

################ 2. Launch ######################
srun torchrun \
  --nnodes $NNODES \
  --nproc_per_node $GPUS_PER_NODE \
  --rdzv_id $RDZV_ID \
  --rdzv_backend c10d \
  --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
  /path/to/nanotron/run_train.py \
  --config-file smollm2/zagreus_350M_ita.yaml
```

### Checkpoint Conversion to Hugging Face Format

```bash
torchrun --nproc_per_node=1 -m examples.llama.convert_nanotron_to_hf \
  --checkpoint_path=checkpoints/544000 \
  --save_path=hf_checkpoints/544000 \
  --tokenizer_name meta-llama/Llama-3.2-1B
```

---

## Evaluation

### Evaluation Commands

```bash
lm-eval --model hf --model_args pretrained= \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained= \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1
```

### Checkpoint Progression

The table below tracks benchmark scores across training checkpoints. HellaSwag IT improves from 0.3366 (v2-95k) to 0.3750 (v2-775k), while MMLU IT and ARC IT fluctuate around 0.25 with no clear trend. The v2-205k average covers MMLU IT only, since the other two scores are not reported for that checkpoint.

| Checkpoint | MMLU IT ↑ | HellaSwag IT ↑ | ARC IT ↑ | **Average** |
|---|---|---|---|---|
| v2-95k | 0.2529 | 0.3366 | 0.2652 | 0.2849 |
| v2-205k | 0.2628 | — | — | 0.2628 |
| v2-290k | 0.2428 | 0.3492 | 0.2335 | 0.2752 |
| v2-305k | 0.2598 | 0.3562 | 0.2652 | 0.2937 |
| v2-365k | 0.2566 | 0.3664 | 0.2712 | **0.2981** |
| v2-390k | 0.2556 | 0.3438 | 0.2498 | 0.2831 |
| v2-460k | 0.2540 | 0.3778 | 0.2549 | 0.2956 |
| v2-520k | 0.2540 | 0.3778 | 0.2549 | 0.2956 |
| v2-590k | 0.2547 | 0.3651 | 0.2455 | 0.2884 |
| v2-630k | 0.2562 | 0.3632 | 0.2643 | 0.2946 |
| v2-680k | 0.2538 | 0.3740 | 0.2592 | 0.2957 |
| v2-775k | 0.2535 | 0.3750 | 0.2583 | 0.2956 |

---

### Evalita Benchmark

Evalita is a comprehensive Italian NLP evaluation suite benchmarking models across a wide range of linguistic tasks, from classification and
extraction to generation and semantic understanding. Evaluation was conducted using the `evalita-mp` task suite from [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

#### Evaluation Command

```bash
lm_eval --model hf \
  --model_args pretrained=mii-llm/zagreus-0.4B-ita \
  --tasks evalita-mp \
  --device cuda:0 \
  --batch_size 1
```

#### Results

| Task | Metric | Score |
|---|---|---|
| **Evalita-LLM (Overall)** | acc | **0.3226** |
| Admission Test | acc | 0.2137 |
| FAQ | acc | 0.2681 |
| Hate Speech Detection | f1 | 0.6056 |
| Lexical Substitution | f1 | 0.0000 |
| NER | f1 | 0.1611 |
| Relation Extraction | f1 | 0.1244 |
| Sentiment Analysis | f1 | 0.3660 |
| Summarization (Fanpage) | rouge1 | 0.1947 |
| Text Entailment | acc | 0.5133 |
| Word in Context | f1 | 0.4697 |

> Evalita results serve as a zero-shot baseline for the base model. For the comparison between the base model and the SFT variant, see the [Open-Zagreus-0.4B](https://huggingface.co/mii-llm/open-zagreus-0.4B) model card.

---

## Usage

This is a **base model**: it performs causal language modelling (text completion) and is not instruction-tuned. It is best suited as a starting point for fine-tuning.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/zagreus-0.4B-ita"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Base model: text completion, not instruction following
prompt = "L'intelligenza artificiale è una disciplina che"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    do_sample=True,
    repetition_penalty=1.1
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For instruction-following, use the post-trained variants:

- 🗣️ [Nesso-0.4B-instruct](https://huggingface.co/mii-llm/nesso-0.4B-instruct): conversational and instruction following
- 🤖 [Nesso-0.4B-agentic](https://huggingface.co/mii-llm/nesso-0.4B-agentic): function calling and agentic tasks
- 🔓 [Open-Zagreus-0.4B](https://huggingface.co/mii-llm/open-zagreus-0.4B): fully open-source SFT variant

---

## Full Model Family

### Base Models (Zagreus)

| Model | Languages | HuggingFace |
|---|---|---|
| **Zagreus-0.4B-ita** *(this model)* | English + Italian | [🤗 Link]() |
| [Zagreus-0.4B-spa](https://huggingface.co/mii-llm/zagreus-0.4B-spa) | English + Spanish | [🤗 Link](https://huggingface.co/mii-llm/zagreus-0.4B-spa) |
| [Zagreus-0.4B-por](https://huggingface.co/mii-llm/zagreus-0.4B-por) | English + Portuguese | [🤗 Link](https://huggingface.co/mii-llm/zagreus-0.4B-por) |
| [Zagreus-0.4B-fra](https://huggingface.co/mii-llm/zagreus-0.4B-fra) | English + French | [🤗 Link](https://huggingface.co/mii-llm/zagreus-0.4B-fra) |

### Post-trained Models (Nesso)

| Model | Use Case | HuggingFace |
|---|---|---|
| [Nesso-0.4B-instruct](https://huggingface.co/mii-llm/nesso-0.4B-instruct) | Conversational / Instruction following | [🤗 Link](https://huggingface.co/mii-llm/nesso-0.4B-instruct) |
| [Nesso-0.4B-agentic](https://huggingface.co/mii-llm/nesso-0.4B-agentic) | Function calling / Agentic | [🤗 Link](https://huggingface.co/mii-llm/nesso-0.4B-agentic) |
| [Open-Zagreus-0.4B](https://huggingface.co/mii-llm/open-zagreus-0.4B) | Fully open source | [🤗 Link](https://huggingface.co/mii-llm/open-zagreus-0.4B) |

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{zagreus2025,
  title        = {The Joy and Pain of Training an LLM from Scratch: A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}
```

---

## Acknowledgements

- **Antonio Baldassarra** (CEO, Seeweb) and **Marco Cristofanilli** (Head of AI, Seeweb) for commissioning and sponsoring the infrastructure
- The **Hugging Face** team for Nanotron, datatrove, FineWeb, FineWeb-2, and FinePDFs
- The **mii-llm** open-source community for contributions to multilingual evaluation harnesses and the Nanotron fork

---

## License

Released under the **Apache 2.0** license.

> Made with ❤️ in Italy by [mii-llm](https://mii-llm.ai)