Initialize project; model provided by the ModelHub XC community

Model: mii-llm/open-zagreus-0.4B Source: Original Platform
2026-05-26 14:46:16 +08:00
commit 1ba6b09a40
9 changed files with 2569 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,36 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,330 @@
 ---
 language:
 - it
 - en
 license: apache-2.0
 tags:
 - small-language-model
 - slm
 - edge-ai
 - italian
 - bilingual
 - instruction-following
 - open-source
 - fully-reproducible
 - llama
 - nanotron
 - axolotl
 base_model: mii-llm/zagreus-0.4B-ita
 model_type: llama
 pipeline_tag: text-generation
 library_name: transformers
 datasets:
 - DeepMount00/OpenItalianData
 ---
 # Open-Zagreus-0.4B
 **Open-Zagreus-0.4B** is a fully open-source bilingual English/Italian Small Language Model (SLM) — open data, open weights, open recipe. It is post-trained on top of [Zagreus-0.4B-ita](https://huggingface.co/mii-llm/zagreus-0.4B-ita) using the publicly available [OpenItalianData](https://huggingface.co/datasets/DeepMount00/OpenItalianData) dataset published by Michele Montebovi, making the entire pipeline — from pre-training data to final weights — **fully reproducible by anyone**.
 This model is released by the [mii-llm](https://mii-llm.ai) community (*Made in Italy – Large Language Model*) as a contribution to the open-source Italian NLP ecosystem, demonstrating that it is possible to build competitive English/Italian language models using exclusively open resources.
 > ✅ **Fully open**: all training data, model weights, and training recipes are publicly available and reproducible.
 ---
 ## Model Details
 | Property | Value |
 |---|---|
 | **Architecture** | Modified Llama-3.2 (fully dense) |
 | **Parameters** | ~400M |
 | **Hidden size** | 960 |
 | **Layers** | 32 |
 | **Attention heads** | 15 (KV heads: 5) |
 | **Context length** | 4096 tokens |
 | **Tokenizer** | Llama-3.2 (base vocabulary: 128,256 tokens; `vocab_size` in `config.json`: 128,264) |
 | **Precision** | BF16 |
 | **Languages** | English, Italian |
 | **Base model** | mii-llm/zagreus-0.4B-ita |
 | **SFT dataset** | [DeepMount00/OpenItalianData](https://huggingface.co/datasets/DeepMount00/OpenItalianData) |
 | **Post-training framework** | Axolotl + FSDP |
 | **Chat template** | ChatML |
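The ~400M figure can be cross-checked against the architecture values. The sketch below is a back-of-the-envelope count, not an official one. It assumes the standard Llama layout (bias-free projections, a gated three-matrix MLP, two RMSNorm weights per layer) and takes `head_dim` (64), `intermediate_size` (2560), tied embeddings, and `vocab_size` (128,264) from this repository's `config.json`.

```python
# Rough parameter count for a Llama-style decoder (assumed layout, see the note above).
hidden_size = 960
num_layers = 32
num_heads = 15
num_kv_heads = 5
head_dim = 64             # from config.json; 15 * 64 == 960
intermediate_size = 2560  # from config.json
vocab_size = 128264       # from config.json (embedding rows, tied with the LM head)

# Attention: Q and O are full width; K and V are narrower because of grouped-query attention.
q_proj = hidden_size * num_heads * head_dim
k_proj = hidden_size * num_kv_heads * head_dim
v_proj = hidden_size * num_kv_heads * head_dim
o_proj = num_heads * head_dim * hidden_size
attention = q_proj + k_proj + v_proj + o_proj

# Gated MLP: gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size

# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size

per_layer = attention + mlp + norms
embeddings = vocab_size * hidden_size  # counted once because the weights are tied
total_params = embeddings + num_layers * per_layer + hidden_size  # plus the final norm

bf16_bytes = 2 * total_params
print(f"per layer: {per_layer:,}")
print(f"total:     {total_params:,}")
print(f"BF16 size: {bf16_bytes:,} bytes")
```

Under these assumptions the count comes to roughly 438M parameters, or about 876 MB in BF16, which is in line with the 875,569,960-byte `model.safetensors` pointer in this commit.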
 ---
 ## Training Details
 ### Base Model Pre-training
 `Open-Zagreus-0.4B` is built on `Zagreus-0.4B-ita`, pre-trained on approximately **1 trillion tokens**:
 | Dataset | Description |
 |---|---|
 | [FineWeb (350BT sample)](https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-350BT) | ~350B tokens of English web text |
 | [FineWeb-2 (ita_Latn)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2/viewer/ita_Latn) | Italian web text |
 | [FinePDFs (ita_Latn)](https://huggingface.co/datasets/HuggingFaceFW/finepdfs/viewer/ita_Latn) | Italian PDF documents |
 | [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata) | ~250B tokens of code |
 **Token distribution**: ~400B English + ~400B Italian + ~200B Code  
 **Infrastructure**: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC  
 **Pre-training framework**: [Nanotron (mii-llm fork)](https://github.com/mii-llm/nanotron)
 ### Post-training (SFT)
 Post-training was performed using **Axolotl** with FSDP across 4 nodes (32× A100 GPUs), using the fully public **OpenItalianData** dataset.
 **SFT dataset**: [DeepMount00/OpenItalianData](https://huggingface.co/datasets/DeepMount00/OpenItalianData)  
 **Dataset author**: Michele Montebovi
 **Key hyperparameters:**
 | Hyperparameter | Value |
 |---|---|
 | Optimizer | AdamW (fused) |
 | Learning rate | `1e-3` |
 | LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
 | Epochs | 3 |
 | Micro batch size | 1 |
 | Gradient accumulation steps | 8 |
 | Sequence length | 4096 |
 | Max grad norm | 1.0 |
 | Precision | BF16 + Flash Attention |
 | FSDP strategy | FULL_SHARD |
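Combined with the 32 GPUs used for SFT, these values imply a global batch size. A minimal sketch, assuming every GPU acts as a data-parallel rank (as FSDP `FULL_SHARD` does) and that each sequence is packed to the full 4096 tokens:

```python
# Values from the hyperparameter table and the SFT setup described above.
num_gpus = 32
micro_batch_size = 1
gradient_accumulation_steps = 8
sequence_len = 4096

# Sequences consumed per optimizer step across all ranks.
sequences_per_step = num_gpus * micro_batch_size * gradient_accumulation_steps

# Upper bound on tokens per optimizer step (assumes fully packed sequences).
tokens_per_step = sequences_per_step * sequence_len

print(f"sequences per step: {sequences_per_step}")
print(f"tokens per step:    {tokens_per_step:,}")
```

The token figure is an upper bound: padding and non-assistant turns occupy positions in each sequence but do not contribute to the loss.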
 ### Full Axolotl Configuration
 ```yaml
 base_model: giux78/zagreus-0.4B-ita
 strict: false
 output_dir: ./ale_outputs/opendata-zagreus-sft-final
 seed: 42
 chat_template_jinja: "{%- for message in messages -%}\n    {{- \"<|im_start|>\" + message.role + \"\\n\" + message.content + \"<|im_end|>\" + \"\\n\" -}}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n\t{{- \"<|im_start|>assistant\\n\" -}}\n{%- endif -%}"
 datasets:
  - path: /training/openitaliandata
    type: chat_template
    field_messages: conversation
    roles_to_train: ["assistant"]
    train_on_eos: turn
 dataset_prepared_path: ./ale_outputs/dataset_cache/opendata-zagreus-sft
 sequence_len: 4096
 sample_packing: true
 eval_sample_packing: true
 pad_to_sequence_len: true
 cosine_constant_lr_ratio: 0.8
 cosine_min_lr_ratio: 0.3
 optimizer: adamw_torch_fused
 lr_scheduler: constant
 learning_rate: 1.0e-03
 max_grad_norm: 1.0
 micro_batch_size: 1
 gradient_accumulation_steps: 8
 num_epochs: 3
 bf16: auto
 flash_attention: true
 gradient_checkpointing: true
 logging_steps: 10
 eval_strategy: steps
 eval_steps: 300
 save_strategy: steps
 save_steps: 500
 save_total_limit: 3
 val_set_size: 10000
 fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_state_dict_type: FULL_STATE_DICT
 special_tokens:
  pad_token: <|im_end|>
  eos_token: <|im_end|>
 ```
 ---
 ## Chat Template
 This model uses the **ChatML** format:
 ```
 <|im_start|>system
 Sei un assistente utile.<|im_end|>
 <|im_start|>user
 Ciao! Come posso imparare l'italiano?<|im_end|>
 <|im_start|>assistant
 ```
 Special tokens:
 - `pad_token`: `<|im_end|>`
 - `eos_token`: `<|im_end|>`
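To make the format concrete, the template logic can be mirrored in plain Python. This is only an illustrative sketch: `render_chatml` is a hypothetical helper, not part of the repository, and in practice `tokenizer.apply_chat_template` should be used instead.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Mirror the ChatML template: one <|im_start|>...<|im_end|> block per message."""
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {"role": "user", "content": "Ciao! Come posso imparare l'italiano?"},
]
prompt = render_chatml(messages, add_generation_prompt=True)
print(prompt)
```

Because `<|im_end|>` is also the `eos_token`, generation stops as soon as the model closes its own assistant turn.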
 ---
 ## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
 model_id = "mii-llm/open-zagreus-0.4B"
 # Load the tokenizer and the BF16 weights; device_map="auto" requires the accelerate package.
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     torch_dtype=torch.bfloat16,
     device_map="auto"
 )
 messages = [
     {"role": "system", "content": "Sei un assistente utile e preciso."},
     {"role": "user", "content": "Raccontami qualcosa di interessante sull'Italia."}
 ]
 # Render the conversation with the ChatML template and open an assistant turn.
 input_ids = tokenizer.apply_chat_template(
     messages,
     add_generation_prompt=True,
     return_tensors="pt"
 ).to(model.device)
 # Sample up to 512 new tokens.
 output = model.generate(
     input_ids,
     max_new_tokens=512,
     temperature=0.7,
     do_sample=True
 )
 # Decode only the newly generated tokens, skipping the prompt.
 print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
 ```
 ---
 ## Evaluation
 ### Standard Benchmarks
 #### Evaluation Command
 ```bash
 lm-eval --model hf --model_args pretrained=giux78/Open-Zagreus-0.4B \
  --tasks m_mmlu_it,arc_it,hellaswag_it --device cuda:0 --batch_size 1
 ```
 #### Results
 | Model | MMLU IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | **Average** |
 |---|---|---|---|---|
 | **Open-Zagreus-0.4B** | 0.2530 | 0.3020 | 0.3608 | **0.3053** |
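The **Average** column is consistent with an unweighted mean of the three task scores. A quick check, assuming that is how it was computed:

```python
# Scores copied from the results table.
scores = {
    "m_mmlu_it": 0.2530,
    "arc_it": 0.3020,
    "hellaswag_it": 0.3608,
}

# Unweighted mean across the three benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.4f}")
```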
 ---
 ### Evalita Benchmark
 Evalita is a comprehensive Italian NLP evaluation suite covering a wide range of linguistic tasks. We evaluate Open-Zagreus-0.4B using the [evalita-mp](https://github.com/evalita) tasks and compare it directly against its base model (`Zagreus-0.4B-ita`) to measure the impact of SFT.
 #### Evaluation Command
 ```bash
 lm_eval --model hf \
  --model_args pretrained=giux78/Open-Zagreus-0.4B \
  --tasks evalita-mp \
  --device cuda:0 \
  --batch_size 1
 ```
 #### Results: Open-Zagreus-0.4B vs. Zagreus-0.4B-ita (Base)
 | Task | Metric | Zagreus-0.4B-ita (base) | **Open-Zagreus-0.4B (SFT)** | Δ |
 |---|---|---|---|---|
 | **Overall** | acc | 0.3226 | **0.3313** | **+0.0087** |
 | Admission Test | acc | **0.2137** | 0.2083 | -0.0054 |
 | FAQ | acc | **0.2681** | 0.2672 | -0.0009 |
 | Hate Speech Detection | f1 | **0.6056** | 0.4340 | -0.1716 |
 | Lexical Substitution | f1 | 0.0000 | 0.0000 | = |
 | NER | f1 | **0.1611** | 0.1357 | -0.0254 |
 | Relation Extraction | f1 | **0.1244** | 0.0000 | -0.1244 |
 | Sentiment Analysis | f1 | 0.3660 | **0.3712** | +0.0052 |
 | Summarization (Fanpage) | rouge1 | 0.1947 | **0.2305** | +0.0358 |
 | Text Entailment | acc | 0.5133 | **0.5492** | +0.0359 |
 | Word in Context | f1 | 0.4697 | **0.4880** | +0.0183 |
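The Δ column is the SFT score minus the base score. The sketch below recomputes it from the table and counts how many individual tasks moved in each direction; the metrics differ per task (acc, f1, rouge1), so the deltas are not directly comparable across rows.

```python
# (base, sft) scores copied from the table above.
results = {
    "Overall": (0.3226, 0.3313),
    "Admission Test": (0.2137, 0.2083),
    "FAQ": (0.2681, 0.2672),
    "Hate Speech Detection": (0.6056, 0.4340),
    "Lexical Substitution": (0.0000, 0.0000),
    "NER": (0.1611, 0.1357),
    "Relation Extraction": (0.1244, 0.0000),
    "Sentiment Analysis": (0.3660, 0.3712),
    "Summarization (Fanpage)": (0.1947, 0.2305),
    "Text Entailment": (0.5133, 0.5492),
    "Word in Context": (0.4697, 0.4880),
}

# Delta = SFT - base, rounded to the table's four decimals.
deltas = {task: round(sft - base, 4) for task, (base, sft) in results.items()}

# Classify the individual tasks (the aggregate "Overall" row is excluded).
per_task = {task: delta for task, delta in deltas.items() if task != "Overall"}
improved = [task for task, delta in per_task.items() if delta > 0]
regressed = [task for task, delta in per_task.items() if delta < 0]
unchanged = [task for task, delta in per_task.items() if delta == 0]

for task, delta in deltas.items():
    print(f"{task:<26} {delta:+.4f}")
print(f"improved: {len(improved)}, regressed: {len(regressed)}, unchanged: {len(unchanged)}")
```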
 #### Discussion
 The SFT stage delivers a net **+0.0087 overall improvement** on Evalita. Gains are most significant in generative and semantic tasks:
 - **Summarization** (+0.0358): the model produces more coherent and relevant summaries after instruction tuning
 - **Text Entailment** (+0.0359): improved language understanding and reasoning
 - **Word in Context** (+0.0183): better contextual semantic disambiguation
 - **Sentiment Analysis** (+0.0052): marginal improvement in affective understanding
 Some structured classification and extraction tasks regress after SFT: Hate Speech Detection (-0.1716), Relation Extraction (-0.1244, down to 0.0000), and NER (-0.0254). This is a known effect of general-purpose instruction tuning, which can shift the model away from the specific output format these tasks expect, so the regressions likely reflect a format mismatch rather than degraded general language quality.
 Overall, these results confirm that **a fully open-source pipeline — using only publicly available data and tools — is sufficient to produce a competitive Italian SLM**.
 ---
 ## Reproducibility
 This is the only model in the Nesso/Zagreus family where **every component is fully open and reproducible**:
 | Component | Resource |
 |---|---|
 | Pre-training data | FineWeb, FineWeb-2, FinePDFs, StarCoder (all public) |
 | Pre-training framework | [mii-llm/nanotron](https://github.com/mii-llm/nanotron) |
 | SFT data | [DeepMount00/OpenItalianData](https://huggingface.co/datasets/DeepMount00/OpenItalianData) |
 | SFT framework | [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) |
 | Evaluation | [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
 | Model weights | This repository |
 | Training config | See Axolotl configuration above |
 ---
 ## Related Models
 | Model | Description |
 |---|---|
 | [Zagreus-0.4B-ita](https://huggingface.co/mii-llm/zagreus-0.4B-ita) | Base pre-trained model (this model's foundation) |
 | [Nesso-0.4B-instruct](https://huggingface.co/mii-llm/nesso-0.4B-instruct) | Proprietary SFT — optimized for instruction following |
 | [Nesso-0.4B-agentic](https://huggingface.co/mii-llm/nesso-0.4B-agentic) | Proprietary SFT — optimized for function calling and agentic tasks |
 ---
 ## Citation
 If you use this model in your research, please cite:
 ```bibtex
 @misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
 }
 ```
 ---
 ## Acknowledgements
 - **Antonio Baldassarra** (CEO, Seeweb) and **Marco Cristofanilli** (Head of AI, Seeweb) for infrastructure sponsorship
 - **Michele Montebovi** for publishing the [OpenItalianData](https://huggingface.co/datasets/DeepMount00/OpenItalianData) SFT dataset that makes this model fully reproducible
 - The **Hugging Face** team for Nanotron, datatrove, FineWeb, and FineWeb-2
 - The **mii-llm** open-source community
 ---
 ## License
 Released under the **Apache 2.0** license.
 > Made with ❤️ in Italy by [mii-llm](https://mii-llm.ai)
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,6 @@
 {%- for message in messages -%}
    {{- "<|im_start|>" + message.role + "\n" + message.content + "<|im_end|>" + "\n" -}}
 {%- endfor -%}
 {%- if add_generation_prompt -%}
 	{{- "<|im_start|>assistant\n" -}}
 {%- endif -%}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,30 @@
 {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128256,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 960,
  "initializer_range": 0.02,
  "intermediate_size": 2560,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 15,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "pad_token_id": 128256,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": true,
  "transformers_version": "4.56.2",
  "use_cache": false,
  "vocab_size": 128264
 }
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,11 @@
 {
  "_from_model_config": true,
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128256,
    128001
  ],
  "pad_token_id": 128256,
  "transformers_version": "4.56.2"
 }
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:e8b1657d4822768a9365bad79efb82ff5708ae4edf055805f51ef866b2ed3d84
 size 875569960
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,23 @@
 {
  "bos_token": {
    "content": "<|begin_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
 }
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:3d621ce551ab3c6b3181223d69422d7c9dfbda768bcce27f77a11c63b6c16584
 size 17211433
--- a/tokenizer_config.json
+++ b/tokenizer_config.json