Initialize project; model provided by the ModelHub XC community.
Model: VillanovaAI/Villanova-2B-Base-2603 (source: original platform)
.gitattributes (new file, vendored, 36 lines)
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md (new file, 195 lines)
@@ -0,0 +1,195 @@
---
license: apache-2.0
language:
- en
- de
- es
- fr
- it
pipeline_tag: text-generation
library_name: transformers
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finepdfs
- HuggingFaceFW/finepdfs-edu
- epfml/FineWeb2-HQ
- HuggingFaceTB/finemath
---

# Model Card for Villanova-2B-Base-2603

<img src="https://huggingface.co/spaces/VillanovaAI/README/resolve/main/Logo_VILLANOVA_colore.svg" alt="Villanova.AI logo" height="96"/>

Villanova is a family of fully open, multilingual Large Language Models (LLMs) targeting the five major European languages. All model weights, training data sources, and training details are publicly released.

> [!WARNING]
> **DISCLAIMER:** This is a base model, not instruction-tuned. It is intended as a foundation for downstream fine-tuning and alignment.

---

## Model Family

**[Villanova-2B-Base-2603](https://huggingface.co/VillanovaAI/Villanova-2B-Base-2603)** — Base model (4.4T tokens) — 📍 *This model*<br>
 ↳ **[Villanova-2B-2603](https://huggingface.co/VillanovaAI/Villanova-2B-2603)** — SFT / Instruct<br>
  ↳ [Villanova-2B-2603-GGUF](https://huggingface.co/VillanovaAI/Villanova-2B-2603-GGUF) — Quantized<br>
 ↳ **[Villanova-2B-VL-2603](https://huggingface.co/VillanovaAI/Villanova-2B-VL-2603)** — Vision-Language Instruct<br>
  ↳ [Villanova-2B-VL-2603-GGUF](https://huggingface.co/VillanovaAI/Villanova-2B-VL-2603-GGUF) — Quantized<br>
<br>
**[Villanova-2B-Base-2512-Preview](https://huggingface.co/VillanovaAI/Villanova-2B-Base-2512-Preview)** — Base model (2.2T tokens; previous version, not recommended)<br>
 ↳ [Villanova-2B-2512-Preview](https://huggingface.co/VillanovaAI/Villanova-2B-2512-Preview) — SFT / Instruct (previous version, not recommended)<br>

---

## Model Summary

Villanova-2B-Base-2603 is a decoder-only transformer with **2 billion parameters**, pre-trained from scratch on **4.4 trillion tokens** from a curated multilingual corpus. It supports sequences of up to **32,768 tokens**. It is large enough to capture rich linguistic and factual knowledge, yet compact enough for fine-tuning and deployment in resource-constrained environments.

**Primary languages: English, Italian, Spanish, French, German.**
The model has partial support for additional languages and code, but performance outside the five primary languages is not guaranteed.

The Villanova project is committed to full openness and data transparency. Training data sources, mixture details, architectural choices, and hyperparameters are all publicly documented. Data was selected with ethical sourcing as a guiding principle, prioritising high-quality, permissively licensed corpora.
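The 32,768-token window has a concrete serving cost. As a rough sketch using the attention shape from this repository's config.json (18 layers, 4 KV heads, head dim 128, bf16), the standard grouped-query-attention KV-cache estimate works out to about 1.1 GiB per full-length sequence; this is back-of-the-envelope accounting, not a published figure:

```python
# Rough KV-cache size estimate for a full 32k-token sequence.
# Shape values are taken from this repository's config.json; the formula
# is the generic GQA estimate, not an official number.
num_layers = 18
num_kv_heads = 4
head_dim = 128
seq_len = 32_768
bytes_per_value = 2  # bfloat16

# Keys and values are each [num_kv_heads, seq_len, head_dim] per layer,
# hence the leading factor of 2.
kv_cache_bytes = 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_value
print(f"{kv_cache_bytes / 2**30:.3f} GiB per sequence")  # → 1.125 GiB per sequence
```

The small KV-head count (4 vs. 20 query heads) is what keeps this figure modest at long context.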
---

## Pre-training

Training followed a two-stage recipe:

**Stage 1 (0 → 4.0T tokens)** — Broad multilingual data mixture covering the five core languages, plus code, mathematics, and scientific text.

**Stage 2 (4.0T → 4.4T tokens)** — Cosine annealing over ~400B tokens of higher-quality, curated data.

[Villanova-2B-Base-2512-Preview](https://huggingface.co/VillanovaAI/Villanova-2B-Base-2512-Preview) is an intermediate checkpoint of this same training run, released at the 2.2T-token mark with an early decay stage applied from 2.0T tokens onward.

Key training settings: AdamW optimizer (β₁=0.9, β₂=0.95, weight decay=0.1), peak learning rate 3×10⁻⁴, BF16/FP8 mixed precision, Flash Attention, sequences of 4,096 tokens. Training ran on 64 NVIDIA H100 GPUs (~30 days, ~36k tokens/GPU/second).
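The cosine-annealed decay in Stage 2 can be sketched as a standard cosine schedule. Only the 3×10⁻⁴ peak learning rate is documented for this run; the `warmup_steps` and `min_lr` knobs below are illustrative assumptions, not published values:

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=0.0, warmup_steps=0):
    """Cosine learning-rate decay from peak_lr down to min_lr.

    peak_lr matches the documented 3e-4; warmup_steps and min_lr are
    illustrative assumptions, not values published for this run.
    """
    if step < warmup_steps:
        # Linear warmup up to the peak rate.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Starts at the peak, passes through half the peak at the midpoint,
# and reaches min_lr at the end of the decay.
print(cosine_lr(0, 1000))     # → 0.0003
print(cosine_lr(500, 1000))   # ≈ 0.00015
print(cosine_lr(1000, 1000))  # → 0.0
```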
---

## How to Use

This is a **base model**: it continues text rather than following instructions. For chat or task use, see [Villanova-2B-2603](https://huggingface.co/VillanovaAI/Villanova-2B-2603).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "VillanovaAI/Villanova-2B-Base-2603"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = "Gravity is a fundamental force of nature that"
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
---

## Evaluation

<img src="https://cdn-uploads.huggingface.co/production/uploads/6426a5c798a5be164d38ae44/hxmN7V4mSjrr8i8Dsc9v8.png" alt="Model size/performance" width="672"/>

**Global evaluation:**

| **Model** | **Avg** | **arc_easy** | **hellaswag** | **hellaswag_de** | **hellaswag_es** | **hellaswag_fr** | **hellaswag_it** | **openbookqa** | **piqa** | **sciq** | **winogrande** | **xcopa_it** | **xnli_de** | **xnli_en** | **xnli_es** | **xnli_fr** | **xquad_de** | **xquad_en** | **xquad_es** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 48.72 | 69.07 | 45.04 | 37.97 | 40.98 | 40.05 | 39.46 | 29.80 | 72.20 | 90.60 | 61.25 | 66.00 | 47.99 | 50.24 | 45.58 | 49.00 | 27.50 | 34.60 | 29.65 |
| Llama-3.2-1B | 46.13 | 66.29 | 48.16 | 34.11 | 37.41 | 35.48 | 34.91 | 27.80 | 75.14 | 93.50 | 60.69 | 59.40 | 46.02 | **54.82** | 41.37 | 46.95 | 16.37 | 37.18 | 14.84 |
| Minerva-3B-base-v1.0 | 40.73 | 62.33 | 46.28 | 27.20 | 29.69 | 29.02 | 40.01 | 24.60 | 74.27 | 88.00 | 56.75 | **69.60** | 34.54 | 52.13 | 36.31 | 37.35 | 4.31 | 14.21 | 6.52 |
| OLMo-2-0425-1B | 47.70 | 72.73 | **50.79** | 29.79 | 31.34 | 32.60 | 29.19 | 30.00 | **75.95** | 95.30 | **64.72** | 52.60 | 40.00 | 51.77 | 37.63 | 42.89 | 20.34 | 68.25 | 32.74 |
| Qwen3-1.7-Base | 53.29 | 73.61 | 49.29 | 37.54 | 40.73 | 39.27 | 38.45 | **30.20** | 75.90 | **95.80** | 64.01 | 64.20 | 46.47 | 54.50 | 44.06 | 45.78 | 39.59 | 69.60 | **50.21** |
| salamandra-2b | 50.58 | 71.04 | 47.19 | 38.01 | 42.07 | 40.60 | 38.56 | 26.80 | 72.69 | 91.90 | 61.72 | 65.40 | 47.79 | 51.97 | **49.08** | 48.67 | 41.73 | 41.55 | 33.72 |
| Villanova-2B-Base-2512-Preview | 54.26 | **75.13** | 48.57 | 42.06 | 45.72 | 44.62 | 43.32 | 26.60 | 75.08 | 94.40 | 61.96 | 68.40 | 49.36 | 52.21 | 49.04 | **52.33** | 41.28 | 66.66 | 40.03 |
| **Villanova-2B-Base-2603** | **54.91** | 73.74 | 49.53 | **42.91** | **46.81** | **45.49** | **44.21** | 25.20 | 74.32 | 94.10 | 59.04 | 68.80 | **49.48** | 54.30 | 49.00 | 50.72 | **44.94** | **72.52** | 43.37 |

**English only:**

| **Model** | **Avg** | **arc_easy** | **hellaswag** | **openbookqa** | **piqa** | **sciq** | **winogrande** | **xnli_en** | **xquad_en** |
|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 56.60 | 69.07 | 45.04 | 29.80 | 72.20 | 90.60 | 61.25 | 50.24 | 34.60 |
| Llama-3.2-1B | 57.95 | 66.29 | 48.16 | 27.80 | 75.14 | 93.50 | 60.69 | **54.82** | 37.18 |
| Minerva-3B-base-v1.0 | 52.32 | 62.33 | 46.28 | 24.60 | 74.27 | 88.00 | 56.75 | 52.13 | 14.21 |
| OLMo-2-0425-1B | 63.69 | 72.73 | **50.79** | 30.00 | **75.95** | 95.30 | **64.72** | 51.77 | 68.25 |
| Qwen3-1.7-Base | **64.11** | 73.61 | 49.29 | **30.20** | 75.90 | **95.80** | 64.01 | 54.50 | 69.60 |
| salamandra-2b | 58.11 | 71.04 | 47.19 | 26.80 | 72.69 | 91.90 | 61.72 | 51.97 | 41.55 |
| Villanova-2B-Base-2512-Preview | 62.58 | **75.13** | 48.57 | 26.60 | 75.08 | 94.40 | 61.96 | 52.21 | 66.66 |
| **Villanova-2B-Base-2603** | 62.84 | 73.74 | 49.53 | 25.20 | 74.32 | 94.10 | 59.04 | 54.30 | **72.52** |

**Multilingual benchmarks:**

| **Model** | **Avg** | **hellaswag_de** | **hellaswag_es** | **hellaswag_fr** | **hellaswag_it** | **xcopa_it** | **xnli_de** | **xnli_es** | **xnli_fr** | **xquad_de** | **xquad_es** |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 42.42 | 37.97 | 40.98 | 40.05 | 39.46 | 66.00 | 47.99 | 45.58 | 49.00 | 27.50 | 29.65 |
| Llama-3.2-1B | 36.69 | 34.11 | 37.41 | 35.48 | 34.91 | 59.40 | 46.02 | 41.37 | 46.95 | 16.37 | 14.84 |
| Minerva-3B-base-v1.0 | 31.45 | 27.20 | 29.69 | 29.02 | 40.01 | **69.60** | 34.54 | 36.31 | 37.35 | 4.31 | 6.52 |
| OLMo-2-0425-1B | 34.91 | 29.79 | 31.34 | 32.60 | 29.19 | 52.60 | 40.00 | 37.63 | 42.89 | 20.34 | 32.74 |
| Qwen3-1.7-Base | 44.63 | 37.54 | 40.73 | 39.27 | 38.45 | 64.20 | 46.47 | 44.06 | 45.78 | 39.59 | **50.21** |
| salamandra-2b | 44.56 | 38.01 | 42.07 | 40.60 | 38.56 | 65.40 | 47.79 | **49.08** | 48.67 | 41.73 | 33.72 |
| Villanova-2B-Base-2512-Preview | 47.61 | 42.06 | 45.72 | 44.62 | 43.32 | 68.40 | 49.36 | 49.04 | **52.33** | 41.28 | 40.03 |
| **Villanova-2B-Base-2603** | **48.57** | **42.91** | **46.81** | **45.49** | **44.21** | 68.80 | **49.48** | 49.00 | 50.72 | **44.94** | 43.37 |

**Long context (RULER):**

*Note: Tests were run forcing the context length to 32k, going beyond the default length for models with a native context lower than this threshold.*

| **Model** | **Native Context** | **Avg (32k)** |
|---|---|---|
| Qwen3-1.7B-Base | 32k | **0.73** |
| **Villanova-2B-Base-2603** | 32k | 0.49 |
| gemma-3-1b-pt | 32k | 0.28 |
| salamandra-2b | 8k | 0.12 |
| EuroLLM-1.7B | 4k | 0.08 |
| OLMo-2-0425-1B | 4k | 0.00 |
| Villanova-2B-Base-2512-Preview | 4k | 0.00 |
| Minerva-3B-base-v1.0 | 16k | 0.00 |

---
## Training Data

The model's training pipeline comprises an initial pre-training stage focused on broad linguistic and factual coverage, an annealing (decay) stage designed to consolidate knowledge and improve reasoning capabilities, and a final long-context extension stage.

### Stage 1: Pre-training

The first stage was trained on approximately **3.6 trillion tokens** (occupying ~15 TB of disk space). The distribution prioritizes the five core languages while maintaining a global language coverage baseline. The mixture consists of approximately 37.5% English, large allocations for the target Latin-script languages (German, Spanish, French, Italian), 5% code, 2% secondary Latin-script languages, and 6% broader global languages.

The primary datasets utilized in this stage include:

* **Web Corpora:** FineWeb-2, FineWeb-Edu, and FineWeb2-HQ provide a massive multilingual foundation.
* **Encyclopedic & Academic:** FineWiki, alongside academic papers from Arxiv and PubMed (Common Pile).
* **Structured Text:** FinePDFs supplies high-quality text extracted from structured documents.
* **Quantitative & Technical:** FineMath and Stack-Edu establish foundational mathematical reasoning and coding proficiency.

### Stage 2: Annealing (Decay)

During the final decay stage on **400 billion tokens**, general web data was partially replaced with a highly curated set of academic, structured, and instructional corpora to improve reasoning as the parameters crystallize.

High-quality sources introduced in the annealing stage include:

* **Common-Pile StackExchange:** Q&A threads focusing on technical and scientific domains.
* **GitHub Issues & Kaggle Notebooks:** A curated concatenation of ~11 billion tokens of repository discussions and ~1.7 billion tokens of analytical notebooks to improve technical problem-solving.
* **FLAN Dolma-Mix Subset:** Instruction-formatted text extracted from the Dolma 1.7 dataset, carefully curated to avoid evaluation-suite contamination.
* **Advanced Mathematics:** InfiWebMath and FineMath corpora.

### Stage 3: Long Context Extension

A final training stage extended the model's effective context window, processing an additional 50 billion tokens. The data distribution resembles the annealing mixture but employs a shifted sampling strategy that strictly prioritizes long-form documents. This ensures the model can process and retrieve information across extended sequences while preserving the reasoning ability and knowledge density established during the decay stage.

---

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
config.json (new file, 29 lines)
@@ -0,0 +1,29 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.014,
  "intermediate_size": 10240,
  "max_position_embeddings": 32768,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 20,
  "num_hidden_layers": 18,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.3",
  "use_cache": true,
  "vocab_size": 256000
}
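As a sanity check, the parameter count implied by this config can be worked out directly. A minimal sketch using standard Llama accounting (tied embeddings, grouped-query attention, SwiGLU MLP); this is back-of-the-envelope arithmetic, not an official breakdown:

```python
# Parameter-count estimate from config.json values (Llama architecture).
# Back-of-the-envelope accounting, not an official figure.
hidden = 2560        # hidden_size
layers = 18          # num_hidden_layers
heads = 20           # num_attention_heads
kv_heads = 4         # num_key_value_heads
head_dim = 128
intermediate = 10240 # intermediate_size
vocab = 256000       # vocab_size

embed = vocab * hidden                    # shared with LM head (tie_word_embeddings)
attn = hidden * heads * head_dim          # Q projection
attn += 2 * hidden * kv_heads * head_dim  # K and V (grouped-query attention)
attn += heads * head_dim * hidden         # output projection
mlp = 3 * hidden * intermediate           # gate, up, and down projections (SwiGLU)
norms = 2 * hidden                        # two RMSNorm weight vectors per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
print(f"{total / 1e9:.2f}B parameters")  # → 2.35B parameters
```

At bfloat16 (2 bytes per parameter) this comes to ~4.71 GB, consistent with the 4,708,314,704-byte model.safetensors file below, with the small remainder attributable to the safetensors header.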
generation_config.json (new file, 6 lines)
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "transformers_version": "4.53.3"
}
model.safetensors (new file, LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:83e892693f45e4d327c5dca9d6607d1b16e96ceb80e195b7f216027afb122a64
size 4708314704
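The model.safetensors entry above is a Git LFS pointer file, not the weights themselves: three `key value` lines giving the spec version, the SHA-256 of the real blob, and its size in bytes. A minimal sketch of reading one (the `parse_lfs_pointer` helper is ours, not part of any library):

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into a dict of its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The pointer stored in this repository for model.safetensors.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:83e892693f45e4d327c5dca9d6607d1b16e96ceb80e195b7f216027afb122a64
size 4708314704
"""
info = parse_lfs_pointer(pointer)
print(info["size"])      # → 4708314704
print(info["oid"][:13])  # → sha256:83e892
```

This is why cloning the repo without `git lfs install` yields ~100-byte stub files in place of the multi-gigabyte weights.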
special_tokens_map.json (new file, 37 lines)
@@ -0,0 +1,37 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json (new file, LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2e90b85b3e3b3ebfc6b9bafeb954b37f2435eed595738337e53f2a746d23d5a2
size 37007416
tokenizer.model (new file, LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ab94ddf46d14f0279254858d53770c5319c5129d47291ee2bada530271cb1292
size 4813276
tokenizer_config.json (new file, 1102 lines)
File diff suppressed because it is too large.