初始化项目,由ModelHub XC社区提供模型
Model: Boldt/Boldt-DC-350M Source: Original Platform
This commit is contained in:
36
.gitattributes
vendored
Normal file
36
.gitattributes
vendored
Normal file
@@ -0,0 +1,36 @@
|
|||||||
|
*.7z filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.arrow filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.bin filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.ftz filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.gz filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.h5 filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.joblib filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.model filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.npy filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.npz filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.onnx filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.ot filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.parquet filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.pb filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.pickle filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.pkl filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.pt filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.pth filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.rar filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
||||||
|
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.tar filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.tflite filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.tgz filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.wasm filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.xz filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.zip filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*.zst filter=lfs diff=lfs merge=lfs -text
|
||||||
|
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
||||||
|
logo.png filter=lfs diff=lfs merge=lfs -text
|
||||||
82
README.md
Normal file
82
README.md
Normal file
@@ -0,0 +1,82 @@
|
|||||||
|
---
|
||||||
|
license: apache-2.0
|
||||||
|
language:
|
||||||
|
- de
|
||||||
|
pipeline_tag: text-generation
|
||||||
|
library_name: transformers
|
||||||
|
---
|
||||||
|
# Boldt-350M
|
||||||
|
|
||||||
|
<img src="logo.png" width="500">
|
||||||
|
|
||||||
|
**Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our inital release includes three models:
|
||||||
|
|
||||||
|
- **Boldt-DC-350M** *(this model)*
|
||||||
|
- [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B)
|
||||||
|
- [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)
|
||||||
|
- [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)
|
||||||
|
|
||||||
|
### Repetition over Diversity
|
||||||
|
The training philosophy behind **Boldt** is centered on a key finding from our research: **repetition over diversity**.
|
||||||
|
|
||||||
|
Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). We isolated this subset using a combination of three hierarchical filters:
|
||||||
|
|
||||||
|
- **Coherence:** Eliminates structurally fragmented or incoherent documents.
|
||||||
|
- **Information Value:** Isolates content-rich and fact-bearing texts.
|
||||||
|
- **Educational Quality:** Selects strictly for pedagogical clarity and deep explanations.
|
||||||
|
|
||||||
|
We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: [*Repetition over Diversity*](https://arxiv.org/abs/2604.28075).
|
||||||
|
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
**Note:** This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
|
||||||
|
|
||||||
|
```python
|
||||||
|
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||||
|
model_name = "Boldt/Boldt-DC-350M"
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
|
model = AutoModelForCausalLM.from_pretrained(model_name)
|
||||||
|
|
||||||
|
# Basic text completion
|
||||||
|
text = "Berlin ist eine Stadt, wo"
|
||||||
|
inputs = tokenizer(text, return_tensors="pt")
|
||||||
|
outputs = model.generate(**inputs, max_new_tokens=64)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Evaluation
|
||||||
|
|
||||||
|
We evaluate Boldt-350M on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). See our paper [(Aynetdinov et al., 2026)](https://arxiv.org/abs/2604.28075) for details on the structural and translation corrections we performed.
|
||||||
|
|
||||||
|
Boldt-350M while significantly smaller than the 1B models we compare with, still fares comparatively well, outperforming the much larger, multilingual Gemma-3-1B and Llama-3.2-1B models.
|
||||||
|
|
||||||
|
|
||||||
|
### Comparison against 1B Reference Models
|
||||||
|
*Note: Bold text indicates the best score in the 1B category.*
|
||||||
|
|
||||||
|
| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
|
||||||
|
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|
||||||
|
| **Boldt-DC-300M (this model)** | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
|
||||||
|
| [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) | 200B | 31.06 | **35.99** | **57.30** | 48.69 | 42.80 | 48.48 | 44.05 |
|
||||||
|
| [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)| 230B | **31.42** | 34.11 | 55.78 | **48.77** | 44.70 | **52.32** | **44.52** |
|
||||||
|
| [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
|
||||||
|
| [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
|
||||||
|
| [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
|
||||||
|
| [Qwen3.5-0.8B-Base](https://huggingface.co/Qwen/Qwen3.5-0.8B-Base) | >36T* | 30.79 | 32.05 | 46.20 | 38.90 | 36.02 | 43.84 | 37.97 |
|
||||||
|
|
||||||
|
## Safety & Ethics
|
||||||
|
|
||||||
|
We have not conducted systematic model evaluations of toxicity, demographic biases, or harmful stereotypes. Quality filtering may reduce some risks relative to unfiltered web data, but cannot guarantee their absence, and repeated exposure during multi-epoch training could amplify rather than mitigate encoded biases. Users should exercise caution in sensitive use-cases without further evaluation.
|
||||||
|
|
||||||
|
## Citation
|
||||||
|
```bibtex
|
||||||
|
@misc{boldt2026,
|
||||||
|
title={Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling},
|
||||||
|
author={Ansar Aynetdinov and Patrick Haller and Alan Akbik},
|
||||||
|
year={2026},
|
||||||
|
eprint={2604.28075},
|
||||||
|
archivePrefix={arXiv},
|
||||||
|
primaryClass={cs.CL},
|
||||||
|
url={https://arxiv.org/abs/2604.28075},
|
||||||
|
}
|
||||||
|
```
|
||||||
30
config.json
Normal file
30
config.json
Normal file
@@ -0,0 +1,30 @@
|
|||||||
|
{
|
||||||
|
"architectures": [
|
||||||
|
"LlamaForCausalLM"
|
||||||
|
],
|
||||||
|
"attention_bias": false,
|
||||||
|
"attention_dropout": 0.0,
|
||||||
|
"bos_token_id": null,
|
||||||
|
"eos_token_id": 0,
|
||||||
|
"head_dim": 64,
|
||||||
|
"hidden_act": "silu",
|
||||||
|
"hidden_size": 1024,
|
||||||
|
"initializer_range": 0.02,
|
||||||
|
"intermediate_size": 4096,
|
||||||
|
"max_position_embeddings": 2048,
|
||||||
|
"mlp_bias": false,
|
||||||
|
"model_type": "llama",
|
||||||
|
"num_attention_heads": 16,
|
||||||
|
"num_hidden_layers": 24,
|
||||||
|
"num_key_value_heads": 16,
|
||||||
|
"pad_token_id": 1,
|
||||||
|
"pretraining_tp": 1,
|
||||||
|
"rms_norm_eps": 1e-05,
|
||||||
|
"rope_scaling": null,
|
||||||
|
"rope_theta": 10000.0,
|
||||||
|
"tie_word_embeddings": false,
|
||||||
|
"torch_dtype": "bfloat16",
|
||||||
|
"transformers_version": "4.55.3",
|
||||||
|
"use_cache": true,
|
||||||
|
"vocab_size": 32000
|
||||||
|
}
|
||||||
6
generation_config.json
Normal file
6
generation_config.json
Normal file
@@ -0,0 +1,6 @@
|
|||||||
|
{
|
||||||
|
"_from_model_config": true,
|
||||||
|
"eos_token_id": 0,
|
||||||
|
"pad_token_id": 1,
|
||||||
|
"transformers_version": "4.55.3"
|
||||||
|
}
|
||||||
3
logo.png
Normal file
3
logo.png
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
version https://git-lfs.github.com/spec/v1
|
||||||
|
oid sha256:4f407443899af03737c73da20474835b840cedb216ce1b491901b598cc2c8d85
|
||||||
|
size 517295
|
||||||
31743
merges.txt
Normal file
31743
merges.txt
Normal file
File diff suppressed because it is too large
Load Diff
3
model.safetensors
Normal file
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
version https://git-lfs.github.com/spec/v1
|
||||||
|
oid sha256:ba2964d0d2d3a21b56a6d1f1a33a332350c49023efb834c1b38a676a19edf6cc
|
||||||
|
size 936503664
|
||||||
30
special_tokens_map.json
Normal file
30
special_tokens_map.json
Normal file
@@ -0,0 +1,30 @@
|
|||||||
|
{
|
||||||
|
"bos_token": {
|
||||||
|
"content": "<|endoftext|>",
|
||||||
|
"lstrip": false,
|
||||||
|
"normalized": true,
|
||||||
|
"rstrip": false,
|
||||||
|
"single_word": false
|
||||||
|
},
|
||||||
|
"eos_token": {
|
||||||
|
"content": "<|endoftext|>",
|
||||||
|
"lstrip": false,
|
||||||
|
"normalized": true,
|
||||||
|
"rstrip": false,
|
||||||
|
"single_word": false
|
||||||
|
},
|
||||||
|
"pad_token": {
|
||||||
|
"content": "<|pad|>",
|
||||||
|
"lstrip": false,
|
||||||
|
"normalized": false,
|
||||||
|
"rstrip": false,
|
||||||
|
"single_word": false
|
||||||
|
},
|
||||||
|
"unk_token": {
|
||||||
|
"content": "<|endoftext|>",
|
||||||
|
"lstrip": false,
|
||||||
|
"normalized": true,
|
||||||
|
"rstrip": false,
|
||||||
|
"single_word": false
|
||||||
|
}
|
||||||
|
}
|
||||||
159026
tokenizer.json
Normal file
159026
tokenizer.json
Normal file
File diff suppressed because it is too large
Load Diff
29
tokenizer_config.json
Normal file
29
tokenizer_config.json
Normal file
@@ -0,0 +1,29 @@
|
|||||||
|
{
|
||||||
|
"add_prefix_space": false,
|
||||||
|
"added_tokens_decoder": {
|
||||||
|
"0": {
|
||||||
|
"content": "<|endoftext|>",
|
||||||
|
"lstrip": false,
|
||||||
|
"normalized": true,
|
||||||
|
"rstrip": false,
|
||||||
|
"single_word": false,
|
||||||
|
"special": true
|
||||||
|
},
|
||||||
|
"1": {
|
||||||
|
"content": "<|pad|>",
|
||||||
|
"lstrip": false,
|
||||||
|
"normalized": false,
|
||||||
|
"rstrip": false,
|
||||||
|
"single_word": false,
|
||||||
|
"special": true
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"bos_token": "<|endoftext|>",
|
||||||
|
"clean_up_tokenization_spaces": false,
|
||||||
|
"eos_token": "<|endoftext|>",
|
||||||
|
"extra_special_tokens": {},
|
||||||
|
"model_max_length": 2048,
|
||||||
|
"pad_token": "<|pad|>",
|
||||||
|
"tokenizer_class": "GPT2Tokenizer",
|
||||||
|
"unk_token": "<|endoftext|>"
|
||||||
|
}
|
||||||
1
vocab.json
Normal file
1
vocab.json
Normal file
File diff suppressed because one or more lines are too long
Reference in New Issue
Block a user