Initialize project; model provided by the ModelHub XC community

Model: tabularisai/Faust-1
Source: Original Platform
ModelHub XC
2026-05-27 05:22:17 +08:00
commit 4696ec8d99
14 changed files with 490660 additions and 0 deletions

41
.gitattributes vendored Normal file

@@ -0,0 +1,41 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
faust-1-dpo-golden-v1-1601-q8_0.gguf filter=lfs diff=lfs merge=lfs -text
faust_bench.png filter=lfs diff=lfs merge=lfs -text
logo-faust.webp filter=lfs diff=lfs merge=lfs -text
faust-1-q8_0.gguf filter=lfs diff=lfs merge=lfs -text
faust_1_q8_0.gguf filter=lfs diff=lfs merge=lfs -text
tokenizer_faust.png filter=lfs diff=lfs merge=lfs -text

395
README.md Normal file

@@ -0,0 +1,395 @@
---
library_name: transformers
license_link: https://huggingface.co/Qwen/Qwen3-1.7B/blob/main/LICENSE
pipeline_tag: text-generation
license: cc-by-nc-4.0
extra_gated_prompt: >
  ### FAUST-1 NON-COMMERCIAL LICENSE AGREEMENT
  Version 1.0 — January 2025
  "Faust-1" refers to the language model weights, code, and documentation made
  available by Tabularis AI GmbH ("Tabularis") under this agreement.
  1. License Grant
  You are granted a non-exclusive, non-transferable, royalty-free license to
  use, copy, and modify Faust-1 for non-commercial research and personal
  purposes only.
  2. Non-Commercial Use
  "Non-commercial" means academic research, personal projects, and educational
  use. Any use intended to generate revenue, provide commercial services, or
  benefit a for-profit entity requires a separate commercial license.
  3. Commercial Licensing
  For commercial use, please contact: info@tabularis.ai
  4. Attribution
  You must include "Built with Faust-1 by Tabularis AI" in any derivative work
  or publication.
  5. No Warranty
  Faust-1 is provided "as is" without warranties of any kind.
  6. Termination
  This license terminates automatically if you violate any terms.
  ---
  ### Additional Access Requirement
  Access to this repository is approval-based.
  You must join our Discord server: https://discord.gg/7WqEKw652R
extra_gated_fields:
  Name: text
  Email: text
  Affiliation: text
  I have joined the Tabularis AI Discord server: checkbox
  I accept the Faust-1 Non-Commercial License Agreement: checkbox
extra_gated_description: |
  Faust-1 is for non-commercial use only.
  For commercial licensing contact info@tabularis.ai
  Approval requires Discord membership.
  Join: https://discord.gg/7WqEKw652R
extra_gated_button_content: Submit
language:
- de
- en
tags:
- llama.cpp
- synthetic data
---
<!-- <a href="https://faust.tabularis.ai/" target="_blank" style="margin: 2px;">
<img
alt="Faust-1 Demo"
src="https://img.shields.io/badge/%E2%9C%A8%20Faust--1%20Demo-2b2b2b?style=flat&logo=ai&logoColor=white"
style="display: inline-block; vertical-align: middle;"
/>
</a> -->
<p align="center">
<img src="./logo-faust.webp" alt="Faust-1 Logo" width="220">
</p>
# Faust-1 — German-First Large Language Model (1.6B)
Faust-1 is a German-first large language model with 1.6B parameters, trained entirely from scratch. Model development comprises large-scale data collection and synthetic data generation, followed by data cleaning, normalization, and deduplication to reduce contamination and redundancy. Pre-training is performed on a predominantly German corpus using a decoder-only language modeling objective, resulting in a foundation model for the German language that captures lexical, syntactic, and semantic regularities at scale.
Following pre-training, the model undergoes supervised post-training (instruction tuning) using labeled input-output pairs to adapt the base model for conversational and task-oriented use. In later stages, preference-based optimization, including Direct Preference Optimization (DPO), is applied to improve response quality, stability, and alignment with human expectations, while preserving the efficiency constraints required for small-scale and local deployment.
<!-- Demo: [faust.tabularis.ai](https://faust.tabularis.ai)
-->
> [!TIP]
> **Designed for local and cost-efficient deployment.**
> Faust-1 is deliberately sized and optimized to run on **consumer-grade hardware** and **does not require expensive data-center GPUs**.
---
## Model summary
- Repository: tabularisai/Faust-1
- Model type: decoder-only causal language model
- Parameters: 1.6B
- Interface: conversational / instruction (chat template provided)
- Primary language: German (~90%)
- Tokenizer: custom tokenizer optimized for German morphology and compounding
---
## Quickstart
### Conversational usage (recommended)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "tabularisai/Faust-1"

# Load the tokenizer and model; device_map="auto" places weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Gib mir eine kurze Einführung in große Sprachmodelle (LLM)."}
]

# Render the conversation with the bundled chat template and tokenize it.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.6,
    do_sample=True,
)

# The decoded text contains the prompt followed by the model's reply.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Conditional Generation
```python
# Install guidegen first (in a notebook, prefix the command with "!"):
#   pip install git+https://github.com/tabularis-ai/guidegen.git
import json
import time
from typing import Literal

import guidegen as gg
from pydantic import BaseModel, Field

# Hugging Face access token - set via environment variable or .env file
# You can set it with: export HUGGINGFACE_HUB_TOKEN=your_token_here
# Or create a .env file with: HUGGINGFACE_HUB_TOKEN=your_token_here
MODEL_NAME = "tabularisai/Faust-1"


# --- Schema ---
class EmailSummary(BaseModel):
    """Structured summary of an email."""

    Absender: str = Field(description="Der Name des Absenders.")
    Betreff: str = Field(description="Worum geht es in der E-Mail? (max 5 Wörter)")
    Zusammenfassung: str = Field(description="Kurze Zusammenfassung (max 2 Sätze).")
    Prioritaet: Literal["hoch", "mittel", "niedrig"] = Field(description="Wie wichtig die E-Mail ist.")
    # AntwortNoetig: bool = Field(description="Muss man auf die E-Mail antworten?")


# --- Input ---
email_text = """Hallo Jens,
wir hatten uns bei CampusFounders im Rahmen unserer Pre-Seed-Runde kennengelernt.
Seitdem haben wir große Fortschritte gemacht und bereiten aktuell unsere Seed-Runde vor.
Wir entwickeln eine Infrastruktur für hocheffiziente, lokal trainierbare KI-Modelle vollständig ohne Cloud.
Sehr gern würden wir uns mit dir austauschen und prüfen, ob ein Intro zu US-VCs oder ein Gespräch mit Crestlight möglich wäre.
Anbei ein kurzer OnePager zur Weiterleitung.
Beste Grüße
Ricard"""

# --- Prompt ---
# Doubled braces ({{ and }}) are literal braces inside the f-string.
prompt = f"""
Du bist ein intelligenter Assistent, der E-Mails analysiert und als JSON zusammenfasst.
Halte die Zusammenfassung kurz (1-2 Sätze). Betreff maximal 5 Wörter.
--- Beispiel ---
E-Mail-Text:
Sehr geehrte Damen und Herren, ich wollte nur nachfragen, ob meine Bestellung #12345 schon versandt wurde. Vielen Dank, Max Mustermann
JSON-Antwort:
{{
  "Absender": "Max Mustermann",
  "Betreff": "Bestellstatus Anfrage",
  "Zusammenfassung": "Anfrage zum Versandstatus der Bestellung #12345.",
  "Prioritaet": "mittel"
}}
--- Ende Beispiel ---
Jetzt analysiere die folgende E-Mail und erstelle das JSON-Objekt.
E-Mail-Text:
{email_text}
"""


def main():
    print("=" * 60)
    print("EMAIL SUMMARIZATION WITH GUIDEGEN")
    print("=" * 60)

    print(f"\nLoading model: {MODEL_NAME}")
    load_start = time.time()
    gen = gg.GuideGen(
        MODEL_NAME,
        verbose=True,
        use_chat_template=True,
        enable_thinking=False,
    )
    load_time = time.time() - load_start
    print(f"Model loaded in {load_time:.2f}s")

    # --- Generate ---
    print("\nGenerating structured summary...")
    gen_start = time.time()
    options = gg.GuideGenOptions(
        temperature=0.6,
        max_tokens=400,
        do_sample=False,
    )
    summary = gen.generate(prompt, EmailSummary, options=options)
    gen_time = time.time() - gen_start
    print(f"Generation complete in {gen_time:.2f}s")

    # --- Output ---
    print("\n--- Email Summary (JSON) ---")
    print(json.dumps(summary.model_dump(), indent=2, ensure_ascii=False))
    print(f"\n Model load: {load_time:.2f}s | Generation: {gen_time:.2f}s | Total: {load_time + gen_time:.2f}s")


if __name__ == "__main__":
    main()
```
---
## Training focus
### German-first data distribution
Faust-1 is trained from scratch with a German-dominant corpus. German syntax, compounding, morphology, and typical reasoning patterns are treated as the default operating regime rather than an edge case.
### Verified synthetic data
A substantial portion of the training signal comes from synthetic data. To keep this signal usable, generation is paired with explicit verification and filtering:
- LLM-as-judge style evaluations
- rule-based and programmatic checks
- consistency and self-agreement filtering
This allows broad coverage of instruction-following and reasoning patterns while maintaining quality control.
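As an illustration of the last bullet, self-agreement filtering can be sketched as a majority vote over several sampled answers to the same prompt. This is a minimal sketch under stated assumptions, not the pipeline used for Faust-1: the function name `self_agreement_filter`, the light normalization, and the `min_agreement` threshold of 0.6 are all hypothetical.

```python
from collections import Counter


def self_agreement_filter(samples, min_agreement=0.6):
    """Return (answer, share) if the most common answer reaches min_agreement, else None.

    samples: answers obtained by sampling the same prompt several times.
    min_agreement: hypothetical threshold, chosen here only for illustration.
    """
    if not samples:
        return None
    # Normalize lightly so trivial formatting differences do not count as disagreement.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    share = count / len(normalized)
    return (answer, share) if share >= min_agreement else None


# Three of four samples agree, so the example is kept.
kept = self_agreement_filter(["42", "42 ", "41", "42"])
# No answer dominates, so the example is dropped.
dropped = self_agreement_filter(["a", "b", "c"])
```

In practice such a filter would be combined with the judge-based and rule-based checks listed above rather than used alone.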
---
## Tokenizer optimized for German
Faust-1 uses a custom tokenizer optimized for German morphology and compounding. Token efficiency is treated as a deployment constraint, not just a preprocessing detail.
![Tokenizer efficiency on German language](tokenizer_bench.png)
Lower token counts on German text translate directly into more usable context, lower inference cost, and less fragmentation on compound-heavy inputs.
<img src="tokenizer_faust.png" alt="Faust-1 vs OpenAI Tokenizers" width="800">
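The link between token efficiency and usable context can be made concrete with a small helper. The sketch below measures tokens per whitespace-separated word for any `encode` callable and converts a context budget into an approximate word budget. The helper names and the toy character-chunk encoder are hypothetical and stand in for a real tokenizer; with `transformers`, one could pass `tokenizer.encode` instead.

```python
def tokens_per_word(encode, text):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(encode(text)) / len(words)


def usable_words(context_tokens, ratio):
    """Approximate number of words that fit into a context of context_tokens."""
    return int(context_tokens / ratio)


def toy_encode(text, chunk=4):
    """Toy stand-in for a tokenizer: splits each word into chunks of `chunk` characters."""
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]


text = "Donaudampfschifffahrtsgesellschaft ist ein langes Wort"
ratio = tokens_per_word(toy_encode, text)
print(f"tokens per word: {ratio:.2f}")
print(f"words in an 8192-token context: {usable_words(8192, ratio)}")
```

A tokenizer that fragments compounds less yields a lower ratio, and the same context window then holds proportionally more text.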
---
## German benchmark performance
Faust-1 is evaluated on a set of standard German-language benchmarks:
- ARC_de
- GSM8K_de
- HellaSwag_de
- MMLU_de
- TruthfulQA_de
![German benchmark performance](faust_bench.png)
The target is best-in-class performance within the 1-2B parameter range for German-focused models, using benchmarks that are easy to reproduce in Hugging Face-based evaluation pipelines.
---
## Deployment examples
Faust-1 can be deployed with common inference stacks that support decoder-only language models.
vLLM (OpenAI-compatible API)
```sh
vllm serve tabularisai/Faust-1 --dtype float16
```
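Once the vLLM server is running, it can be queried through its OpenAI-compatible chat completions endpoint. The following is a minimal sketch using only the Python standard library. It assumes the server listens on vLLM's default `localhost:8000`; the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port


def build_chat_request(prompt, max_tokens=256, temperature=0.6):
    """Build a POST request for the OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": "tabularisai/Faust-1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Erkläre kurz, was ein großes Sprachmodell ist."))` prints the reply.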
SGLang
```sh
python -m sglang.launch_server \
--model-path tabularisai/Faust-1 \
--dtype float16
```
llama.cpp (GGUF, local / on-device)
```sh
./llama-cli \
-m faust_1_q8_0.gguf \
-p "Erkläre kurz, was ein großes Sprachmodell ist."
```
The repository includes a prebuilt Q8_0 GGUF file for efficient local inference.
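For a rough sense of the memory footprint, the weight size can be estimated from the parameter count and the bits stored per weight. The sketch below assumes the 1.6B figure stated above and the Q8_0 block layout of 32 int8 weights plus one 16-bit scale (34 bytes per 32 weights, or 8.5 bits per weight). It ignores metadata, the KV cache, and activations, so it is a lower bound rather than a measured requirement.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in bytes."""
    return int(n_params * bits_per_weight / 8)


N_PARAMS = 1.6e9   # parameter count as stated in this README
FP16_BITS = 16     # float16 / bfloat16 checkpoints
Q8_0_BITS = 8.5    # Q8_0: 32 int8 weights + one 16-bit scale per block

fp16_gb = weight_bytes(N_PARAMS, FP16_BITS) / 1e9
q8_gb = weight_bytes(N_PARAMS, Q8_0_BITS) / 1e9
print(f"16-bit weights: ~{fp16_gb:.1f} GB, Q8_0 weights: ~{q8_gb:.1f} GB")
```

These estimates are in line with the file sizes recorded for this repository: 3,228,455,704 bytes for `model.safetensors` and 1,719,280,512 bytes for `faust_1_q8_0.gguf`.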
---
## Intended use
- German conversational assistants
- research and benchmarking on German NLP tasks
- local and privacy-sensitive deployments
- on-device or edge experimentation
---
## Roadmap
- Reasoning-focused variant (coming soon)
- Agent-oriented variant (coming soon)
---
## Citation
A technical paper describing training methodology, tokenizer design, and evaluation is in preparation.
Developed by [tabularis.ai](https://tabularis.ai) in Tübingen.

14
chat_template.jinja Normal file

@@ -0,0 +1,14 @@
{% for m in messages %}
{% if m['role'] == 'system' %}
<|im_start|>system
{{ m['content'] }}<|im_end|>
{% elif m['role'] == 'user' %}
<|im_start|>user
{{ m['content'] }}<|im_end|>
{% elif m['role'] == 'assistant' %}
<|im_start|>assistant
{% generation %}{{ m['content'] }}{{ eos_token }}{% endgeneration %}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}
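
To illustrate the prompt layout this template produces, here is a pure-Python sketch of the same ChatML-style structure. The function `render_chatml` is hypothetical, and its whitespace handling is an approximation: the exact newlines depend on how Jinja trims blocks. `tokenizer.apply_chat_template` remains the authoritative rendering.

```python
def render_chatml(messages, add_generation_prompt=True, eos_token="<|im_end|>"):
    """Approximate the chat template: one <|im_start|>role ... <|im_end|> block per message."""
    parts = []
    for m in messages:
        role, content = m["role"], m["content"]
        if role in ("system", "user"):
            parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
        elif role == "assistant":
            # Assistant turns end with the EOS token, which is <|im_end|> for this tokenizer.
            parts.append(f"<|im_start|>assistant\n{content}{eos_token}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt_text = render_chatml([{"role": "user", "content": "Hallo"}])
print(prompt_text)
```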

61
config.json Normal file

@@ -0,0 +1,61 @@
{
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 3,
"dtype": "bfloat16",
"eos_token_id": 6,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 6144,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 40960,
"max_window_layers": 28,
"model_type": "qwen3",
"num_attention_heads": 16,
"num_hidden_layers": 28,
"num_key_value_heads": 8,
"pad_token_id": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": true,
"transformers_version": "4.57.5",
"use_cache": false,
"use_sliding_window": false,
"vocab_size": 100000
}
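
As a sanity check on the stated 1.6B parameters, the count can be estimated from the values in this config. The sketch assumes the usual Qwen3 layer layout, which is not spelled out in the config itself: bias-free q/k/v/o projections, a gated MLP with three weight matrices, two RMSNorm weights per layer, per-head q/k norms of size `head_dim`, and a final norm. The function name is illustrative.

```python
def qwen3_param_count(cfg):
    """Estimate the parameter count of a Qwen3-style decoder from its config values."""
    h = cfg["hidden_size"]
    d = cfg["head_dim"]
    q_dim = cfg["num_attention_heads"] * d
    kv_dim = cfg["num_key_value_heads"] * d

    attn = h * q_dim + 2 * h * kv_dim + q_dim * h   # q, k, v, o projections (no bias)
    mlp = 3 * h * cfg["intermediate_size"]          # gate, up, down projections
    norms = 2 * h + 2 * d                           # two layer norms + q/k head norms
    per_layer = attn + mlp + norms

    embed = cfg["vocab_size"] * h
    total = cfg["num_hidden_layers"] * per_layer + embed + h  # + final norm
    if not cfg["tie_word_embeddings"]:
        total += embed                              # separate output head
    return total


config = {
    "hidden_size": 2048,
    "head_dim": 128,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "vocab_size": 100000,
    "tie_word_embeddings": True,
}
n_params = qwen3_param_count(config)
print(f"{n_params:,} parameters, ~{2 * n_params:,} bytes in bfloat16")
```

Under these assumptions the estimate is 1,614,210,048 parameters, or about 3.23 GB at two bytes per weight, which is close to the 3,228,455,704-byte `model.safetensors` recorded in this commit.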

3
faust_1_q8_0.gguf Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5c31eb18c8eade06e0e6593382017f55a19b2e830b860a29c96806d64073f582
size 1719280512

3
faust_bench.png Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:78231d070ee7b586ae09d4601e84b0629d616c7459428dbdf25ae72ec352a05f
size 106310

13
generation_config.json Normal file

@@ -0,0 +1,13 @@
{
"bos_token_id": 3,
"do_sample": true,
"eos_token_id": [
6,
4
],
"pad_token_id": 1,
"temperature": 0.6,
"top_k": 20,
"top_p": 0.95,
"transformers_version": "4.57.5"
}
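
The sampling defaults above (`temperature`, `top_k`, `top_p`) shape the next-token distribution before a token is drawn. The following pure-Python sketch shows the idea on a toy list of logits. It is a simplified illustration under the assumption that the filters are applied in the order temperature, top-k, top-p; it is not the `transformers` implementation, and the function name is hypothetical.

```python
import math


def filter_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # A temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    mass = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / mass
        if cumulative >= top_p:
            break

    total = sum(weights[i] for i in kept)
    return {i: weights[i] / total for i in kept}


dist = filter_distribution([2.0, 1.0, 0.0, -1.0])
print(dist)
```

A token would then be drawn from the returned distribution, for example with `random.choices(list(dist), weights=list(dist.values()))`.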

3
logo-faust.webp Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:60e1441eecdde9fab1e9d681f1102d979046b981f5174d1f074f6849dff5b2a6
size 1002788

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:771c4911227f2792ce43c5f4e285bb4ec67942b95fa15f376b20cd2227879de6
size 3228455704

45
special_tokens_map.json Normal file

@@ -0,0 +1,45 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|im_sep|>",
"<|special_0|>",
"<|special_1|>",
"<|special_2|>",
"<|special_3|>",
"<|special_4|>",
"<|special_5|>",
"<|special_6|>",
"<|special_7|>",
"<|special_8|>",
"<|special_9|>"
],
"bos_token": {
"content": "<|bos|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<|unk|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

489896
tokenizer.json Normal file

File diff suppressed because it is too large

BIN
tokenizer_bench.png Normal file

Binary file not shown. Size: 60 KiB

180
tokenizer_config.json Normal file

@@ -0,0 +1,180 @@
{
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|unk|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<|bos|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<|eos|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"5": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"6": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<|im_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"8": {
"content": "<|special_0|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"9": {
"content": "<|special_1|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"10": {
"content": "<|special_2|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<|special_3|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<|special_4|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<|special_5|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"14": {
"content": "<|special_6|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"15": {
"content": "<|special_7|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"16": {
"content": "<|special_8|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"17": {
"content": "<|special_9|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|im_sep|>",
"<|special_0|>",
"<|special_1|>",
"<|special_2|>",
"<|special_3|>",
"<|special_4|>",
"<|special_5|>",
"<|special_6|>",
"<|special_7|>",
"<|special_8|>",
"<|special_9|>"
],
"bos_token": "<|bos|>",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"extra_special_tokens": {},
"max_length": 2048,
"model_max_length": 8192,
"pad_token": "<|pad|>",
"stride": 0,
"tokenizer_class": "PreTrainedTokenizerFast",
"truncation_side": "right",
"truncation_strategy": "longest_first",
"unk_token": "<|unk|>",
"return_token_type_ids": false,
"model_input_names": [
"input_ids",
"attention_mask"
]
}

3
tokenizer_faust.png Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:43356a78c9e9925f7b8c3b6fc8531517cba19bbf9e73ae81619d1e928fa488dc
size 453747