Faust-1/README.md

---
library_name: transformers
license_link: https://huggingface.co/Qwen/Qwen3-1.7B/blob/main/LICENSE
pipeline_tag: text-generation
license: cc-by-nc-4.0
extra_gated_prompt: >
  ### FAUST-1 NON-COMMERCIAL LICENSE AGREEMENT


  Version 1.0 — January 2025


  "Faust-1" refers to the language model weights, code, and documentation made
  available by Tabularis AI GmbH ("Tabularis") under this agreement.


  1. License Grant

  You are granted a non-exclusive, non-transferable, royalty-free license to
  use, copy, and modify Faust-1 for non-commercial research and personal
  purposes only.


  2. Non-Commercial Use

  "Non-commercial" means academic research, personal projects, and educational
  use. Any use intended to generate revenue, provide commercial services, or
  benefit a for-profit entity requires a separate commercial license.


  3. Commercial Licensing

  For commercial use, please contact: info@tabularis.ai


  4. Attribution

  You must include "Built with Faust-1 by Tabularis AI" in any derivative work
  or publication.


  5. No Warranty

  Faust-1 is provided "as is" without warranties of any kind.


  6. Termination

  This license terminates automatically if you violate any terms.


  ---

  ### Additional Access Requirement

  Access to this repository is approval-based.

  You must join our Discord server: https://discord.gg/7WqEKw652R
extra_gated_fields:
  Name: text
  Email: text
  Affiliation: text
  I have joined the Tabularis AI Discord server: checkbox
  I accept the Faust-1 Non-Commercial License Agreement: checkbox
extra_gated_description: |
  Faust-1 is for non-commercial use only.
  For commercial licensing contact info@tabularis.ai

  Approval requires Discord membership.
  Join: https://discord.gg/7WqEKw652R
extra_gated_button_content: Submit
language:
- de
- en


tags:
- llama.cpp
- synthetic data


---


<!-- <a href="https://faust.tabularis.ai/" target="_blank" style="margin: 2px;">
  <img
    alt="Faust-1 Demo"
    src="https://img.shields.io/badge/%E2%9C%A8%20Faust--1%20Demo-2b2b2b?style=flat&logo=ai&logoColor=white"
    style="display: inline-block; vertical-align: middle;"
  />
</a> -->


<p align="center">
  <img src="./logo-faust.webp" alt="Faust-1 Logo" width="220">
</p>

# Faust-1 — German-First Large Language Model (1.6B)

Faust-1 is a German-first large language model with 1.6B parameters, trained entirely from scratch. Model development comprises large-scale data collection and synthetic data generation, followed by data cleaning, normalization, and deduplication to reduce contamination and redundancy. Pre-training is performed on a predominantly German corpus using a decoder-only language modeling objective, resulting in a foundation model for the German language that captures lexical, syntactic, and semantic regularities at scale.

Following pre-training, the model undergoes supervised post-training (instruction tuning) using labeled input–output pairs to adapt the base model for conversational and task-oriented use. In later stages, preference-based optimization, including Direct Preference Optimization (DPO), is applied to improve response quality, stability, and alignment with human expectations, while preserving the efficiency constraints required for small-scale and local deployment.

<!-- Demo: [faust.tabularis.ai](https://faust.tabularis.ai)
 -->


> [!TIP]
> **Designed for local and cost-efficient deployment.**  
> Faust-1 is deliberately sized and optimized to run on **consumer-grade hardware** and **does not require expensive data-center GPUs**.
---

## Model summary

- Repository: tabularisai/Faust-1  
- Model type: decoder-only causal language model  
- Parameters: 1.6B  
- Interface: conversational / instruction (chat template provided)  
- Primary language: German (~90%)
- Custom State-of-the-Art tokenizer for German language

---

## Quickstart


### Conversational usage (recommended)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM  
import torch  

model_id = "tabularisai/Faust-1"

tokenizer = AutoTokenizer.from_pretrained(model_id)  
model = AutoModelForCausalLM.from_pretrained(  
    model_id,  
    torch_dtype=torch.float16,  
    device_map="auto",  
)

messages = [  
    {"role": "user", "content": "Gib mir eine kurze Einführung in große Sprachmodelle (LLM)."}  
]

inputs = tokenizer.apply_chat_template(  
    messages,  
    add_generation_prompt=True,  
    return_tensors="pt",  
).to(model.device)

outputs = model.generate(  
    inputs,  
    max_new_tokens=256,  
    temperature=0.6,  
    do_sample=True,  
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Conditional Generation

```python
!pip install git+https://github.com/tabularis-ai/guidegen.git

import sys
import os
import json
import time

import guidegen as gg
from pydantic import BaseModel, Field
from typing import Literal, List

# Hugging Face access token - set via environment variable or .env file
# You can set it with: export HUGGINGFACE_HUB_TOKEN=your_token_here
# Or create a .env file with: HUGGINGFACE_HUB_TOKEN=your_token_here

MODEL_NAME = "tabularisai/Faust-1"


# --- Schema ---
class EmailSummary(BaseModel):
    """Structured summary of an email."""
    Absender: str = Field(description="Der Name des Absenders.")
    Betreff: str = Field(description="Worum geht es in der E-Mail? (max 5 Wörter)")
    Zusammenfassung: str = Field(description="Kurze Zusammenfassung (max 2 Sätze).")
    Prioritaet: Literal["hoch", "mittel", "niedrig"] = Field(description="Wie wichtig die E-Mail ist.")
    # AntwortNoetig: bool = Field(description="Muss man auf die E-Mail antworten?")


# --- Input ---
email_text = """Hallo Jens,

wir hatten uns bei CampusFounders im Rahmen unserer Pre-Seed-Runde kennengelernt.
Seitdem haben wir große Fortschritte gemacht und bereiten aktuell unsere Seed-Runde vor.

Wir entwickeln eine Infrastruktur für hocheffiziente, lokal trainierbare KI-Modelle – vollständig ohne Cloud.
Sehr gern würden wir uns mit dir austauschen und prüfen, ob ein Intro zu US-VCs oder ein Gespräch mit Crestlight möglich wäre.

Anbei ein kurzer OnePager zur Weiterleitung.

Beste Grüße  
Ricard"""


# --- Prompt ---
prompt = f"""
Du bist ein intelligenter Assistent, der E-Mails analysiert und als JSON zusammenfasst.
Halte die Zusammenfassung kurz (1-2 Sätze). Betreff maximal 5 Wörter.

--- Beispiel ---
E-Mail-Text:
Sehr geehrte Damen und Herren, ich wollte nur nachfragen, ob meine Bestellung #12345 schon versandt wurde. Vielen Dank, Max Mustermann
JSON-Antwort:
{{
  "Absender": "Max Mustermann",
  "Betreff": "Bestellstatus Anfrage",
  "Zusammenfassung": "Anfrage zum Versandstatus der Bestellung #12345.",
  "Prioritaet": "mittel",
}}
--- Ende Beispiel ---

Jetzt analysiere die folgende E-Mail und erstelle das JSON-Objekt.

E-Mail-Text:
{email_text}
"""


def main():
    print("=" * 60)
    print("EMAIL SUMMARIZATION WITH GUIDEGEN")
    print("=" * 60)

    print(f"\nLoading model: {MODEL_NAME}")
    load_start = time.time()

    gen = gg.GuideGen(
        MODEL_NAME,
        verbose=True,
        use_chat_template=True,
        enable_thinking=False,
    )

    load_time = time.time() - load_start
    print(f"Model loaded in {load_time:.2f}s")

    # --- Generate ---
    print("\nGenerating structured summary...")
    gen_start = time.time()

    options = gg.GuideGenOptions(
        temperature=0.6,
        max_tokens=400,
        do_sample=False,
    )

    summary = gen.generate(prompt, EmailSummary, options=options)

    gen_time = time.time() - gen_start
    print(f"Generation complete in {gen_time:.2f}s")

    # --- Output ---
    print("\n--- Email Summary (JSON) ---")
    print(json.dumps(summary.model_dump(), indent=2, ensure_ascii=False))
    print(f"\n  Model load: {load_time:.2f}s | Generation: {gen_time:.2f}s | Total: {load_time + gen_time:.2f}s")
```

---

## Training focus

### German-first data distribution

Faust-1 is trained from scratch with a German-dominant corpus. German syntax, compounding, morphology, and typical reasoning patterns are treated as the default operating regime rather than an edge case.

### Verified synthetic data

A substantial portion of the training signal comes from synthetic data. To keep this signal usable, generation is paired with explicit verification and filtering:

- LLM-as-judge style evaluations  
- rule-based and programmatic checks  
- consistency and self-agreement filtering  

This allows broad coverage of instruction-following and reasoning patterns while maintaining quality control.

---

## Tokenizer optimized for German

Faust-1 uses a custom tokenizer optimized for German morphology and compounding. Token efficiency is treated as a deployment constraint, not just a preprocessing detail.

![Tokenizer efficiency on German language](tokenizer_bench.png)

Lower token counts on German text translate directly into more usable context, lower inference cost, and less fragmentation on compound-heavy inputs.


<img src="tokenizer_faust.png" alt="Faust-1 vs OpenAI Tokenizers" width="800">


---

## German benchmark performance

Faust-1 is evaluated on a set of standard German-language benchmarks:

- ARC_de  
- GSM8K_de  
- HellaSwag_de  
- MMLU_de  
- TruthfulQA_de  

![German benchmark performance](faust_bench.png)

The target is best-in-class performance within the 1–2B parameter range for German-focused models, using benchmarks that are easy to reproduce in Hugging Face-based evaluation pipelines.

---

## Deployment examples

Faust-1 can be deployed with common inference stacks that support decoder-only language models.

vLLM (OpenAI-compatible API)
```sh
vllm serve tabularisai/Faust-1 --dtype float16
```

SGLang
```sh
python -m sglang.launch_server \
  --model-path tabularisai/Faust-1 \
  --dtype float16
```

llama.cpp (GGUF, local / on-device)
```sh
./llama-cli \
  -m faust_1_q8_0.gguf \
  -p "Erkläre kurz, was ein großes Sprachmodell ist."
```

The repository includes a prebuilt Q8_0 GGUF file for efficient local inference.

---

## Intended use

- German conversational assistants  
- research and benchmarking on German NLP tasks  
- local and privacy-sensitive deployments  
- on-device or edge experimentation  

---

## Roadmap

- Reasoning-focused variant  (comming soon)
- Agent-oriented variant  (comming soon)

---

## Citation

A technical paper describing training methodology, tokenizer design, and evaluation is in preparation.


Developed by [tabularis.ai](https://tabularis.ai) in Tübingen.