---
library_name: transformers
license_link: https://huggingface.co/Qwen/Qwen3-1.7B/blob/main/LICENSE
pipeline_tag: text-generation
license: cc-by-nc-4.0
extra_gated_prompt: >
  ### FAUST-1 NON-COMMERCIAL LICENSE AGREEMENT
  Version 1.0 - January 2025
  "Faust-1" refers to the language model weights, code, and documentation made
  available by Tabularis AI GmbH ("Tabularis") under this agreement.
  1. License Grant
  You are granted a non-exclusive, non-transferable, royalty-free license to
  use, copy, and modify Faust-1 for non-commercial research and personal
  purposes only.
  2. Non-Commercial Use
  "Non-commercial" means academic research, personal projects, and educational
  use. Any use intended to generate revenue, provide commercial services, or
  benefit a for-profit entity requires a separate commercial license.
  3. Commercial Licensing
  For commercial use, please contact: info@tabularis.ai
  4. Attribution
  You must include "Built with Faust-1 by Tabularis AI" in any derivative work
  or publication.
  5. No Warranty
  Faust-1 is provided "as is" without warranties of any kind.
  6. Termination
  This license terminates automatically if you violate any terms.
  ---
  ### Additional Access Requirement
  Access to this repository is approval-based.
  You must join our Discord server: https://discord.gg/7WqEKw652R
extra_gated_fields:
  Name: text
  Email: text
  Affiliation: text
  I have joined the Tabularis AI Discord server: checkbox
  I accept the Faust-1 Non-Commercial License Agreement: checkbox
extra_gated_description: |
  Faust-1 is for non-commercial use only.
  For commercial licensing contact info@tabularis.ai
  Approval requires Discord membership.
  Join: https://discord.gg/7WqEKw652R
extra_gated_button_content: Submit
language:
  - de
  - en
tags:
  - llama.cpp
  - synthetic data
---
<!-- <a href="https://faust.tabularis.ai/" target="_blank" style="margin: 2px;">
<img
alt="Faust-1 Demo"
src="https://img.shields.io/badge/%E2%9C%A8%20Faust--1%20Demo-2b2b2b?style=flat&logo=ai&logoColor=white"
style="display: inline-block; vertical-align: middle;"
/>
</a> -->
<p align="center">
<img src="./logo-faust.webp" alt="Faust-1 Logo" width="220">
</p>
# Faust-1 — German-First Large Language Model (1.6B)
Faust-1 is a German-first large language model with 1.6B parameters, trained entirely from scratch. Model development comprises large-scale data collection and synthetic data generation, followed by data cleaning, normalization, and deduplication to reduce contamination and redundancy. Pre-training is performed on a predominantly German corpus using a decoder-only language modeling objective, resulting in a foundation model for the German language that captures lexical, syntactic, and semantic regularities at scale.
Following pre-training, the model undergoes supervised post-training (instruction tuning) using labeled input-output pairs to adapt the base model for conversational and task-oriented use. In later stages, preference-based optimization, including Direct Preference Optimization (DPO), is applied to improve response quality, stability, and alignment with human expectations, while preserving the efficiency constraints required for small-scale and local deployment.
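
For readers unfamiliar with DPO, the sketch below computes the standard per-example DPO loss from summed sequence log-probabilities under the trained policy and a frozen reference model. It is an illustration only, not the Faust-1 training code: the function name `dpo_loss`, the example log-probabilities, and `beta=0.1` are assumptions, since this README does not state the hyperparameters used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss: -log(sigmoid(beta * margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); pick the numerically stable branch.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931
# Policy prefers the chosen response more strongly than the reference: loss drops.
print(dpo_loss(-10.0, -16.0, -12.0, -15.0) < math.log(2))  # True
```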
<!-- Demo: [faust.tabularis.ai](https://faust.tabularis.ai)
-->
> [!TIP]
> **Designed for local and cost-efficient deployment.**
> Faust-1 is deliberately sized and optimized to run on **consumer-grade hardware** and **does not require expensive data-center GPUs**.
---
## Model summary
- Repository: tabularisai/Faust-1
- Model type: decoder-only causal language model
- Parameters: 1.6B
- Interface: conversational / instruction (chat template provided)
- Primary language: German (~90%)
- Tokenizer: custom tokenizer optimized for German
---
## Quickstart
### Conversational usage (recommended)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tabularisai/Faust-1"

# The repository is gated: request access and log in (e.g. `huggingface-cli login`) first.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package
)

messages = [
    {"role": "user", "content": "Gib mir eine kurze Einführung in große Sprachmodelle (LLM)."}
]

# Render the chat template and tokenize; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.6,
    do_sample=True,
)

# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Conditional Generation
```python
# Install guidegen first, e.g. in a notebook cell:
#   !pip install git+https://github.com/tabularis-ai/guidegen.git
import json
import time
from typing import Literal

import guidegen as gg
from pydantic import BaseModel, Field

# The model repository is gated. Provide a Hugging Face access token via an
# environment variable or a .env file, for example:
#   export HUGGINGFACE_HUB_TOKEN=your_token_here
MODEL_NAME = "tabularisai/Faust-1"
# --- Schema ---
class EmailSummary(BaseModel):
    """Structured summary of an email."""
    Absender: str = Field(description="Der Name des Absenders.")
    Betreff: str = Field(description="Worum geht es in der E-Mail? (max 5 Wörter)")
    Zusammenfassung: str = Field(description="Kurze Zusammenfassung (max 2 Sätze).")
    Prioritaet: Literal["hoch", "mittel", "niedrig"] = Field(description="Wie wichtig die E-Mail ist.")
    # AntwortNoetig: bool = Field(description="Muss man auf die E-Mail antworten?")
# --- Input ---
email_text = """Hallo Jens,
wir hatten uns bei CampusFounders im Rahmen unserer Pre-Seed-Runde kennengelernt.
Seitdem haben wir große Fortschritte gemacht und bereiten aktuell unsere Seed-Runde vor.
Wir entwickeln eine Infrastruktur für hocheffiziente, lokal trainierbare KI-Modelle vollständig ohne Cloud.
Sehr gern würden wir uns mit dir austauschen und prüfen, ob ein Intro zu US-VCs oder ein Gespräch mit Crestlight möglich wäre.
Anbei ein kurzer OnePager zur Weiterleitung.
Beste Grüße
Ricard"""
# --- Prompt ---
prompt = f"""
Du bist ein intelligenter Assistent, der E-Mails analysiert und als JSON zusammenfasst.
Halte die Zusammenfassung kurz (1-2 Sätze). Betreff maximal 5 Wörter.
--- Beispiel ---
E-Mail-Text:
Sehr geehrte Damen und Herren, ich wollte nur nachfragen, ob meine Bestellung #12345 schon versandt wurde. Vielen Dank, Max Mustermann
JSON-Antwort:
{{
  "Absender": "Max Mustermann",
  "Betreff": "Bestellstatus Anfrage",
  "Zusammenfassung": "Anfrage zum Versandstatus der Bestellung #12345.",
  "Prioritaet": "mittel"
}}
--- Ende Beispiel ---
Jetzt analysiere die folgende E-Mail und erstelle das JSON-Objekt.
E-Mail-Text:
{email_text}
"""
def main():
    print("=" * 60)
    print("EMAIL SUMMARIZATION WITH GUIDEGEN")
    print("=" * 60)

    print(f"\nLoading model: {MODEL_NAME}")
    load_start = time.time()
    gen = gg.GuideGen(
        MODEL_NAME,
        verbose=True,
        use_chat_template=True,
        enable_thinking=False,
    )
    load_time = time.time() - load_start
    print(f"Model loaded in {load_time:.2f}s")

    # --- Generate ---
    print("\nGenerating structured summary...")
    gen_start = time.time()
    # Note: with do_sample=False the temperature setting likely has no effect
    # (assumption based on how greedy decoding usually works).
    options = gg.GuideGenOptions(
        temperature=0.6,
        max_tokens=400,
        do_sample=False,
    )
    summary = gen.generate(prompt, EmailSummary, options=options)
    gen_time = time.time() - gen_start
    print(f"Generation complete in {gen_time:.2f}s")

    # --- Output ---
    print("\n--- Email Summary (JSON) ---")
    print(json.dumps(summary.model_dump(), indent=2, ensure_ascii=False))
    print(f"\n Model load: {load_time:.2f}s | Generation: {gen_time:.2f}s | Total: {load_time + gen_time:.2f}s")


if __name__ == "__main__":
    main()
```
---
## Training focus
### German-first data distribution
Faust-1 is trained from scratch with a German-dominant corpus. German syntax, compounding, morphology, and typical reasoning patterns are treated as the default operating regime rather than an edge case.
### Verified synthetic data
A substantial portion of the training signal comes from synthetic data. To keep this signal usable, generation is paired with explicit verification and filtering:
- LLM-as-judge style evaluations
- rule-based and programmatic checks
- consistency and self-agreement filtering
This allows broad coverage of instruction-following and reasoning patterns while maintaining quality control.
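
The three filter types listed above can be combined into a single pass over candidate samples. The following is a minimal sketch under stated assumptions, not the pipeline used for Faust-1: the sample layout (`prompt`, `answer`, `candidates`), the thresholds, and the `judge` callable (a stand-in for an LLM-as-judge scorer returning a value in [0, 1]) are all hypothetical.

```python
from collections import Counter


def passes_rules(sample):
    """Rule-based check: non-empty answer that does not simply echo the prompt."""
    answer = sample["answer"].strip()
    return bool(answer) and answer != sample["prompt"].strip()


def self_agreement(candidates, min_share=0.6):
    """Return the majority answer if enough independent generations agree, else None."""
    if not candidates:
        return None
    answer, count = Counter(c.strip() for c in candidates).most_common(1)[0]
    return answer if count / len(candidates) >= min_share else None


def filter_samples(samples, judge, min_score=0.7):
    """Keep samples that pass the rules, self-agreement, and the judge threshold."""
    kept = []
    for s in samples:
        if not passes_rules(s):
            continue
        agreed = self_agreement(s["candidates"])
        if agreed is None or agreed != s["answer"].strip():
            continue
        if judge(s["prompt"], s["answer"]) >= min_score:
            kept.append(s)
    return kept


# Toy usage with a constant stand-in judge; a real pipeline would call an LLM here.
samples = [
    {"prompt": "Was ist 2 + 2?", "answer": "4", "candidates": ["4", "4", "5"]},
    {"prompt": "Was ist 3 + 3?", "answer": "7", "candidates": ["6", "7", "8"]},
]
kept = filter_samples(samples, judge=lambda prompt, answer: 0.9)
print([s["prompt"] for s in kept])  # ['Was ist 2 + 2?']
```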
---
## Tokenizer optimized for German
Faust-1 uses a custom tokenizer optimized for German morphology and compounding. Token efficiency is treated as a deployment constraint, not just a preprocessing detail.
![Tokenizer efficiency on German language](tokenizer_bench.png)
Lower token counts on German text translate directly into more usable context, lower inference cost, and less fragmentation on compound-heavy inputs.
<img src="tokenizer_faust.png" alt="Faust-1 vs OpenAI Tokenizers" width="800">
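
One common way to quantify token efficiency is the average number of tokens per whitespace-separated word. The sketch below shows that calculation; it is not necessarily the method behind the charts above. The helper `tokens_per_word` and the stand-in tokenizer `char_chunks` are hypothetical and exist only so the example runs without downloading a model.

```python
def tokens_per_word(tokenize, texts):
    """Average number of tokens per whitespace-separated word (lower is better)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def char_chunks(text, size=4):
    """Stand-in tokenizer for illustration only: fixed-size character chunks per word."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


texts = ["Die Donaudampfschifffahrtsgesellschaft sucht einen Kapitän."]
print(round(tokens_per_word(char_chunks, texts), 2))

# With a real tokenizer, pass its `tokenize` method instead, e.g.
# AutoTokenizer.from_pretrained("tabularisai/Faust-1").tokenize
```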
---
## German benchmark performance
Faust-1 is evaluated on a set of standard German-language benchmarks:
- ARC_de
- GSM8K_de
- HellaSwag_de
- MMLU_de
- TruthfulQA_de
![German benchmark performance](faust_bench.png)
The target is best-in-class performance within the 1-2B parameter range for German-focused models, using benchmarks that are easy to reproduce in Hugging Face-based evaluation pipelines.
---
## Deployment examples
Faust-1 can be deployed with common inference stacks that support decoder-only language models.
vLLM (OpenAI-compatible API)
```sh
vllm serve tabularisai/Faust-1 --dtype float16
```
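Once the server is running, it can be queried over HTTP through the OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default address `http://localhost:8000`; the helper names `build_chat_request` and `chat`, and the sampling values, are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="tabularisai/Faust-1", max_tokens=256, temperature=0.6):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server started above to be running):
# print(chat("Erkläre kurz, was ein großes Sprachmodell ist."))
```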
SGLang
```sh
python -m sglang.launch_server \
--model-path tabularisai/Faust-1 \
--dtype float16
```
llama.cpp (GGUF, local / on-device)
```sh
./llama-cli \
-m faust_1_q8_0.gguf \
-p "Erkläre kurz, was ein großes Sprachmodell ist."
```
The repository includes a prebuilt Q8_0 GGUF file for efficient local inference.
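
As a rough guide to what Q8_0 buys on consumer hardware, the weight footprint can be estimated from the parameter count and the bits stored per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement of the shipped file: it uses the rounded 1.6B parameter count, ignores file metadata, the KV cache, and activations, and the helper name `weight_size_gib` is hypothetical.

```python
def weight_size_gib(n_params, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 1.6e9  # rounded parameter count from the model summary

# float16 stores 16 bits per weight.
print(f"float16: {weight_size_gib(N_PARAMS, 16):.2f} GiB")   # float16: 2.98 GiB
# Q8_0 packs 32 int8 weights plus one float16 scale per block:
# 34 bytes / 32 weights = 8.5 bits per weight.
print(f"Q8_0:    {weight_size_gib(N_PARAMS, 8.5):.2f} GiB")  # Q8_0:    1.58 GiB
```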
---
## Intended use
- German conversational assistants
- research and benchmarking on German NLP tasks
- local and privacy-sensitive deployments
- on-device or edge experimentation
---
## Roadmap
- Reasoning-focused variant (coming soon)
- Agent-oriented variant (coming soon)
---
## Citation
A technical paper describing training methodology, tokenizer design, and evaluation is in preparation.
Developed by [tabularis.ai](https://tabularis.ai) in Tübingen.