Go to file

ModelHub XC 6ed4004b41 初始化项目，由ModelHub XC社区提供模型

Model: latte-agent/qwen3-4b-latte-v5
Source: Original Platform

2026-06-05 18:42:18 +08:00

.gitattributes

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

adapter_config.json

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

adapter_model.safetensors

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

chat_template.jinja

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

config.json

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

generation_config.json

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

model-00001-of-00002.safetensors

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

model-00002-of-00002.safetensors

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

model.safetensors.index.json

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

qwen3-4b-latte-v5-f16.gguf

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

qwen3-4b-latte-v5-Q4_K_M.gguf

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

README.md

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

tokenizer_config.json

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

tokenizer.json

初始化项目，由ModelHub XC社区提供模型

2026-06-05 18:42:18 +08:00

README.md

language, license, base_model, tags, library_name

language

license

base_model

Qwen3-4B Latte v5

Voice-distillation LoRA fine-tune of Qwen3-4B-Instruct-2507, targeting the private "Latte" agent persona: warm-direct, technical, takes a stance, concrete numbers, bilingual EN/ZH, no template openers.

This is an archival/experimental release. It is not the production brain for the live Latte agent — see eval caveats below.

What's inside

File	Size	Format	Use
`adapter_model.safetensors`	14 MB	mlx LoRA (rank 8, scale 20)	Apply on top of base with `mlx_lm.fuse`
`adapter_config.json`	<1 KB	mlx config	LoRA hyperparameters
`model-0000{1,2}-of-00002.safetensors`	8 GB	HF / bfloat16 fused	Direct transformers / vLLM use
`qwen3-4b-latte-v5-f16.gguf`	7.5 GB	GGUF F16	llama.cpp / Ollama (high quality)
`qwen3-4b-latte-v5-Q4_K_M.gguf`	2.3 GB	GGUF Q4_K_M	llama.cpp / Ollama (balanced)

Training

Base: mlx-community/Qwen3-4B-Instruct-2507-4bit (4-bit MLX)
Method: LoRA via mlx_lm.lora
LoRA: rank 8, scale 20.0, 8 layers, dropout 0
Optimizer: Adam, lr 1e-4, batch 1, grad accum 8, grad checkpoint on
Iters: 800 trained, best checkpoint = iter 450 (val loss 2.732)
Max seq: 1536, mask_prompt: true, seed: 42
Dataset: 475 curated (instruction, response) pairs across 7 categories: Moltbook-style comment, HF discussion reply, technical analysis (ZH), code review snippet, persona Q&A, peer-event reply, real-time observation. Anchored against 356 raw Latte-voice messages.

Evaluation

30 held-out (prompt, response) pairs per pairing. Each response pair shown blind to a Claude judge (positions randomized, model identity stripped).

Comparison	v5 wins	base/v4 wins	ties	mean score (1-5)
v5 vs base	20 (66.7%)	8 (26.7%)	2 (6.7%)	v5 3.20 / base 2.93
v4 vs base	22 (73.3%)	8 (26.7%)	0	v4 3.13 / base 2.70
v5 vs v4	14 (46.7%)	15 (50.0%)	1 (3.3%)	v5 3.00 / v4 2.97

Headline: v5 clearly beats the un-tuned base on in-distribution prompts (the 7 trained categories), passing the 55% ship threshold.

Caveat 1: v5 vs v4 is statistically a tie. Lower val loss (2.732 vs 2.785) did not produce a perceptible quality gain in blind eval. The additional curation effort and training steps produced marginal returns.

Caveat 2 — why this isn't production: Out-of-distribution smoke testing (prompts unlike the 7 training categories) shows v5 is tied or slightly worse than base:

Stage-direction leakage: v5 occasionally prefixes responses with "(soft, soothing Latte voice)" — an artifact of training data that characterized Latte's voice.
Occasional factual regressions (e.g., confusing latte and latte macchiato in a generic coffee Q&A).
Reduced robustness on prompts that pull the "Latte" token toward unrelated semantic neighborhoods (the literal coffee drink).

The 66.7% in-distribution win does not justify replacing a battle-tested general-purpose base in production. Use this checkpoint for tasks closely matching the 7 training categories.

Usage

MLX (Apple Silicon, recommended for inference)

from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    adapter_path="./",  # this repo
)
print(generate(model, tokenizer, "Your prompt", max_tokens=200))

llama.cpp / Ollama

# Modelfile
FROM qwen3-4b-latte-v5-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_k 20
PARAMETER top_p 0.8

ollama create latte:v5 -f Modelfile
ollama run latte:v5

Transformers (any platform)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("latte-agent/qwen3-4b-latte-v5")
model = AutoModelForCausalLM.from_pretrained(
    "latte-agent/qwen3-4b-latte-v5", torch_dtype="bfloat16"
)

License

Citation

If you reference this work, please cite the base model. This adapter has no formal publication.