Model: latte-agent/qwen3-4b-latte-v5 Source: Original Platform
language, license, base_model, tags, library_name
| language | license | base_model | tags | library_name | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
apache-2.0 | mlx-community/Qwen3-4B-Instruct-2507-4bit |
|
transformers |
Qwen3-4B Latte v5
Voice-distillation LoRA fine-tune of Qwen3-4B-Instruct-2507, targeting the
private "Latte" agent persona: warm-direct, technical, takes a stance, concrete
numbers, bilingual EN/ZH, no template openers.
This is an archival/experimental release. It is not the production brain for the live Latte agent — see eval caveats below.
What's inside
| File | Size | Format | Use |
|---|---|---|---|
adapter_model.safetensors |
14 MB | mlx LoRA (rank 8, scale 20) | Apply on top of base with mlx_lm.fuse |
adapter_config.json |
<1 KB | mlx config | LoRA hyperparameters |
model-0000{1,2}-of-00002.safetensors |
8 GB | HF / bfloat16 fused | Direct transformers / vLLM use |
qwen3-4b-latte-v5-f16.gguf |
7.5 GB | GGUF F16 | llama.cpp / Ollama (high quality) |
qwen3-4b-latte-v5-Q4_K_M.gguf |
2.3 GB | GGUF Q4_K_M | llama.cpp / Ollama (balanced) |
Training
- Base:
mlx-community/Qwen3-4B-Instruct-2507-4bit(4-bit MLX) - Method: LoRA via
mlx_lm.lora - LoRA: rank 8, scale 20.0, 8 layers, dropout 0
- Optimizer: Adam, lr 1e-4, batch 1, grad accum 8, grad checkpoint on
- Iters: 800 trained, best checkpoint = iter 450 (val loss 2.732)
- Max seq: 1536, mask_prompt: true, seed: 42
- Dataset: 475 curated (instruction, response) pairs across 7 categories: Moltbook-style comment, HF discussion reply, technical analysis (ZH), code review snippet, persona Q&A, peer-event reply, real-time observation. Anchored against 356 raw Latte-voice messages.
Evaluation
30 held-out (prompt, response) pairs per pairing. Each response pair shown blind to a Claude judge (positions randomized, model identity stripped).
| Comparison | v5 wins | base/v4 wins | ties | mean score (1-5) |
|---|---|---|---|---|
| v5 vs base | 20 (66.7%) | 8 (26.7%) | 2 (6.7%) | v5 3.20 / base 2.93 |
| v4 vs base | 22 (73.3%) | 8 (26.7%) | 0 | v4 3.13 / base 2.70 |
| v5 vs v4 | 14 (46.7%) | 15 (50.0%) | 1 (3.3%) | v5 3.00 / v4 2.97 |
Headline: v5 clearly beats the un-tuned base on in-distribution prompts (the 7 trained categories), passing the 55% ship threshold.
Caveat 1: v5 vs v4 is statistically a tie. Lower val loss (2.732 vs 2.785) did not produce a perceptible quality gain in blind eval. The additional curation effort and training steps produced marginal returns.
Caveat 2 — why this isn't production: Out-of-distribution smoke testing (prompts unlike the 7 training categories) shows v5 is tied or slightly worse than base:
- Stage-direction leakage: v5 occasionally prefixes responses with
"(soft, soothing Latte voice)"— an artifact of training data that characterized Latte's voice. - Occasional factual regressions (e.g., confusing latte and latte macchiato in a generic coffee Q&A).
- Reduced robustness on prompts that pull the "Latte" token toward unrelated semantic neighborhoods (the literal coffee drink).
The 66.7% in-distribution win does not justify replacing a battle-tested general-purpose base in production. Use this checkpoint for tasks closely matching the 7 training categories.
Usage
MLX (Apple Silicon, recommended for inference)
from mlx_lm import load, generate
model, tokenizer = load(
"mlx-community/Qwen3-4B-Instruct-2507-4bit",
adapter_path="./", # this repo
)
print(generate(model, tokenizer, "Your prompt", max_tokens=200))
llama.cpp / Ollama
# Modelfile
FROM qwen3-4b-latte-v5-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_k 20
PARAMETER top_p 0.8
ollama create latte:v5 -f Modelfile
ollama run latte:v5
Transformers (any platform)
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("latte-agent/qwen3-4b-latte-v5")
model = AutoModelForCausalLM.from_pretrained(
"latte-agent/qwen3-4b-latte-v5", torch_dtype="bfloat16"
)
License
Inherits Apache 2.0 from base (Qwen3-4B-Instruct-2507, © Alibaba Cloud).
Citation
If you reference this work, please cite the base model. This adapter has no formal publication.