flammenai/FlameDesigner-Qwen2.5-3B-v1-GGUF

ModelHub XC 91026b9919 Initialize project; model provided by the ModelHub XC community

Model: flammenai/FlameDesigner-Qwen2.5-3B-v1-GGUF
Source: Original Platform

2026-06-17 05:54:16 +08:00

FlameDesigner-Qwen2.5-3B-v1-GGUF

GGUF quantizations of flammenai/FlameDesigner-Qwen2.5-3B-v1 — a Qwen2.5-3B-Instruct LoRA finetune that turns a free-text seed (e.g. "samurai", "Mongolian falconer") into a strict-schema JSON character design for flammen.ai's Create-a-Flame pipeline.

Trained on flammenai/flame-kindling-v1 (400 SFT rows distilled from Claude Sonnet 4.5).

Files

| Quant | Size | Notes |
|---|---|---|
| `FlameDesigner-Qwen2.5-3B-v1.f16.gguf` | 5.8 GB | Source for further quantization |
| `FlameDesigner-Qwen2.5-3B-v1.Q8_0.gguf` | 3.1 GB | Recommended. Best strict-schema compliance in our eval; near-F16 quality at half the size. |
| `FlameDesigner-Qwen2.5-3B-v1.Q5_K_M.gguf` | 2.1 GB | Compromise between Q8 and Q4. |
| `FlameDesigner-Qwen2.5-3B-v1.Q4_K_M.gguf` | 1.8 GB | Smallest practical CPU quant. Strict-schema pass rate drops noticeably (see eval); use the auto-repair shim. |

Inference

llama.cpp / llama-server

llama-server -m FlameDesigner-Qwen2.5-3B-v1.Q4_K_M.gguf \
  --host 127.0.0.1 --port 8081 -c 8192 --jinja

Then POST /v1/chat/completions with the flame_dataset.GOLD_SYSTEM system prompt and the seed as the user message. Output is a single JSON object matching the DesignedFlame schema (or close — see "Auto-repair shim" below).

Example

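import json

import requests

# System prompt from the dataset card / FlameKindling repo
with open("GOLD_SYSTEM.txt", encoding="utf-8") as f:
    SYSTEM = f.read()

r = requests.post(
    "http://127.0.0.1:8081/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "Mongolian falconer"},
        ],
        "max_tokens": 2048,
        "temperature": 0.7,
    },
    timeout=120,  # arbitrary client-side limit; adjust for your hardware
)
r.raise_for_status()  # surface HTTP errors before indexing into the body
text = r.json()["choices"][0]["message"]["content"]
print(json.loads(text))  # raises if the output is not valid JSON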
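
The Eval section notes that output is occasionally truncated by max_tokens. A minimal sketch of catching that case before parsing, assuming the server fills in the OpenAI-style `finish_reason` field on each choice; `parse_design` is a hypothetical helper name, and the payloads below are hand-written stand-ins for real responses:

```python
import json


def parse_design(response: dict) -> dict:
    """Extract the JSON design from an OpenAI-style chat completion payload.

    Assumes `finish_reason` is present; "length" means generation stopped
    at max_tokens, so the JSON is probably cut off.
    """
    choice = response["choices"][0]
    if choice.get("finish_reason") == "length":
        raise ValueError("output truncated by max_tokens; retry with a larger limit")
    obj = json.loads(choice["message"]["content"])  # raises on malformed JSON
    if not isinstance(obj, dict):
        raise ValueError("expected a single JSON object")
    return obj


# Hand-written payloads standing in for real server responses.
ok = {"choices": [{"finish_reason": "stop",
                   "message": {"content": '{"languages": ["en"]}'}}]}
cut = {"choices": [{"finish_reason": "length",
                    "message": {"content": '{"languages": ["e'}}]}

print(parse_design(ok))
try:
    parse_design(cut)
except ValueError as err:
    print("rejected:", err)
```

Failing fast on truncation keeps a cut-off design from being mistaken for a schema problem that the auto-repair shim could fix.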

Eval

20 held-out seeds (no overlap with the training data; a mix of one-word, sentence, and paragraph seeds). Inference ran at temperature=0.7 with full GPU offload (-ngl 999) on an A6000. Per-output coherence was judged by Qwen3.5-27B on a 1-5 scale (lenient at the high end).

| Quant | Avg latency | Strict pass | Soft pass (after auto-repair) |
|---|---|---|---|
| Q8_0 | 3.1 s | 15/20 (75%) | 19/20 (95%) |
| F16 | 5.1 s | 13/20 (65%) | 20/20 (100%) |
| Q4_K_M | 2.2 s | 7/20 (35%) | 19/20 (95%) |

Quantization noticeably affects strict-schema compliance: Q4_K_M passes 7/20 against 15/20 for Q8_0, roughly half the strict pass rate. The soft-pass numbers (after the auto-repair shim below) are within one sample of each other across all three quants (19 or 20 out of 20). Recommendation: Q8_0 in production, with the shim regardless.
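
The percentages in the table follow directly from the raw counts. A small worked sketch using only the numbers reported above (the `summarize` helper is illustrative, not part of any published tooling):

```python
# Counts copied from the eval table: (strict passes, soft passes) out of 20.
TOTAL = 20
results = {
    "Q8_0": (15, 19),
    "F16": (13, 20),
    "Q4_K_M": (7, 19),
}


def summarize(strict: int, soft: int, total: int = TOTAL) -> dict:
    """Derive pass percentages and how many strict failures the shim repaired."""
    return {
        "strict_pct": 100 * strict / total,
        "soft_pct": 100 * soft / total,
        "strict_failures": total - strict,
        "repaired": soft - strict,
    }


summary = {quant: summarize(*counts) for quant, counts in results.items()}
for quant, row in summary.items():
    print(quant, row)

# Relative drop in strict passes going from Q8_0 to Q4_K_M.
relative_drop = 1 - results["Q4_K_M"][0] / results["Q8_0"][0]
print(f"Q4_K_M keeps {1 - relative_drop:.0%} of Q8_0's strict passes")
```

With only 20 seeds, each sample is worth 5 percentage points, so differences of one or two samples between quants should be read loosely.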

Coherence on valid outputs is consistently 4.5-5.0 across all quants: when the model produces a parseable design, the design is good. The strict failures are minor constraint violations, not quality problems:

- writing_style arrays with 5 entries instead of the maximum of 4 (trim to 4)
- languages containing codes outside the SUPPORTED_LANGUAGES allow-list (e.g. mn, cy, mi, sq; Qwen2.5-3B knows these from base training and the LoRA didn't fully suppress them)
- system_prompt_extra over 512 chars (truncate)
- Rare: output truncated by max_tokens (use max_tokens >= 2048)
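
The three field-level violations listed above can be detected with a few explicit checks. This is a sketch only: it covers just these three constraints rather than the full DesignedFlame schema, `strict_violations` is a hypothetical helper name, and the allow-list contents are a placeholder because the real SUPPORTED_LANGUAGES is not reproduced in this card:

```python
# Placeholder allow-list; the real SUPPORTED_LANGUAGES is not shown in this card.
SUPPORTED_LANGUAGES = {"en", "es", "fr", "de", "ja"}

MAX_WRITING_STYLE = 4          # limit stated in the failure list above
MAX_PROMPT_EXTRA_CHARS = 512   # limit stated in the failure list above


def strict_violations(obj: dict) -> list[str]:
    """Return human-readable violations; an empty list means none were found."""
    problems = []
    style = obj.get("writing_style")
    if isinstance(style, list) and len(style) > MAX_WRITING_STYLE:
        problems.append(f"writing_style has {len(style)} entries (max {MAX_WRITING_STYLE})")
    langs = obj.get("languages")
    if isinstance(langs, list):
        unknown = [code for code in langs if code not in SUPPORTED_LANGUAGES]
        if unknown:
            problems.append(f"unsupported language codes: {unknown}")
    extra = obj.get("system_prompt_extra")
    if isinstance(extra, str) and len(extra) > MAX_PROMPT_EXTRA_CHARS:
        problems.append(
            f"system_prompt_extra is {len(extra)} chars (max {MAX_PROMPT_EXTRA_CHARS})"
        )
    return problems


# Made-up design that trips all three checks.
sample = {
    "writing_style": ["terse", "formal", "wry", "archaic", "poetic"],
    "languages": ["en", "mn"],
    "system_prompt_extra": "x" * 530,
}
for problem in strict_violations(sample):
    print(problem)
```

Logging which check fired per output makes it easy to see whether a given quant fails mostly on list length, language codes, or prompt length.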

Auto-repair shim

The production integration in FlameGen passes the parsed model output through this shim before validating it against DesignedFlame:

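def autorepair(obj: dict) -> dict:
    """Clamp the known strict-schema violations in place and return obj."""
    # SUPPORTED_LANGUAGES: allow-list of language codes, defined elsewhere.
    if isinstance(obj.get("writing_style"), list):
        obj["writing_style"] = obj["writing_style"][:4]  # keep at most 4 entries
    if isinstance(obj.get("languages"), list):
        obj["languages"] = [c for c in obj["languages"] if c in SUPPORTED_LANGUAGES]
        if not obj["languages"]:
            obj["languages"] = ["en"]  # fall back to English if nothing survives
    if isinstance(obj.get("system_prompt_extra"), str):
        # Truncate to the 512-char cap, then drop trailing whitespace.
        obj["system_prompt_extra"] = obj["system_prompt_extra"][:512].rstrip()
    return obj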

On Q4_K_M the shim recovers 12 of the 13 strict failures, lifting the effective pass rate from 35% to 95% (a 60-point gain) with no quality cost: the trimmed entries are themselves on-character, the model simply over-produced.
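
A usage sketch for the shim, repeated here so the snippet runs on its own. The allow-list contents and the sample design are made up for illustration. Because the shim mutates its argument, the example repairs a deep copy so the raw model output stays available for logging:

```python
import copy

# Placeholder allow-list; the real SUPPORTED_LANGUAGES is not shown in this card.
SUPPORTED_LANGUAGES = {"en", "es", "fr", "de", "ja"}


def autorepair(obj: dict) -> dict:
    """Same logic as the shim above: clamp known violations in place."""
    if isinstance(obj.get("writing_style"), list):
        obj["writing_style"] = obj["writing_style"][:4]
    if isinstance(obj.get("languages"), list):
        obj["languages"] = [c for c in obj["languages"] if c in SUPPORTED_LANGUAGES]
        if not obj["languages"]:
            obj["languages"] = ["en"]
    if isinstance(obj.get("system_prompt_extra"), str):
        obj["system_prompt_extra"] = obj["system_prompt_extra"][:512].rstrip()
    return obj


# Made-up model output with all three repairable violations.
raw = {
    "writing_style": ["terse", "formal", "wry", "archaic", "poetic"],
    "languages": ["mn"],
    "system_prompt_extra": "x" * 600,
}

repaired = autorepair(copy.deepcopy(raw))  # repair a copy, keep the original

print(repaired["writing_style"])             # first four entries only
print(repaired["languages"])                 # falls back to ["en"]
print(len(repaired["system_prompt_extra"]))  # capped at 512
print(len(raw["writing_style"]))             # original still has 5 entries
```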

Limitations

- Small training set (400 rows). The schema constraint violations above likely stem from the small dataset combined with a rank-128 LoRA, which has more capacity than 400 rows can fully constrain. A v2 with more data should improve the strict pass rate.
- Schema drift on language allow-list. Base Qwen knows codes outside SUPPORTED_LANGUAGES; the LoRA inherits this. The auto-repair shim handles it.
- Verbose system_prompt_extra. The model sometimes overshoots the 512-char cap; relax the cap to 600 or apply the shim.
- No NSFW. Training data was Sonnet-distilled, and Sonnet declines explicit traits. NSFW Create-a-Flame is deferred in flammen.ai anyway.

License

MIT