Model: RhinoWithAcape/helium-1-2b-GGUF
Source: Original Platform
2026-06-21 02:34:16 +08:00

license: cc-by-sa-4.0
base_model: kyutai/helium-1-2b
tags: kyutai, helium, llama, base-model, multilingual, edge, mobile, europe, gguf, llama.cpp, 12gb-gpu, 8gb-gpu, 6gb-gpu
language: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
pipeline_tag: text-generation
library_name: gguf

Helium-1-2B — GGUF

🟢 Fits on: every GPU class — even integrated graphics. Runs on phones at Q2_K.

GGUF conversion of kyutai/helium-1-2b — Kyutai's lightweight 2B base language model targeting edge and mobile devices, with native support for all 24 official EU languages.

This is a community quantization. The base model is by Kyutai (creators of Mimi, Moshi, and the Kyutai TTS/STT family). Until now, only MLX (Apple Silicon) variants existed — this fills the GGUF gap for llama.cpp and ollama users.

Model details

| Field | Value |
| --- | --- |
| Architecture | LlamaForCausalLM (standard Llama; works with stock llama.cpp) |
| Parameters | 2B |
| Layers | 28 |
| Hidden size | 2048 |
| Vocab | 64,000 (multilingual) |
| Context | 4K tokens |
| Type | Base model, not instruction-tuned |
| License | CC-BY-SA 4.0 + Gemma Terms of Use (Helium is distilled from Gemma 2) |

Use case

  • Edge / mobile inference: fits comfortably on consumer hardware, including phones and small GPUs
  • EU multilingual base: train your own instruction-following model on top of it, with the language coverage you need
  • Research: distillation lineage from Gemma 2 with a smaller footprint
  • Not for chat out of the box: this is a base model with no instruction tuning. For chat, fine-tune it first.
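
Because the model only continues text, a task is usually posed as a pattern to complete rather than as an instruction. The sketch below builds a few-shot translation prompt; the function name few_shot_prompt and the example sentence pairs are hypothetical and chosen only for illustration.

```shell
# few_shot_prompt QUERY
# Prints a completion-style prompt: two worked examples followed by an open
# slot that the base model is expected to continue.
# The example pairs are illustrative, not taken from the model card.
few_shot_prompt() {
  printf 'English: Good morning.\nFrench: Bonjour.\n\n'
  printf 'English: Thank you very much.\nFrench: Merci beaucoup.\n\n'
  printf 'English: %s\nFrench:' "$1"
}

# Possible use with the llama.cpp command from the usage section:
#   ./build/bin/llama-completion -m helium-1-2b.Q4_K_M.gguf \
#       -p "$(few_shot_prompt 'Where is the station?')" -n 30 --temp 0.6
```

The prompt ends on an open "French:" line so that the most likely continuation is the translation itself. A small -n keeps the model from running on into further invented examples.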

Quants

| Quant | Size | Use case |
| --- | --- | --- |
| Q2_K | ~0.8 GB | tiniest footprint: phones, microcontrollers, 4 GB cards |
| Q3_K_M | ~1.0 GB | balance for 6 GB cards |
| Q4_K_M | ~1.2 GB | recommended default; fits anywhere |
| Q5_K_M | ~1.5 GB | quality bump if you have headroom |
| Q6_K | ~1.8 GB | near-lossless |
| Q8_0 | ~2.3 GB | reference quality |
| F16 | ~4.0 GB | full precision |
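
To turn the size column into a choice, one option is to pick the largest file that fits a memory budget. The sketch below hard-codes the approximate sizes listed above in MB. The function name pick_quant and the 500 MB default headroom (a rough allowance for KV cache and runtime buffers) are assumptions, not measured figures.

```shell
# Approximate file sizes in MB, copied from the quant sizes listed above.
# Ordered smallest to largest so the last match is the highest-quality fit.
QUANTS="Q2_K:800 Q3_K_M:1000 Q4_K_M:1200 Q5_K_M:1500 Q6_K:1800 Q8_0:2300 F16:4000"

# pick_quant BUDGET_MB [HEADROOM_MB]
# Prints the largest quant whose file fits in BUDGET_MB minus HEADROOM_MB.
# HEADROOM_MB defaults to 500, an assumed allowance rather than a measurement.
# Returns non-zero and prints nothing if no quant fits.
pick_quant() {
  local budget=$1
  local headroom=${2:-500}
  local usable=$((budget - headroom))
  local best="" entry name size
  for entry in $QUANTS; do
    name=${entry%%:*}   # text before the colon
    size=${entry##*:}   # text after the colon
    if [ "$size" -le "$usable" ]; then
      best=$name
    fi
  done
  [ -n "$best" ] || return 1
  printf '%s\n' "$best"
}
```

Calling pick_quant 2048 leaves 1548 MB usable and therefore selects Q5_K_M. Real memory use also depends on context length and backend, so treat the result as a starting point.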

Usage — Ollama

hf download RhinoWithAcape/helium-1-2b-GGUF \
  helium-1-2b.Q4_K_M.gguf Modelfile --local-dir ./helium
cd ./helium
ollama create helium-1-2b:Q4_K_M -f Modelfile
ollama run helium-1-2b:Q4_K_M "Once upon a time"
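
The repository ships its own Modelfile, and its contents are not reproduced here. If you need to write one yourself, a minimal sketch for a base model might look like the following. The helper name write_modelfile is hypothetical, and the parameter values are assumptions taken from the 4K context and the sampling temperature used elsewhere in this README.

```shell
# write_modelfile GGUF_PATH [OUT_FILE]
# Writes a minimal Ollama Modelfile for a base (non-chat) model.
# This is an assumed layout, not a copy of the Modelfile in the repository.
write_modelfile() {
  local gguf=$1
  local out=${2:-Modelfile}
  cat > "$out" <<EOF
FROM $gguf
# Pass the prompt through untouched: a base model has no chat template.
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.6
PARAMETER num_ctx 4096
EOF
}
```

Running write_modelfile ./helium-1-2b.Q4_K_M.gguf produces a file that can be passed to ollama create with -f, as in the commands above.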

Usage — llama.cpp

./build/bin/llama-completion \
    -m helium-1-2b.Q4_K_M.gguf \
    -p "The capital of France is" \
    -n 30 --temp 0.6

(Sample: "The capital of France is Paris...")

License notes

  • This conversion is CC-BY-SA 4.0 (matching the source release).
  • Helium-1 is distilled from Gemma 2, so use is also subject to the Gemma Terms of Use.
  • This GGUF inherits both terms.

Conversion details

  • Source: kyutai/helium-1-2b (downloaded 2026-04-29; Q2_K + Q3_K_M backfilled 2026-05-02)
  • Tools: stock llama.cpp (no patches required — standard Llama arch)
  • Steps: convert_hf_to_gguf.py → llama-quantize
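
The two steps can be sketched as a pair of llama.cpp commands. The helper below only prints them, so nothing runs until you copy the output. The function name conversion_commands, the intermediate F16 file name, and the choice to quantize from an F16 export are assumptions; the exact invocation used for this release is not documented here.

```shell
# conversion_commands HF_DIR NAME QUANT
# Prints (does not run) the two llama.cpp commands needed for one quant.
# Assumes it is used from the llama.cpp checkout with binaries in ./build/bin.
conversion_commands() {
  local hf_dir=$1
  local name=$2
  local quant=$3
  local f16="$name.F16.gguf"   # assumed name for the intermediate file
  # Step 1: Hugging Face checkpoint -> full-precision GGUF.
  echo "python convert_hf_to_gguf.py $hf_dir --outfile $f16 --outtype f16"
  # Step 2: full-precision GGUF -> quantized GGUF.
  echo "./build/bin/llama-quantize $f16 $name.$quant.gguf $quant"
}
```

For example, conversion_commands ./helium-1-2b helium-1-2b Q4_K_M prints the convert command followed by the quantize command that yields helium-1-2b.Q4_K_M.gguf.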

More from RhinoWithAcape

We're a small AI lab making powerful models actually run on consumer GPUs. Curated GGUFs with the full Q2/Q3/Q4 ladder for 12-16 GB cards and first-mover conversions for new architectures.

→ Full catalogue at huggingface.co/RhinoWithAcape

Acknowledgments

  • Kyutai for the open release of Helium-1, targeting under-served EU language coverage at edge scale
  • Google DeepMind for the Gemma 2 base from which Helium was distilled
  • llama.cpp maintainers