magos-k8s-0.6b/README.md
ModelHub XC 02da0760c5 Initialize project; model provided by the ModelHub XC community
Model: clglavan/magos-k8s-0.6b
Source: Original Platform
2026-05-31 16:57:17 +08:00

---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-0.6B
library_name: transformers
pipeline_tag: text-generation
tags:
- kubernetes
- devops
- sft
- cpt
- qwen3
- gguf
---
# magos-k8s-0.6b
A small (0.6B parameter) **Kubernetes debugging assistant**, fine-tuned from
**Qwen3-0.6B** on Kubernetes documentation, the full Kubernetes API reference
(every resource Kind), the kubectl command reference, and Prometheus alert
runbooks.
## Purpose
magos exists to **nudge any input — vague, off-topic, or underspecified — into
a debugging mindset**: identify the missing information, propose the
`kubectl`/`promtool`/log-inspection step that would resolve the ambiguity,
and respond with a concrete next action instead of speculation. This bias
toward "what would I check next?" is trained in by design, not an emergent
property.
That bias is what makes magos **ideal for autonomous devops agents** running
in a planner → executor loop. The model is trained to emit the *next executable
move* (a kubectl invocation, a YAML patch, a metric to scrape) rather than
prose, which is what an agent needs to make progress on its own. It runs
locally at ~600 MB (Q8) and is fast enough to be the inner-loop reasoner of
an agent that also carries longer-horizon planning elsewhere.
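As a rough illustration of the executor side of such a loop, the sketch below pulls candidate `kubectl` invocations out of a model reply and flags which ones look read-only. The helper names and the read-only verb list are assumptions made for this example; they are not part of magos or of any agent framework.

```python
import shlex

# Verbs treated as read-only in this sketch (an assumption; tune per environment).
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "events"}


def extract_kubectl_commands(reply: str) -> list[list[str]]:
    """Hypothetical helper: return each kubectl line of a reply as an argv list."""
    commands = []
    for line in reply.splitlines():
        # Drop common markdown decoration: list bullets, "$ " prompts, backticks.
        cleaned = line.strip().lstrip("$-*` ").rstrip("` ")
        if not cleaned.startswith("kubectl "):
            continue
        try:
            commands.append(shlex.split(cleaned))
        except ValueError:
            # Unbalanced quotes: skip the line rather than guess.
            continue
    return commands


def is_read_only(argv: list[str]) -> bool:
    """True only when the verb directly follows `kubectl` and is in the allow-list."""
    return len(argv) > 1 and argv[1] in READ_ONLY_VERBS


reply = (
    "Check the pod first:\n"
    "```\n"
    "kubectl describe pod web-0 -n prod\n"
    "```\n"
    "- `kubectl delete pod web-0 -n prod`\n"
)
for argv in extract_kubectl_commands(reply):
    print("read-only" if is_read_only(argv) else "needs review", argv)
```

Commands that put global flags before the verb (for example `kubectl -n prod get pods`) fall into "needs review" here, which is the conservative default for an executor.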
### Pair with RAG for best results
magos is intentionally small and ships frozen knowledge from the
documentation snapshot it was trained on. **Pairing it with a retrieval
layer** (your live cluster's `kubectl explain` output, your team's runbooks,
your fleet's current CRD schemas, recent incident postmortems) lifts answer
quality substantially while keeping the model itself tiny — you get the
debugging-mindset reflex from magos plus authoritative, current facts from
your retriever, instead of having to grow the model to memorise everything.
A typical setup: retrieve 25 short snippets relevant to the user's symptom
and prepend them to the prompt; magos will weave them into its
next-step-first response.
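A minimal sketch of the prepend step, assuming the snippets have already been retrieved as plain strings. The `build_messages` name, the `Context:`/`Question:` layout, and the character budget are illustrative assumptions, not a prompt format magos was trained on.

```python
def build_messages(question: str, snippets: list[str], max_chars: int = 4000) -> list[dict]:
    """Prepend retrieved snippets to the user question as a single chat message."""
    context_lines = []
    used = 0
    for index, snippet in enumerate(snippets, start=1):
        text = " ".join(snippet.split())  # collapse newlines and repeated spaces
        if used + len(text) > max_chars:
            break  # stop once the (assumed) character budget is exhausted
        context_lines.append(f"[{index}] {text}")
        used += len(text)
    if not context_lines:
        return [{"role": "user", "content": question}]
    context = "\n".join(context_lines)
    return [{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}]


messages = build_messages(
    "Why is web-0 stuck in Pending?",
    ["pod has unbound immediate\n  PersistentVolumeClaims"],
)
print(messages[0]["content"])
```

The returned list has the same `messages` shape used in the usage examples later in this README, so it can be passed to either runtime unchanged.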
## What's new in v8 (vs v7)
| | v7 | **v8** |
|---|---|---|
| Stage 2 training examples | ~6,100 | **~6,740** (+10%) |
| YAML bucket | 780 unfiltered | **521 schema-filtered** — every example's apiVersion+fields validated against the K8s v1.34 OpenAPI spec; ~33% invented-field examples dropped before training |
| Anti-hallucination contrast bucket | none | **~317 new examples** teaching wrong-vs-right pairs for kubectl flags, YAML field names, and diagnosis patterns mined from v7's actual failures |
| General-instruct mix | none | **~600 Alpaca examples** (~9%) blended in to defend against catastrophic forgetting of base reasoning |
| Stage 2 LR / epochs | 1.5e-5 / 2 epochs | **1.5e-5 / 2 epochs** (unchanged — proven recipe) |
| Stage 2 eval_loss | 1.667 | 1.716 (slightly higher — expected, since 9% of examples are out-of-K8s-distribution Alpaca) |
### Why these changes
v7's main weaknesses surfaced in agent-usability review:
1. **Specific flag/field hallucinations**: `--show-namespace`, `--limit` on
`kubectl logs`, `volumeAccessModes`, `autoscaling/v2beta3`. We mined the
actual hallucinations v7 produced across 75 benchmark verdicts (817
occurrences) and built targeted **contrast pairs** — for each known wrong
pattern, a paired Q&A that explicitly contrasts it with the correct one.
2. **YAML schema invention**: v7's YAML bucket was not validated post-synth.
v8 runs each example through the v1.34 OpenAPI lookup and drops any
example with >2 invented field paths.
3. **General-reasoning regression**: v7 lost 2 points on the general bucket
vs v6 (14 → 12 in the benchmark table below). v8 mixes in a small Alpaca
slice so non-K8s prompts stay sharp.
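The drop rule in item 2 can be sketched as follows. This is an illustration only: the real pipeline checks against the Kubernetes v1.34 OpenAPI spec, whereas the hand-written `PVC_SCHEMA` below is a tiny stand-in, and the function names are hypothetical.

```python
# Toy stand-in for an OpenAPI lookup: keys are allowed fields, None marks a
# leaf or free-form value whose contents are not checked further.
PVC_SCHEMA = {
    "apiVersion": None,
    "kind": None,
    "metadata": {"name": None, "namespace": None, "labels": None},
    "spec": {
        "accessModes": None,
        "storageClassName": None,
        "resources": {"requests": None, "limits": None},
    },
}


def invented_paths(obj, schema, prefix=""):
    """Return field paths that appear in obj but not in the schema."""
    if schema is None:
        return []
    if isinstance(obj, list):  # apply the same schema to every list item
        return [p for item in obj for p in invented_paths(item, schema, prefix)]
    if not isinstance(obj, dict):
        return []
    unknown = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if key not in schema:
            unknown.append(path)
        else:
            unknown.extend(invented_paths(value, schema[key], path))
    return unknown


def keep_example(manifest, schema, max_invented=2):
    """Keep an example unless it has more than max_invented invented paths."""
    return len(invented_paths(manifest, schema)) <= max_invented


manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data"},
    "spec": {"volumeAccessModes": ["ReadWriteOnce"],
             "resources": {"requests": {"storage": "1Gi"}}},
}
print(invented_paths(manifest, PVC_SCHEMA), keep_example(manifest, PVC_SCHEMA))
```

Note that a threshold of "more than 2" still lets an example with one or two invented paths through, so the filtered bucket reduces schema invention rather than eliminating it.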
### Benchmarks (3-judge consensus, anonymized review of v6 vs v7 vs v8 across 25 prompts)
Each of 25 prompts was evaluated by 3 independent reviewers who saw the
responses **anonymized as A/B/C** with the rubric for that prompt. Reviewers
were forced to produce explicit reasoning, list verified facts and
hallucinations, and rate `agent_usable` before assigning a 1-5 score. Final
per-prompt score is the **median of the 3 reviewers' scores**.
| Bucket | Max | v6 | v7 | **v8** |
|---|---|---|---|---|
| kubectl/CLI accuracy | 30 | 8 | 10 | **14** (+4) |
| YAML manifest validity | 25 | 6 | 11 | **12** (+1) |
| Debugging diagnose | 30 | 9 | **10** | 8 (-2) |
| Prometheus runbook | 25 | **7** | **7** | 6 (-1) |
| General reasoning | 15 | 14 | 12 | **15** (+3) |
| **Total** | **125** | **44 (35%)** | **50 (40%)** | **55 (44%)** |
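The scoring rule and the totals in the table can be reproduced with a few lines of Python. The bucket numbers are copied from the table; the function names are only for illustration.

```python
from statistics import median


def prompt_score(reviewer_scores: list[int]) -> int:
    """Final per-prompt score: the median of the three reviewers' 1-5 scores."""
    return median(reviewer_scores)


# Bucket maxima and per-version points, copied from the table above.
BUCKETS = {
    "kubectl/CLI accuracy":   {"max": 30, "v6": 8,  "v7": 10, "v8": 14},
    "YAML manifest validity": {"max": 25, "v6": 6,  "v7": 11, "v8": 12},
    "Debugging diagnose":     {"max": 30, "v6": 9,  "v7": 10, "v8": 8},
    "Prometheus runbook":     {"max": 25, "v6": 7,  "v7": 7,  "v8": 6},
    "General reasoning":      {"max": 15, "v6": 14, "v7": 12, "v8": 15},
}


def totals(buckets: dict) -> dict:
    """Sum each version's points across all buckets."""
    return {v: sum(row[v] for row in buckets.values()) for v in ("v6", "v7", "v8")}


def percent(points: int, buckets: dict) -> int:
    """Points as a rounded percentage of the combined bucket maxima."""
    return round(100 * points / sum(row["max"] for row in buckets.values()))


summary = totals(BUCKETS)
for version, points in summary.items():
    print(version, points, f"{percent(points, BUCKETS)}%")
```

With 25 prompts scored 1-5, the maximum is 125, which matches the sum of the bucket maxima.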
**Headline:** v8 takes the largest single-version jump yet in kubectl
accuracy (+4 points on a 30-point bucket) and recovers full general-reasoning
performance, at a small cost in Diagnose and Runbook accuracy (-2 and -1).
The Alpaca mix successfully defended against forgetting; the contrast bucket
visibly suppressed the specific hallucinated flags v7 was repeating.
**Honest absolute level:** even v8 scores 44% on this benchmark. The judges
grade strictly for *agent-usability* — a single invented flag or wrong
apiVersion is enough to mark a response as not-executable. v8 is the best
version of magos yet, but there is substantial room to grow toward 100%.
To pin a specific version when loading:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b", revision="v8")
# or revision="v7" / "v6" / "v5" / "v3" / "v2" for previous versions
```
## What it's good at
- **kubectl command construction**: v8's most improved area (+4 points over
v7). Real flags in correct forms, and the `--show-namespace` and
`kubectl logs --limit` inventions that v7 repeated are suppressed.
- **YAML manifest generation**: Pod, Deployment, Service, NetworkPolicy,
PVC, HPA, ConfigMap, Secret, RBAC and ~70 other top-level Kinds, trained on a
schema-filtered set so apiVersion and field names are mostly correct.
- **Diagnosing pasted errors**: `kubectl describe` output, log lines, alert
payloads → root cause + next-step suggestions
- **Prometheus alert handling**: meaning + diagnostic steps for the
prometheus-operator runbook set (KubePodCrashLooping, etcdBackendQuotaLowSpace,
AlertmanagerClusterDown, etc.)
- **Agent-style outputs**: short, command-first responses suitable for
autonomous execution rather than human reading
- **Basic general reasoning**: Alpaca mix preserves math, generic CS facts,
short explanations
## What it's not good at
- Multi-step planning or complex tool chains — it's a 0.6B model
- **Subtle/rare flags** — common flags are reliable; rare-but-real flags are
still sometimes hallucinated. Always sanity-check with `kubectl --help`.
- **Multi-flag combinations on the same command** — accuracy drops as flag
count goes up
- Knowledge of features released after the source docs were captured (mid-2026)
- Long-form thinking — SFT suppressed Qwen3's `<think>` behavior
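One way to automate the `kubectl --help` sanity check is to compare the flags in a generated command against the flags named in the subcommand's help text. The sketch below assumes the help text has already been captured (for example from `kubectl logs --help`); `SAMPLE_HELP` is an abbreviated stand-in written for this example, not verbatim kubectl output, and the function names are hypothetical.

```python
import re

# Abbreviated stand-in for `kubectl logs --help` output (not verbatim).
SAMPLE_HELP = """
  -c, --container='': Print the logs of this container
  -f, --follow=false: Specify if the logs should be streamed
      --limit-bytes=0: Maximum bytes of logs to return
  -p, --previous=false: Print the logs for the previous instance
      --since=0s: Only return logs newer than a relative duration
      --tail=-1: Lines of recent log file to display
"""

# A flag is one or two dashes followed by a letter, not glued to a preceding word.
FLAG_RE = re.compile(r"(?<![\w-])(--?[A-Za-z][\w-]*)")


def flags_in_help(help_text: str) -> set[str]:
    """Collect every flag name mentioned in the help text."""
    return set(FLAG_RE.findall(help_text))


def unknown_flags(argv: list[str], help_text: str) -> list[str]:
    """Return flags used in argv that the help text does not mention."""
    known = flags_in_help(help_text)
    used = [tok.split("=", 1)[0] for tok in argv if tok.startswith("-") and tok != "--"]
    return [flag for flag in used if flag not in known]


argv = ["kubectl", "logs", "web-0", "--tail=50", "--limit=10"]
print(unknown_flags(argv, SAMPLE_HELP))
```

Global flags such as `-n`/`--namespace` are listed by `kubectl options` rather than in each subcommand's help, so a real check would merge both sources before comparing.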
## How to use
### llama.cpp / Ollama / LM Studio
Three GGUF quantization levels are included — pick one:
| File | Size | Quality |
|---|---|---|
| `magos-k8s-0.6b-f16.gguf` | 1.2 GB | reference (unquantized 16-bit) |
| `magos-k8s-0.6b-q8_0.gguf` | 610 MB | effectively identical to f16, half the size — **recommended** |
| `magos-k8s-0.6b-q4_k_m.gguf` | 379 MB | smallest. Some quality loss — `kubectl` flag/argument mistakes appear more often than with q8/f16. Fine for casual use, not recommended for accuracy-critical tasks. |
Example with `llama-cpp-python`:
```python
from llama_cpp import Llama

# Load the recommended 8-bit GGUF with the ChatML chat format.
llm = Llama(model_path="magos-k8s-0.6b-q8_0.gguf", n_ctx=4096, chat_format="chatml")

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Drain node worker-3 ignoring DaemonSets and deleting local-storage pods."}],
    temperature=0.05,
    repeat_penalty=1.15,
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])
```
The `temperature=0.05` and `repeat_penalty=1.15` settings are important:
0.6B models loop on longer structured outputs without a repetition penalty.
### Hugging Face transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("clglavan/magos-k8s-0.6b")
model = AutoModelForCausalLM.from_pretrained(
    "clglavan/magos-k8s-0.6b",
    dtype="bfloat16",  # older transformers releases name this argument torch_dtype
    device_map="auto",
)

messages = [{"role": "user", "content": "Give me a NetworkPolicy that denies all egress from app pods except DNS."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=384,
    do_sample=True,
    temperature=0.05,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.15,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Training
| | |
|---|---|
| Base model | Qwen/Qwen3-0.6B |
| Method | Two-stage: **continued pre-training (CPT) → supervised fine-tuning (SFT)**. Both full-weight (no LoRA). |
| **Stage 1 corpus** | ~8.5k document chunks: kubernetes.io docs + blog (~6.5k), Kubernetes API reference v1.34 (~1.9k), Prometheus alert runbooks (~106). **Reused from v5/v6/v7 — corpus unchanged.** |
| Stage 1 tokens | ~6.5M |
| Stage 1 LR | 5e-6, cosine, 3% warmup, 1 epoch |
| **Stage 2 corpus (v8)** | ~6,740 synthetic Q&A pairs. Distribution: K8s debugging (~1.7k), K8s API field/schema (~1.3k), Prometheus runbook (~1.0k, 10 examples per runbook), kubectl reference (~1.3k, 15 per subcommand), **schema-filtered YAML bucket (~520)**, **anti-hallucination contrast bucket (~317)**, **general-instruct mix (~600)** |
| Stage 2 LR | 1.5e-5, cosine, 3% warmup, 2 epochs |
| Micro batch / grad accum | 1 / 16 (effective batch 16) |
| Precision | bfloat16 |
| Sequence length | 2048 |
| Stage 1 eval_loss | 1.71 |
| Stage 2 eval_loss | 1.72 (v7 was 1.67; the small regression reflects the 9% Alpaca slice being out-of-K8s-distribution — judge benchmark is the real measure) |
## Files
- `model.safetensors` — fine-tuned weights, HF format (1.2 GB, bf16)
- `magos-k8s-0.6b-f16.gguf` — GGUF, full precision (1.2 GB)
- `magos-k8s-0.6b-q8_0.gguf` — GGUF, 8-bit quantization (610 MB)
- `magos-k8s-0.6b-q4_k_m.gguf` — GGUF, 4-bit quantization (379 MB)
- `tokenizer.json`, `tokenizer_config.json` — Qwen3 tokenizer
- `chat_template.jinja` — Qwen3 ChatML template
- `config.json`, `generation_config.json` — standard HF configs (with magos sampling defaults)
## Limitations and intended use
This is a small experimental model. Always verify any command, YAML, or
behavioral claim against current Kubernetes documentation before running in
production. It is intended for learning, prototyping, and as a component in
local devops agents — not as an authoritative source.
## License
Apache 2.0, inherited from the Qwen3-0.6B base model license. The training data
is derived from the official Kubernetes documentation (CC-BY 4.0) and the
prometheus-operator runbooks (Apache 2.0).