Initialize project; model provided by the ModelHub XC community
Model: clglavan/magos-k8s-0.6b Source: Original Platform
README.md
Normal file
@@ -0,0 +1,229 @@
---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-0.6B
library_name: transformers
pipeline_tag: text-generation
tags:
- kubernetes
- devops
- sft
- cpt
- qwen3
- gguf
---

# magos-k8s-0.6b

A small (0.6B parameter) **Kubernetes debugging assistant**, fine-tuned from
**Qwen3-0.6B** on Kubernetes documentation, the full Kubernetes API reference
(every resource Kind), the kubectl command reference, and Prometheus alert
runbooks.

## Purpose

magos exists to **nudge any input, whether vague, off-topic, or underspecified,
into a debugging mindset**: identify the missing information, propose the
`kubectl`/`promtool`/log-inspection step that would resolve the ambiguity,
and respond with a concrete next action instead of speculation. This bias
toward "what would I check next?" is trained in by design; it is not an
emergent property.

That bias is what makes magos **well suited to autonomous devops agents**
running in a planner → executor loop. The model is tuned to emit the *next
executable move* (a kubectl invocation, a YAML patch, a metric to scrape)
rather than prose, which is what an agent needs to make progress on its own.
It runs locally at ~600 MB (Q8) and is fast enough to be the inner-loop
reasoner of an agent that handles longer-horizon planning elsewhere.

### Pair with RAG for best results

magos is intentionally small and ships frozen knowledge from the
documentation snapshot it was trained on. **Pairing it with a retrieval
layer** (your live cluster's `kubectl explain` output, your team's runbooks,
your fleet's current CRD schemas, recent incident postmortems) lifts answer
quality substantially while keeping the model itself tiny: you get the
debugging-mindset reflex from magos plus authoritative, current facts from
your retriever, instead of having to grow the model to memorise everything.
A typical setup: retrieve 2–5 short snippets relevant to the user's symptom
and prepend them to the prompt; magos will weave them into its
next-step-first response.

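
The "retrieve 2–5 snippets and prepend them" setup can be sketched as a small prompt builder. This is a minimal illustration, not part of the magos repository: the function name `build_rag_messages`, the `[context N]` labels, and the choice to put everything in one user message are all assumptions, and the retriever itself is left out.

```python
from typing import Dict, List


def build_rag_messages(
    question: str, snippets: List[str], max_snippets: int = 5
) -> List[Dict[str, str]]:
    """Prepend retrieved snippets to the user's question (hypothetical helper).

    Returns a chat-style message list that can be passed to a chat
    completion call or to a tokenizer's chat template.
    """
    # Drop empty snippets and cap the count so the prompt stays short.
    kept = [s.strip() for s in snippets if s.strip()][:max_snippets]
    if not kept:
        return [{"role": "user", "content": question}]
    # Label each snippet so the model can tell context apart from the question.
    context = "\n\n".join(
        f"[context {i}]\n{s}" for i, s in enumerate(kept, start=1)
    )
    return [{"role": "user", "content": f"{context}\n\n{question}"}]
```

The returned list has the same shape as the `messages` argument used in the usage examples in this README, so it can be dropped in without other changes.
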
## What's new in v8 (vs v7)

| | v7 | **v8** |
|---|---|---|
| Stage 2 training examples | ~6,100 | **~6,740** (+10%) |
| YAML bucket | 780 unfiltered | **521 schema-filtered**: every example's apiVersion+fields validated against the K8s v1.34 OpenAPI spec; ~33% invented-field examples dropped before training |
| Anti-hallucination contrast bucket | none | **~317 new examples** teaching wrong-vs-right pairs for kubectl flags, YAML field names, and diagnosis patterns mined from v7's actual failures |
| General-instruct mix | none | **~600 Alpaca examples** (~9%) blended in to defend against catastrophic forgetting of base reasoning |
| Stage 2 LR / epochs | 1.5e-5 / 2 epochs | **1.5e-5 / 2 epochs** (unchanged; proven recipe) |
| Stage 2 eval_loss | 1.667 | 1.716 (slightly higher, as expected, since 9% of examples are out-of-K8s-distribution Alpaca) |

### Why these changes

v7's main weaknesses surfaced in agent-usability review:

1. **Specific flag/field hallucinations**: `--show-namespace`, `--limit` on
   `kubectl logs`, `volumeAccessModes`, `autoscaling/v2beta3`. We mined the
   actual hallucinations v7 produced across 75 benchmark verdicts (817
   occurrences) and built targeted **contrast pairs**: for each known wrong
   pattern, a paired Q&A that explicitly contrasts it with the correct one.
2. **YAML schema invention**: v7's YAML bucket was not validated after
   synthesis. v8 runs each example through the v1.34 OpenAPI lookup and drops
   any example with more than 2 invented field paths.
3. **General-reasoning regression**: v7 lost 3 points on the general bucket
   vs v6. v8 mixes in a small Alpaca slice so non-K8s prompts stay sharp.

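
The YAML filter in item 2 (drop an example with more than 2 invented field paths) can be illustrated with a toy version. This is a sketch under stated assumptions, not the actual v8 pipeline: `known_paths` stands in for whatever the v1.34 OpenAPI lookup returns, the helper names are hypothetical, and free-form maps such as `metadata.labels` would need special handling that is omitted here.

```python
from typing import Any, Iterator, List, Set

MAX_INVENTED_PATHS = 2  # threshold from the text: drop if more than 2


def field_paths(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield a dotted path for every mapping key in a parsed manifest.

    List items share their parent's path with a "[]" suffix, so
    spec.containers[0].name becomes "spec.containers[].name".
    """
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from field_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            yield from field_paths(item, f"{prefix}[]")


def invented_paths(manifest: dict, known_paths: Set[str]) -> List[str]:
    """Return the sorted field paths that the schema does not know about."""
    return sorted({p for p in field_paths(manifest) if p not in known_paths})


def keep_example(manifest: dict, known_paths: Set[str]) -> bool:
    """Keep a training example unless it exceeds the invented-path threshold."""
    return len(invented_paths(manifest, known_paths)) <= MAX_INVENTED_PATHS
```

Given a set of known PersistentVolumeClaim paths, a manifest that uses `spec.volumeAccessModes` (one of the hallucinated fields listed above) plus two other unknown fields would be dropped, while a manifest with a single unknown field would still pass the threshold.
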
### Benchmarks (3-judge consensus, anonymized review of v6 vs v7 vs v8 across 25 prompts)

Each of 25 prompts was evaluated by 3 independent reviewers who saw the
responses **anonymized as A/B/C** with the rubric for that prompt. Reviewers
were required to produce explicit reasoning, list verified facts and
hallucinations, and rate `agent_usable` before assigning a 1-5 score. The final
per-prompt score is the **median of the 3 reviewers' scores**.

| Bucket | Max | v6 | v7 | **v8** |
|---|---|---|---|---|
| kubectl/CLI accuracy | 30 | 8 | 10 | **14** (+4) |
| YAML manifest validity | 25 | 6 | 11 | **12** (+1) |
| Debugging diagnose | 30 | 9 | **10** | 8 (-2) |
| Prometheus runbook | 25 | **7** | **7** | 6 (-1) |
| General reasoning | 15 | 14 | 12 | **15** (+3) |
| **Total** | **125** | **44 (35%)** | **50 (40%)** | **55 (44%)** |

**Headline:** v8 takes the largest single-version jump yet in kubectl
accuracy (+4 points on a 30-point bucket) and recovers full general-reasoning
performance, at a small cost in Diagnose and Runbook accuracy (-2 and -1).
The Alpaca mix successfully defended against forgetting; the contrast bucket
visibly suppressed the specific hallucinated flags v7 was repeating.

**Honest absolute level:** even v8 scores 44% on this benchmark. The judges
grade strictly for *agent-usability*: a single invented flag or wrong
apiVersion is enough to mark a response as not-executable. v8 is the best
version of magos yet, but there is substantial room to grow toward 100%.

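
The scoring procedure described above (median of three reviewer scores per prompt, summed per bucket) can be sketched as follows. The helper names are hypothetical and this is not the benchmark harness itself; the only real data in the block are the v8 bucket scores copied from the table.

```python
from statistics import median
from typing import Dict, List, Tuple

# v8 bucket scores as reported in the benchmark table.
V8_BUCKETS: Dict[str, int] = {
    "kubectl/CLI accuracy": 14,
    "YAML manifest validity": 12,
    "Debugging diagnose": 8,
    "Prometheus runbook": 6,
    "General reasoning": 15,
}
MAX_TOTAL = 125  # 25 prompts x 5 points


def prompt_score(reviewer_scores: List[int]) -> int:
    """Final score for one prompt: the median of exactly three 1-5 scores."""
    if len(reviewer_scores) != 3:
        raise ValueError("expected exactly 3 reviewer scores")
    return int(median(reviewer_scores))


def bucket_totals(results: List[Tuple[str, List[int]]]) -> Dict[str, int]:
    """Sum per-prompt median scores into per-bucket totals."""
    totals: Dict[str, int] = {}
    for bucket, scores in results:
        totals[bucket] = totals.get(bucket, 0) + prompt_score(scores)
    return totals


def percent(total: int, maximum: int = MAX_TOTAL) -> int:
    """Total as a whole-number percentage of the maximum."""
    return round(100 * total / maximum)
```

Using the median rather than the mean means a single outlier reviewer cannot move a prompt's score; summing `V8_BUCKETS` and passing the result to `percent` reproduces the 55 and 44% figures in the Total row.
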
To pin a specific version when loading:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b", revision="v8")
# or revision="v7" / "v6" / "v5" / "v3" / "v2" for previous versions
```

## What it's good at

- **kubectl command construction**: v8's strongest area. Real flags and
  correct flag forms, without the invented flags v7 produced (such as
  `--show-namespace`, or `--limit` on `kubectl logs`).
- **YAML manifest generation**: the schema-validated training set covers Pod,
  Deployment, Service, NetworkPolicy, PVC, HPA, ConfigMap, Secret, RBAC and
  ~70 other top-level Kinds with correct apiVersion and field names.
- **Diagnosing pasted errors**: `kubectl describe` output, log lines, alert
  payloads → root cause + next-step suggestions
- **Prometheus alert handling**: meaning + diagnostic steps for the
  prometheus-operator runbook set (KubePodCrashLooping, etcdBackendQuotaLowSpace,
  AlertmanagerClusterDown, etc.)
- **Agent-style outputs**: short, command-first responses suitable for
  autonomous execution rather than human reading
- **Basic general reasoning**: the Alpaca mix preserves math, generic CS facts,
  and short explanations

## What it's not good at

- Multi-step planning or complex tool chains: it's a 0.6B model
- **Subtle/rare flags**: common flags are reliable; rare-but-real flags are
  still sometimes hallucinated. Always sanity-check with `kubectl --help`.
- **Multi-flag combinations on the same command**: accuracy drops as flag
  count goes up
- Knowledge of features released after the source docs were captured (mid-2026)
- Long-form thinking: SFT suppressed Qwen3's `<think>` behavior

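
The `kubectl --help` sanity check can be partly automated before an agent executes a generated command. The sketch below is an assumption-laden illustration: the function names are hypothetical, it only looks at long `--flag` options (short flags such as `-n` are ignored), and it expects the help text to be supplied by the caller, for example the captured output of `kubectl <subcommand> --help`.

```python
import re
from typing import List, Set

# Long options only: "--" followed by a lowercase letter, then letters,
# digits, or dashes. The lookbehind avoids matching inside other tokens.
FLAG_RE = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def flags_in(text: str) -> Set[str]:
    """Return every long flag name that appears in the text."""
    return set(FLAG_RE.findall(text))


def unknown_flags(command: str, help_text: str) -> List[str]:
    """Flags used in the command that the supplied help text never mentions."""
    return sorted(flags_in(command) - flags_in(help_text))
```

A non-empty result from `unknown_flags` is a cue to reject or regenerate the command rather than run it. Matching is done on whole flag names, so a help text that documents `--limit-bytes` does not make a generated `--limit` look valid.
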
## How to use

### llama.cpp / Ollama / LM Studio

Three GGUF variants are included; pick one:

| File | Size | Quality |
|---|---|---|
| `magos-k8s-0.6b-f16.gguf` | 1.2 GB | reference (full bf16 precision) |
| `magos-k8s-0.6b-q8_0.gguf` | 610 MB | effectively identical to f16 at half the size; **recommended** |
| `magos-k8s-0.6b-q4_k_m.gguf` | 379 MB | smallest. Some quality loss: `kubectl` flag/argument mistakes appear more often than with q8/f16. Fine for casual use, not recommended for accuracy-critical tasks. |

Example with `llama-cpp-python`:

```python
from llama_cpp import Llama

# Load the recommended Q8 file with the ChatML chat format.
llm = Llama(model_path="magos-k8s-0.6b-q8_0.gguf", n_ctx=4096, chat_format="chatml")
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Drain node worker-3 ignoring DaemonSets and deleting local-storage pods."}],
    temperature=0.05,     # near-greedy decoding
    repeat_penalty=1.15,  # guards against looping on structured output
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])
```

The `temperature=0.05` and `repeat_penalty=1.15` defaults are important:
0.6B models loop on longer structured outputs without a repetition penalty.

### Hugging Face transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("clglavan/magos-k8s-0.6b")
model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b",
                                             dtype="bfloat16",
                                             device_map="auto")

messages = [{"role": "user", "content": "Give me a NetworkPolicy that denies all egress from app pods except DNS."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=384,
                     do_sample=True, temperature=0.05,
                     top_p=0.95, top_k=20, repetition_penalty=1.15)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Training

| | |
|---|---|
| Base model | Qwen/Qwen3-0.6B |
| Method | Two stage: **continued pre-training (CPT) → supervised fine-tuning (SFT)**. Both full-weight (no LoRA). |
| **Stage 1 corpus** | ~8.5k document chunks: kubernetes.io docs + blog (~6.5k), Kubernetes API reference v1.34 (~1.9k), Prometheus alert runbooks (~106). **Reused from v5/v6/v7; corpus unchanged.** |
| Stage 1 tokens | ~6.5M |
| Stage 1 LR | 5e-6, cosine, 3% warmup, 1 epoch |
| **Stage 2 corpus (v8)** | ~6,740 synthetic Q&A pairs. Distribution: K8s debugging (~1.7k), K8s API field/schema (~1.3k), Prometheus runbook (~1.0k, 10 examples per runbook), kubectl reference (~1.3k, 15 per subcommand), **schema-filtered YAML bucket (~520)**, **anti-hallucination contrast bucket (~317)**, **general-instruct mix (~600)** |
| Stage 2 LR | 1.5e-5, cosine, 3% warmup, 2 epochs |
| Micro batch / grad accum | 1 / 16 (effective batch 16) |
| Precision | bfloat16 |
| Sequence length | 2048 |
| Stage 1 eval_loss | 1.71 |
| Stage 2 eval_loss | 1.72 (v7 was 1.67; the small regression reflects the 9% Alpaca slice being out-of-K8s-distribution; the judge benchmark is the real measure) |

## Files

- `model.safetensors`: fine-tuned weights, HF format (1.2 GB, bf16)
- `magos-k8s-0.6b-f16.gguf`: GGUF, full precision (1.2 GB)
- `magos-k8s-0.6b-q8_0.gguf`: GGUF, 8-bit quantization (610 MB)
- `magos-k8s-0.6b-q4_k_m.gguf`: GGUF, 4-bit quantization (379 MB)
- `tokenizer.json`, `tokenizer_config.json`: Qwen3 tokenizer
- `chat_template.jinja`: Qwen3 ChatML template
- `config.json`, `generation_config.json`: standard HF configs (with magos sampling defaults)

## Limitations and intended use

This is a small experimental model. Always verify any command, YAML, or
behavioral claim against current Kubernetes documentation before running in
production. It is intended for learning, prototyping, and as a component in
local devops agents, not as an authoritative source.

## License

Apache 2.0. Inherits from the Qwen3-0.6B base model license. The training data
is derived from the official Kubernetes documentation (CC-BY 4.0) and the
prometheus-operator Prometheus runbooks (Apache 2.0).