Model: Naseer-010/Qwen3-8B-Finetuned-DIME
Source: Original Platform

2026-05-27 21:44:24 +08:00

Qwen3-8B — GRPO Fine-tuned on DIME

Qwen3-8B fine-tuned via Group Relative Policy Optimization (GRPO) to act as an autonomous Site-Reliability Engineer in a simulated 8-node Kubernetes cluster.

What is this model?

This is a merged BF16 checkpoint of Qwen/Qwen3-8B after 300 steps of GRPO fine-tuning on the DIME (Distributed Infrastructure Management Environment) benchmark — trained in 44 minutes on a single A100-SXM4-80GB.

The model observes per-node CPU, memory, queue depths, and tail-latency telemetry, then outputs a single kubectl command to maintain cluster health. It was fine-tuned from the base checkpoint with a completely redesigned reward signal, after the original reward function was found to produce zero-variance advantages that blocked all gradient flow.

Benchmark Results

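+44.2% relative improvement in overall average score over zero-shot Qwen3-8B on the 14-task DIME benchmark (best fine-tuned episode vs. zero-shot). The table lists the overall average and six of the 14 tasks.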

Metric	Zero-shot	Fine-tuned
Overall avg score	0.394	0.569 (best episode)
connection_pool_deadlock	0.630	0.976
memory_leak_slow_burn	0.990	0.990
node_failure	0.220	0.920
retry_storm	0.377	0.587
thundering_herd	0.393	0.606
traffic_spike	0.024	0.399
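As a quick arithmetic check, the relative improvements can be recomputed from the rounded table values. The dictionary below only copies the table. With rounded inputs the overall figure comes out near 44.4%, so the reported 44.2% presumably reflects unrounded scores (an assumption, not something the source states).

```python
# Scores copied from the table above: (zero-shot, fine-tuned).
scores = {
    "overall": (0.394, 0.569),
    "connection_pool_deadlock": (0.630, 0.976),
    "memory_leak_slow_burn": (0.990, 0.990),
    "node_failure": (0.220, 0.920),
    "retry_storm": (0.377, 0.587),
    "thundering_herd": (0.393, 0.606),
    "traffic_spike": (0.024, 0.399),
}


def relative_improvement(base: float, tuned: float) -> float:
    """Relative change of the fine-tuned score with respect to the zero-shot score."""
    return (tuned - base) / base


rel = {name: relative_improvement(*pair) for name, pair in scores.items()}

for name, value in rel.items():
    print(f"{name:26s} {100 * value:+8.1f}%")
```

Tasks with a very low zero-shot score, such as traffic_spike, produce very large relative gains, so the absolute scores in the table are the safer comparison.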

The Reward Engineering Challenge

The central technical contribution is replacing a gradient-blocking reward cliff with a differentiable seven-component signal.

The original reward returned r = −1000 whenever the database node failed, which happened within the first 3 steps of most episodes. GRPO scores each completion relative to the mean reward of its sampling group, so with all rewards identical the advantages collapsed to zero and no gradient flowed.
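The collapse can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration of group-relative advantage normalisation, not the training code: the `eps` value and the use of the population standard deviation are assumptions, and implementations differ in these details.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Every completion hits the -1000 cliff: identical rewards, zero advantages.
cliff = group_advantages([-1000.0, -1000.0, -1000.0, -1000.0])

# Hypothetical rewards from a bounded signal: completions are distinguishable.
varied = group_advantages([-1.2, 0.4, 2.5, -0.3])

print("cliff :", cliff)
print("varied:", [round(a, 3) for a in varied])
```

With zero advantages, every term of the policy-gradient loss is multiplied by zero, which is why the run made no progress regardless of learning rate.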

The fixed reward:

R_{env}(s,a) = \text{clip}(\, r_{topo} + r_{shed} + r_{mem} + r_{fric} + r_{lat} + r_{up} + r_{eff},\; -5,\; +5 \,)

with bounded components ensuring non-zero gradient everywhere.
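A minimal sketch of the clipped sum follows. The component names mirror the subscripts in the formula; the numeric values are hypothetical, and the formulas for the individual components are not given here, so they are passed in as plain numbers.

```python
def clip(value: float, low: float, high: float) -> float:
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def env_reward(components: dict) -> float:
    """Sum the seven component rewards and clip the total to [-5, +5]."""
    return clip(sum(components.values()), -5.0, 5.0)


# Hypothetical component values for one (state, action) pair.
example = {
    "topo": 1.0, "shed": -0.5, "mem": 0.25, "fric": -0.25,
    "lat": 0.5, "up": 1.0, "eff": 0.0,
}

# Extreme case: every component at +1.0 sums to 7.0 and saturates at the clip.
saturated = {name: 1.0 for name in example}

print(env_reward(example))
print(env_reward(saturated))
```

Unlike the −1000 cliff, the worst possible value here is −5, so one catastrophic step cannot dominate every other reward in the group.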

DIME Environment

8-node cluster: node-0 is a stateful PostgreSQL DB (SPOF), nodes 1–7 are stateless workers
Partial observability: telemetry dropout, signalled by cpu_i = −1
6 action types: restart, reroute, scale_up, throttle, query_logs, no_op
14 failure scenarios: traffic spikes, node failures, memory leaks, retry storms, split-brain, and more
Error budget: throttle burns irreplaceable budget; the agent must be economical
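One consequence of the dropout sentinel is that any aggregate over CPU readings has to skip nodes reporting −1. The sketch below assumes the `cpu_loads` list layout used in the Quick Start observation (index 0 is the database node); the helper names are hypothetical.

```python
DROPOUT_SENTINEL = -1  # cpu_i = -1 marks a node whose telemetry was dropped


def split_by_visibility(cpu_loads):
    """Return ({node_index: cpu} for visible nodes, [indices of dropped nodes])."""
    visible = {i: c for i, c in enumerate(cpu_loads) if c != DROPOUT_SENTINEL}
    dropped = [i for i, c in enumerate(cpu_loads) if c == DROPOUT_SENTINEL]
    return visible, dropped


def mean_visible_worker_cpu(cpu_loads):
    """Average CPU over workers (nodes 1-7) with telemetry; None if none are visible."""
    visible, _ = split_by_visibility(cpu_loads)
    workers = [c for i, c in visible.items() if i != 0]
    return sum(workers) / len(workers) if workers else None


cpu_loads = [0.45, -1, 0.79, 0.88, -1, 0.81, 0.77, 0.73]
visible, dropped = split_by_visibility(cpu_loads)
print("dropped nodes:", dropped)
print("mean worker cpu:", mean_visible_worker_cpu(cpu_loads))
```

Averaging the raw list instead would pull the mean down and could hide a genuine overload.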

Training Dynamics

reward_format reached 3.0 (perfect) from step 1: the model learned the XML scaffold immediately
reward_validity stabilised at 1.9+: no invalid commands after step ~10
reward_env improved steadily: environment physics signal dominated learning
clipped_ratio stayed near 0 throughout: healthy PPO clip utilisation
Total wall-clock: 44 minutes at 8.4 s/step

Triage Oracle and the Priority Inversion Bug

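The model is guided during training by a 10-rule deterministic triage tree. The system prompt under Quick Start lists ten triage rules in priority order, which appear to mirror this tree.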

A critical discovery: the original oracle evaluated the Black Swan rule (|F(s)| ≥ 2 → throttle(0.3)) before the DB Recovery rule (0 ∈ F(s) → restart_node(0)). When multiple nodes including the database were failed, the oracle prescribed throttle instead of restart_node(0). The model faithfully learned this suboptimal policy. Fixing the priority ordering accounted for +0.044 benchmark score improvement.
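The priority inversion is easy to see in a first-match rule list. The sketch below models only the two rules involved, with action strings taken from the paragraph above; the function names and the fall-through to `no_op` are illustrative assumptions, and the other eight rules are omitted.

```python
def black_swan(failed):
    """|F(s)| >= 2 -> throttle(0.3)."""
    return "throttle(0.3)" if len(failed) >= 2 else None


def db_recovery(failed):
    """0 in F(s) -> restart_node(0)."""
    return "restart_node(0)" if 0 in failed else None


def triage(failed, rules):
    """Return the action of the first rule that fires, else no_op."""
    for rule in rules:
        action = rule(failed)
        if action is not None:
            return action
    return "no_op"


BUGGY_ORDER = [black_swan, db_recovery]  # original oracle: Black Swan checked first
FIXED_ORDER = [db_recovery, black_swan]  # corrected oracle: DB Recovery checked first

failed = {0, 3}  # the database node and one worker are down
print("buggy:", triage(failed, BUGGY_ORDER))
print("fixed:", triage(failed, FIXED_ORDER))
```

Both orderings agree whenever the database is alive, so the bug only shows up in states where node 0 is among several failed nodes.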

Failure Modes Documented

Five training configurations failed before convergence — each diagnosable from standard TRL metrics:

Run	Failure	Signal
vLLM on A100-40GB	SM 8.0 segfault (`compilation_config` not set)	Crash at init
batch=4, gen=8	CPU-bound rewards; 126 s/step, GPU idle	`samples/sec = 0.06`
max_comp=256	`<think>` blocks truncated before `</think>`	`frac_reward_zero_std = 1.0`
reward_env×2	10:1 env-to-triage ratio recreated zero-variance	`zero_std → 1.0` at step 119
oracle inverted	Learned `throttle` in DB-failure states	Low `triage/mean`
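Two of the rows above were diagnosed from the fraction of sampling groups whose rewards have zero standard deviation. The following is an illustrative reimplementation of that statistic, not TRL's code; the reward values are made up.

```python
import statistics


def frac_reward_zero_std(groups):
    """Fraction of sampling groups in which every completion got the same reward."""
    zero_groups = sum(1 for rewards in groups if statistics.pstdev(rewards) == 0.0)
    return zero_groups / len(groups)


# Hypothetical batch: one degenerate group, one informative group.
batch = [
    [-3.0, -3.0, -3.0, -3.0],  # e.g. every completion truncated before its closing tag
    [1.0, 2.0, 0.0, 1.0],
]
print(frac_reward_zero_std(batch))
```

A value of 1.0 means no group in the batch carries a learning signal, which matches the max_comp=256 and reward_env×2 rows.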

Reward Components

R(s,a) = R_fmt [-3,+3]  +  R_val [-2,+2]  +  R_env [-5,+5]  +  R_tri [-0.5,+1]

R_fmt: XML scaffold compliance (<reasoning> + <action> tags)
R_val: Syntactic kubectl parse success
R_env: 7-component physics reward (r_topo, r_shed, r_mem, r_fric, r_lat, r_up, r_eff, covering topology, latency, memory, uptime, and budget among others)
R_tri: Oracle triage alignment (gentle guidance, not primary teacher)
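The bounds above fix the range of the total reward and the relative weight of each term. The sketch below sums the four components after clamping each to its stated interval. Whether each term is explicitly clamped or bounded by construction is not specified here, so the clamp is an assumption, and the sample values are hypothetical.

```python
# Stated bounds for each component of the total reward.
BOUNDS = {
    "fmt": (-3.0, 3.0),
    "val": (-2.0, 2.0),
    "env": (-5.0, 5.0),
    "tri": (-0.5, 1.0),
}


def total_reward(parts: dict) -> float:
    """Sum the four components, clamping each to its stated interval."""
    total = 0.0
    for name, (low, high) in BOUNDS.items():
        total += max(low, min(high, parts[name]))
    return total


lowest = total_reward({name: low for name, (low, _) in BOUNDS.items()})
highest = total_reward({name: high for name, (_, high) in BOUNDS.items()})
sample = total_reward({"fmt": 3.0, "val": 1.9, "env": 0.8, "tri": 1.0})

print("range:", lowest, "to", highest)
print("sample:", round(sample, 2))
```

Doubling the environment term to ±10 against a triage maximum of +1 gives the 10:1 ratio named in the reward_env×2 row of the failure table.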

Quick Start

import json

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged BF16 checkpoint; device_map="auto" places the weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    "Naseer-010/Qwen3-8B-Finetuned-DIME",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Naseer-010/Qwen3-8B-Finetuned-DIME")

# System prompt: cluster description, the 10-rule triage order, and the required output format.
system_prompt = """You are an autonomous SRE agent managing an 8-node Kubernetes cluster.
Node-0 is the PostgreSQL database (SPOF). Nodes 1-7 are stateless workers.

TRIAGE PRIORITY (check in order):
1. OOM: if any node mem > 0.92 → kubectl delete pod node-<i>
2. DB RECOVERY: if node-0 in failed_nodes → kubectl delete pod node-0
3. SPLIT-BRAIN: if io_wait > 0.80 → kubectl throttle ingress --rate=0.5
4. HOT-SHARD: if one worker cpu > 0.90, others low → reroute traffic
5. RETRY STORM: if p99 > 100ms and rr > 150 → kubectl throttle ingress --rate=0.4
6. ZOMBIE NODE: if worker cpu near 0 → reroute away from it
7. BLACK SWAN: if 2+ nodes failed (DB alive) → kubectl throttle ingress --rate=0.3
8. DB STRESS: if node-0 cpu > 0.80 → kubectl throttle ingress --rate=0.7
9. SAFE SCALE: if avg worker cpu > 0.75 and budget > 20 → scale up
10. HEALTHY → no_op

Output format:
<reasoning>One sentence identifying which rule applies.</reasoning>
<action>{"command": "kubectl ..."}</action>"""

# Example observation: no failed nodes, but p99 latency and request rate are both high.
obs = {
    "cpu_loads": [0.45, 0.82, 0.79, 0.88, 0.75, 0.81, 0.77, 0.73],
    "mem_utilizations": [0.41, 0.68, 0.71, 0.65, 0.62, 0.70, 0.66, 0.64],
    "queue_lengths": [12, 45, 41, 53, 38, 44, 40, 37],
    "failed_nodes": [],
    "latency_ms": 187.3,
    "p99_latency": 312.5,
    "request_rate": 1840.0,
    "io_wait": 0.12,
    "error_budget": 85,
    "step": 4,
    "task_hint": "System is under heavy traffic load."
}

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Current system state:\n{json.dumps(obs, indent=2)}\nWhat action should be taken?"}
]

# Render the chat template to text, then tokenize and move the tensors to the model's device.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample one completion; 1024 new tokens matches the training max completion length.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Example output (sampling is enabled, so the exact wording will vary):

<think>
P99 latency is 312ms with request rate 1840 rps and no failed nodes.
Rule 5 (retry storm): p99 > 100ms and rr > 150 → throttle at 0.4.
</think>
<reasoning>Rule 5 applies: p99 latency 312ms exceeds threshold with high request rate 1840 rps — throttle ingress to shed load.</reasoning>
<action>{"command": "kubectl throttle ingress --rate=0.4"}</action>
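To act on a completion like the one above, the command has to be pulled out of the `<action>` tag. The following is a minimal parsing sketch under the output format shown in the system prompt; the function name and the choice to return `None` on any malformed output are assumptions, not part of the released code.

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ACTION_RE = re.compile(r"<action>\s*(\{.*?\})\s*</action>", re.DOTALL)


def extract_command(completion: str):
    """Return the kubectl command string from a completion, or None if it is malformed."""
    visible = THINK_RE.sub("", completion)  # drop the closed thinking block, if any
    match = ACTION_RE.search(visible)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    command = payload.get("command") if isinstance(payload, dict) else None
    return command if isinstance(command, str) else None


sample = (
    "<think>\nRule 5 (retry storm) applies.\n</think>\n"
    "<reasoning>Rule 5 applies.</reasoning>\n"
    '<action>{"command": "kubectl throttle ingress --rate=0.4"}</action>'
)
print(extract_command(sample))
```

A completion truncated inside its `<think>` block contains no `<action>` tag, so it yields `None` here rather than a partial command.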

Training Details

Parameter	Value
Base model	`Qwen/Qwen3-8B` (BF16)
Method	GRPO (TRL 0.24.0 + Unsloth + vLLM 0.6.3)
LoRA rank	32, alpha=64, all projection layers
Trainable params	1.05% (349 MB adapter)
Training steps	300
Batch size	1 × 4 generations = 4 completions/step
Learning rate	5e-6, cosine schedule
Max completion length	1024 tokens
GPU	A100-SXM4-80GB
Wall-clock time	44 minutes

Citation

@misc{dime2026,
  title  = {Fine-Tuning Language Models as Autonomous SREs via GRPO: The DIME Benchmark},
  author = {Nithish Sriram and Naseer},
  year   = {2026},
  url    = {https://huggingface.co/Naseer-010/Qwen3-8B-Finetuned-DIME}
}

Trained at SRM AP · Hackathon 2026
