Qwen3-8B-Finetuned-DIME/README.md
ModelHub XC 8b5f7dfce0 Initialize project; model provided by the ModelHub XC community
Model: Naseer-010/Qwen3-8B-Finetuned-DIME
Source: Original Platform
2026-05-27 21:44:24 +08:00


license: apache-2.0
base_model: Qwen/Qwen3-8B
tags: reinforcement-learning, grpo, infrastructure-management, sre, kubernetes, lora, unsloth, trl
language: en
pipeline_tag: text-generation

Qwen3-8B — GRPO Fine-tuned on DIME

Qwen3-8B fine-tuned via Group Relative Policy Optimization (GRPO) to act as an autonomous Site-Reliability Engineer in a simulated 8-node Kubernetes cluster.


What is this model?

This is a merged BF16 checkpoint of Qwen/Qwen3-8B after 300 steps of GRPO fine-tuning on the DIME (Distributed Infrastructure Management Environment) benchmark — trained in 44 minutes on a single A100-SXM4-80GB.

The model observes per-node CPU, memory, queue-depth, and tail-latency telemetry, then outputs a single kubectl command to maintain cluster health. The GRPO run was restarted from scratch with a completely redesigned reward signal after the original reward function was found to produce zero-variance advantages that blocked all gradient flow.


Benchmark Results

+44.2% relative improvement over zero-shot Qwen3-8B on the 14-task DIME benchmark.

[Figure: Benchmark comparison]

| Metric | Zero-shot | Fine-tuned |
|---|---|---|
| Overall avg score | 0.394 | 0.569 (best episode) |
| connection_pool_deadlock | 0.630 | 0.976 |
| memory_leak_slow_burn | 0.990 | 0.990 |
| node_failure | 0.220 | 0.920 |
| retry_storm | 0.377 | 0.587 |
| thundering_herd | 0.393 | 0.606 |
| traffic_spike | 0.024 | 0.399 |
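
As a quick check on how the headline figure relates to the table, the sketch below computes the relative gain for each listed row. It uses only the rounded scores shown above, so the overall value comes out at roughly 44% and may differ from the reported +44.2% in the last digit. The fine-tuned overall score is the best-episode value, as the table notes. The names `scores` and `relative_gain` are illustrative and not part of the DIME code.

```python
# Scores copied from the table above (rounded to three decimals).
scores = {
    "overall": (0.394, 0.569),
    "connection_pool_deadlock": (0.630, 0.976),
    "memory_leak_slow_burn": (0.990, 0.990),
    "node_failure": (0.220, 0.920),
    "retry_storm": (0.377, 0.587),
    "thundering_herd": (0.393, 0.606),
    "traffic_spike": (0.024, 0.399),
}


def relative_gain(zero_shot: float, fine_tuned: float) -> float:
    """Relative improvement of fine_tuned over zero_shot, as a fraction."""
    return (fine_tuned - zero_shot) / zero_shot


gains = {task: relative_gain(z, f) for task, (z, f) in scores.items()}

for task, gain in gains.items():
    print(f"{task:26s} {gain:+.1%}")
```

Tasks with a very low zero-shot score, such as traffic_spike, produce very large relative gains, so the absolute difference is often the more informative number for those rows.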

The Reward Engineering Challenge

The central technical contribution is replacing a gradient-blocking reward cliff with a differentiable seven-component signal.

The original reward returned r = 1000 whenever the database node failed — which happened within the first 3 steps of most episodes. With all rewards identical, GRPO advantages collapsed to zero and no gradient flowed.
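
The collapse can be reproduced in a few lines. The sketch below assumes the common GRPO formulation in which each completion's advantage is its reward minus the group mean, divided by the group standard deviation plus a small epsilon. The epsilon value, the use of the population standard deviation, and the bounded reward values are assumptions for illustration, not taken from the training code.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: (r - mean) / (std + eps) within one prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations may differ
    return [(r - mean) / (std + eps) for r in rewards]


# Every completion in the group receives the same cliff value.
cliff_group = [1000.0, 1000.0, 1000.0, 1000.0]

# Hypothetical bounded rewards that differ between completions.
bounded_group = [-1.2, 0.4, 2.1, 0.9]

cliff_adv = group_advantages(cliff_group)
bounded_adv = group_advantages(bounded_group)

print(cliff_adv)    # all zeros: no completion is preferred over another
print(bounded_adv)  # mixed signs: some completions are reinforced, others discouraged
```

Because the policy-gradient term is weighted by the advantage, a group of identical rewards contributes nothing to the update regardless of how large the shared reward is.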

[Figure: Reward cliff vs bounded signal]

The fixed reward:

R_{env}(s,a) = \text{clip}(\, r_{topo} + r_{shed} + r_{mem} + r_{fric} + r_{lat} + r_{up} + r_{eff},\; -5,\; +5 \,)

with bounded components ensuring non-zero gradient everywhere.
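
A minimal sketch of the clipped sum follows. The component keys mirror the subscripts in the formula, but how each component is computed is not described here, so the numeric values are invented for illustration and `env_reward` is a hypothetical name.

```python
def env_reward(components: dict[str, float], lo: float = -5.0, hi: float = 5.0) -> float:
    """Sum the reward components and clip the total to [lo, hi]."""
    total = sum(components.values())
    return max(lo, min(hi, total))


# Hypothetical component values for one (state, action) pair.
typical = {
    "topo": 1.0, "shed": -0.5, "mem": 0.3, "fric": -0.1,
    "lat": -1.2, "up": 0.8, "eff": 0.2,
}

# Hypothetical catastrophic step: the total is clipped instead of falling off a cliff.
catastrophic = {
    "topo": -4.0, "shed": -1.0, "mem": -1.0, "fric": -0.5,
    "lat": -2.0, "up": -3.0, "eff": -0.5,
}

print(env_reward(typical))
print(env_reward(catastrophic))  # clipped to the lower bound
```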


DIME Environment

[Figure: DIME architecture]

  • 8-node cluster: node-0 is a stateful PostgreSQL DB (SPOF), nodes 1-7 are stateless workers
  • Partial observability: telemetry dropout when cpu_i = 1
  • 6 action types: restart, reroute, scale_up, throttle, query_logs, no_op
  • 14 failure scenarios: traffic spikes, node failures, memory leaks, retry storms, split-brain, and more
  • Error budget: throttle burns irreplaceable budget; the agent must be economical

Training Dynamics

[Figure: Training curves]

  • reward_format reached 3.0 (perfect) from step 1 — the model learned XML scaffold immediately
  • reward_validity stabilised at 1.9+ — no invalid commands after step ~10
  • reward_env improved steadily — environment physics signal dominated learning
  • clipped_ratio stayed near 0 throughout — healthy PPO clip utilisation
  • Total wall-clock: 44 minutes at 8.4 s/step

Triage Oracle and the Priority Inversion Bug

The model is guided during training by a 10-rule deterministic triage tree:

[Figure: Triage tree]

A critical discovery: the original oracle evaluated the Black Swan rule (|F(s)| ≥ 2 → throttle(0.3)) before the DB Recovery rule (0 ∈ F(s) → restart_node(0)). When multiple nodes including the database were failed, the oracle prescribed throttle instead of restart_node(0). The model faithfully learned this suboptimal policy. Fixing the priority ordering accounted for +0.044 benchmark score improvement.
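
The effect of the ordering can be shown with only the two rules involved. The sketch below is not the oracle's actual implementation: it omits the other eight rules, and the function name and the `db_recovery_first` flag are hypothetical. The conditions and actions follow the rule definitions quoted above.

```python
def triage(failed_nodes: set[int], db_recovery_first: bool = True) -> str:
    """Return the action for the two rules discussed above; all other rules are omitted."""
    db_down = 0 in failed_nodes          # DB Recovery condition: node 0 is in F(s)
    black_swan = len(failed_nodes) >= 2  # Black Swan condition: |F(s)| >= 2

    if db_recovery_first:
        # Corrected ordering: recover the database before shedding load.
        if db_down:
            return "restart_node(0)"
        if black_swan:
            return "throttle(0.3)"
    else:
        # Original (inverted) ordering: the Black Swan rule shadows DB Recovery.
        if black_swan:
            return "throttle(0.3)"
        if db_down:
            return "restart_node(0)"
    return "no_op"


# Database and one worker are both down.
print(triage({0, 3}, db_recovery_first=False))  # throttle(0.3)
print(triage({0, 3}, db_recovery_first=True))   # restart_node(0)
```

The two orderings disagree only when the database is among two or more failed nodes, which is exactly the state described above.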


Failure Modes Documented

Five training configurations failed before convergence — each diagnosable from standard TRL metrics:

[Figure: Failure modes]

| Run | Failure | Signal |
|---|---|---|
| vLLM on A100-40GB | SM 8.0 segfault (compilation_config not set) | Crash at init |
| batch=4, gen=8 | CPU-bound rewards; 126 s/step, GPU idle | samples/sec = 0.06 |
| max_comp=256 | <think> blocks truncated before </think> | frac_reward_zero_std = 1.0 |
| reward_env×2 | 10:1 env-to-triage ratio recreated zero-variance | zero_std → 1.0 at step 119 |
| oracle inverted | Learned throttle in DB-failure states | Low triage/mean |

Reward Components

[Figure: Reward components]

R(a,s) = R_fmt [-3,+3]  +  R_val [-2,+2]  +  R_env [-5,+5]  +  R_tri [-0.5,+1]
  • R_fmt: XML scaffold compliance (<reasoning> + <action> tags)
  • R_val: Syntactic kubectl parse success
  • R_env: 7-component physics reward (topology, latency, memory, uptime, budget)
  • R_tri: Oracle triage alignment (gentle guidance, not primary teacher)
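
The stated ranges bound the total reward, and they also show how the terms compare in scale. The sketch below uses only the bounds listed above. The doubling at the end is an illustration of how scaling R_env changes its weight relative to R_tri, not a reconstruction of any particular run's configuration.

```python
# Bounds copied from the formula above: (min, max) per component.
bounds = {
    "R_fmt": (-3.0, 3.0),
    "R_val": (-2.0, 2.0),
    "R_env": (-5.0, 5.0),
    "R_tri": (-0.5, 1.0),
}

total_min = sum(lo for lo, _ in bounds.values())
total_max = sum(hi for _, hi in bounds.values())
print(total_min, total_max)  # -10.5 11.0

# Ratio of maximum magnitudes between the environment and triage terms.
env_to_tri = bounds["R_env"][1] / bounds["R_tri"][1]
print(env_to_tri)            # 5.0

# Illustration: doubling the environment term doubles that ratio.
print(2 * env_to_tri)        # 10.0
```

With these bounds the triage term is at most a fifth of the environment term, which fits its description as gentle guidance.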

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Naseer-010/Qwen3-8B-Finetuned-DIME",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Naseer-010/Qwen3-8B-Finetuned-DIME")

system_prompt = """You are an autonomous SRE agent managing an 8-node Kubernetes cluster.
Node-0 is the PostgreSQL database (SPOF). Nodes 1-7 are stateless workers.

TRIAGE PRIORITY (check in order):
1. OOM: if any node mem > 0.92 → kubectl delete pod node-<i>
2. DB RECOVERY: if node-0 in failed_nodes → kubectl delete pod node-0
3. SPLIT-BRAIN: if io_wait > 0.80 → kubectl throttle ingress --rate=0.5
4. HOT-SHARD: if one worker cpu > 0.90, others low → reroute traffic
5. RETRY STORM: if p99 > 100ms and rr > 150 → kubectl throttle ingress --rate=0.4
6. ZOMBIE NODE: if worker cpu near 0 → reroute away from it
7. BLACK SWAN: if 2+ nodes failed (DB alive) → kubectl throttle ingress --rate=0.3
8. DB STRESS: if node-0 cpu > 0.80 → kubectl throttle ingress --rate=0.7
9. SAFE SCALE: if avg worker cpu > 0.75 and budget > 20 → scale up
10. HEALTHY → no_op

Output format:
<reasoning>One sentence identifying which rule applies.</reasoning>
<action>{"command": "kubectl ..."}</action>"""

obs = {
    "cpu_loads": [0.45, 0.82, 0.79, 0.88, 0.75, 0.81, 0.77, 0.73],
    "mem_utilizations": [0.41, 0.68, 0.71, 0.65, 0.62, 0.70, 0.66, 0.64],
    "queue_lengths": [12, 45, 41, 53, 38, 44, 40, 37],
    "failed_nodes": [],
    "latency_ms": 187.3,
    "p99_latency": 312.5,
    "request_rate": 1840.0,
    "io_wait": 0.12,
    "error_budget": 85,
    "step": 4,
    "task_hint": "System is under heavy traffic load."
}

import json
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Current system state:\n{json.dumps(obs, indent=2)}\nWhat action should be taken?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Example output (sampling is enabled, so the exact wording will vary between runs):

<think>
P99 latency is 312ms with request rate 1840 rps and no failed nodes.
Rule 5 (retry storm): p99 > 100ms and rr > 150 → throttle at 0.4.
</think>
<reasoning>Rule 5 applies: p99 latency 312ms exceeds threshold with high request rate 1840 rps — throttle ingress to shed load.</reasoning>
<action>{"command": "kubectl throttle ingress --rate=0.4"}</action>
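
To act on a completion, the command has to be pulled out of the <action> block. The helper below is one possible sketch, assuming the output follows the format requested in the system prompt. `extract_command` is a hypothetical name, not part of the model repository, and it returns None rather than raising when the block is missing or malformed.

```python
import json
import re


def extract_command(completion: str):
    """Return the command string from the <action> block, or None if missing or malformed."""
    # Drop the optional <think>...</think> block so tags quoted inside it are ignored.
    visible = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    match = re.search(r"<action>(.*?)</action>", visible, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    command = payload.get("command") if isinstance(payload, dict) else None
    return command if isinstance(command, str) else None


sample = (
    "<think>\nRule 5 applies.\n</think>\n"
    "<reasoning>Rule 5 applies.</reasoning>\n"
    '<action>{"command": "kubectl throttle ingress --rate=0.4"}</action>'
)
print(extract_command(sample))  # kubectl throttle ingress --rate=0.4
```

Returning None for malformed output lets the caller decide how to recover, for example by re-sampling or falling back to no_op.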

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B (BF16) |
| Method | GRPO (TRL 0.24.0 + Unsloth + vLLM 0.6.3) |
| LoRA | rank 32, alpha=64, all projection layers |
| Trainable params | 1.05% (349 MB adapter) |
| Training steps | 300 |
| Batch size | 1 × 4 generations = 4 completions/step |
| Learning rate | 5e-6, cosine schedule |
| Max completion length | 1024 tokens |
| GPU | A100-SXM4-80GB |
| Wall-clock time | 44 minutes |

Citation

@misc{dime2026,
  title  = {Fine-Tuning Language Models as Autonomous SREs via GRPO: The DIME Benchmark},
  author = {Nithish Sriram and Naseer},
  year   = {2026},
  url    = {https://huggingface.co/Naseer-010/Qwen3-8B-Finetuned-DIME}
}

Trained at SRM AP · Hackathon 2026