---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- reinforcement-learning
- grpo
- infrastructure-management
- sre
- kubernetes
- lora
- unsloth
- trl
language:
- en
pipeline_tag: text-generation
---
# Qwen3-8B — GRPO Fine-tuned on DIME
**Qwen3-8B fine-tuned via Group Relative Policy Optimization (GRPO) to act as an autonomous Site-Reliability Engineer in a simulated 8-node Kubernetes cluster.**
---
## What is this model?
This is a **merged BF16 checkpoint** of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) after 300 steps of GRPO fine-tuning on the DIME (Distributed Infrastructure Management Environment) benchmark — trained in **44 minutes on a single A100-SXM4-80GB**.
The model observes per-node CPU, memory, queue-depth, and tail-latency telemetry, then outputs a single `kubectl` command to maintain cluster health. The fine-tuning run was restarted from scratch with a completely redesigned reward signal after the original reward function was found to produce zero-variance advantages that blocked all gradient flow.
---
## Benchmark Results
**+44.2% relative improvement over zero-shot Qwen3-8B** on the 14-task DIME benchmark.
![Benchmark comparison](fig3_benchmark.png)
| Metric | Zero-shot | Fine-tuned |
|---|---|---|
| Overall avg score | 0.394 | **0.569** (best episode) |
| connection_pool_deadlock | 0.630 | **0.976** |
| memory_leak_slow_burn | 0.990 | **0.990** |
| node_failure | 0.220 | **0.920** |
| retry_storm | 0.377 | **0.587** |
| thundering_herd | 0.393 | **0.606** |
| traffic_spike | 0.024 | **0.399** |
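
The per-task gains in the table can be recomputed with a few lines of Python. This is only a convenience sketch over the six rows shown above (the benchmark has 14 tasks, so these rows alone do not reproduce the overall average); the helper name `gains` is illustrative.

```python
# Scores copied from the table above: (zero-shot, fine-tuned).
scores = {
    "connection_pool_deadlock": (0.630, 0.976),
    "memory_leak_slow_burn": (0.990, 0.990),
    "node_failure": (0.220, 0.920),
    "retry_storm": (0.377, 0.587),
    "thundering_herd": (0.393, 0.606),
    "traffic_spike": (0.024, 0.399),
}


def gains(zero_shot: float, tuned: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent)."""
    absolute = tuned - zero_shot
    return round(absolute, 3), round(100.0 * absolute / zero_shot, 1)


per_task = {task: gains(*pair) for task, pair in scores.items()}

# Print tasks from largest to smallest absolute gain.
for task, (absolute, relative) in sorted(per_task.items(), key=lambda kv: -kv[1][0]):
    print(f"{task:26s} +{absolute:.3f}  ({relative:+.1f}%)")
```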
---
## The Reward Engineering Challenge
The central technical contribution is replacing a **gradient-blocking reward cliff** with a differentiable seven-component signal.
The original reward returned `r = 1000` whenever the database node failed — which happened within the first 3 steps of most episodes. With all rewards identical, GRPO advantages collapsed to zero and no gradient flowed.
![Reward cliff vs bounded signal](fig1_reward_cliff.png)
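
To see why identical rewards stall learning, here is a minimal sketch of group-relative advantage normalisation, one common way to write the GRPO advantage. The `eps` value and the use of the population standard deviation are assumptions for illustration, not details taken from the training code.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalise each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion in the group hits the same cliff value: all advantages are 0.
cliff = group_advantages([1000.0, 1000.0, 1000.0, 1000.0])

# Rewards that differ within the group give non-zero advantages.
bounded = group_advantages([-1.5, 0.5, 2.0, 3.0])

print(cliff)
print(bounded)
```

With every advantage equal to zero, the policy-gradient term vanishes for the whole group, which matches the stalled training described above.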
The fixed reward:
$$R_{env}(s,a) = \text{clip}(\, r_{topo} + r_{shed} + r_{mem} + r_{fric} + r_{lat} + r_{up} + r_{eff},\; -5,\; +5 \,)$$
with bounded components ensuring non-zero gradient everywhere.
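
A minimal sketch of the clipped sum in the formula above. The component names follow the subscripts in the equation; the example values are made up for illustration and are not taken from the environment.

```python
COMPONENTS = ("topo", "shed", "mem", "fric", "lat", "up", "eff")


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def env_reward(parts: dict[str, float]) -> float:
    """Sum the seven components and clip the total to [-5, +5]."""
    missing = set(COMPONENTS) - parts.keys()
    if missing:
        raise ValueError(f"missing reward components: {sorted(missing)}")
    return clip(sum(parts[name] for name in COMPONENTS), -5.0, 5.0)


# Hypothetical component values for a healthy step and for a severe outage.
healthy = {"topo": 1.0, "shed": 0.5, "mem": 0.5, "fric": -0.25,
           "lat": 0.75, "up": 1.0, "eff": 0.5}
outage = {"topo": -4.0, "shed": -1.0, "mem": -0.5, "fric": -0.5,
          "lat": -2.0, "up": -2.0, "eff": 0.0}

print(env_reward(healthy))  # sum is inside the range, returned unchanged
print(env_reward(outage))   # sum is below -5, so it is clipped
```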
---
## DIME Environment
![DIME architecture](fig4_dime_architecture.png)
- **8-node cluster**: node-0 is a stateful PostgreSQL DB (SPOF), nodes 1-7 are stateless workers
- **Partial observability**: telemetry dropout when `cpu_i = 1`
- **6 action types**: restart, reroute, scale_up, throttle, query_logs, no_op
- **14 failure scenarios**: traffic spikes, node failures, memory leaks, retry storms, split-brain, and more
- **Error budget**: throttle burns irreplaceable budget; the agent must be economical
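
The action space and the dropout rule listed above could be modelled roughly as follows. This is a sketch under stated assumptions: the enum values mirror the six action names in the list, but which telemetry fields are dropped for a saturated node is not specified here, so hiding the queue length is purely illustrative, as is the function name `observe_queues`.

```python
from enum import Enum
from typing import Optional


class ActionType(Enum):
    """The six action types named in the environment description."""
    RESTART = "restart"
    REROUTE = "reroute"
    SCALE_UP = "scale_up"
    THROTTLE = "throttle"
    QUERY_LOGS = "query_logs"
    NO_OP = "no_op"


def observe_queues(cpu_loads: list[float],
                   queue_lengths: list[int]) -> list[Optional[int]]:
    """Return queue lengths, with None for nodes whose CPU is saturated.

    Assumption: dropout hides the queue length of a node when cpu_i = 1.
    """
    return [None if cpu >= 1.0 else queue
            for cpu, queue in zip(cpu_loads, queue_lengths)]


print(observe_queues([0.45, 1.0, 0.79], [12, 45, 41]))
```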
---
## Training Dynamics
![Training curves](fig2_training_curves.png)
- `reward_format` reached 3.0 (perfect) from step 1: the model learned the XML scaffold immediately
- `reward_validity` stabilised at 1.9+ — no invalid commands after step ~10
- `reward_env` improved steadily — environment physics signal dominated learning
- `clipped_ratio` stayed near 0 throughout — healthy PPO clip utilisation
- Total wall-clock: **44 minutes** at 8.4 s/step
---
## Triage Oracle and the Priority Inversion Bug
The model is guided during training by a 10-rule deterministic triage tree:
![Triage tree](fig7_triage_tree.png)
A critical discovery: the original oracle evaluated the **Black Swan rule** (`|F(s)| ≥ 2 → throttle(0.3)`) *before* the **DB Recovery rule** (`0 ∈ F(s) → restart_node(0)`). When multiple nodes including the database had failed, the oracle prescribed `throttle` instead of `restart_node(0)`. The model faithfully learned this suboptimal policy. Fixing the priority ordering accounted for a **+0.044** improvement in benchmark score.
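
The inversion is easy to reproduce with the two rules in isolation. The sketch below covers only these two rules out of the ten, and the function names and string return values are illustrative rather than the oracle's actual interface.

```python
def oracle_buggy(failed: set[int]) -> str:
    """Original ordering: Black Swan is checked before DB Recovery."""
    if len(failed) >= 2:
        return "throttle(0.3)"
    if 0 in failed:
        return "restart_node(0)"
    return "no_op"


def oracle_fixed(failed: set[int]) -> str:
    """Corrected ordering: DB Recovery takes priority over Black Swan."""
    if 0 in failed:
        return "restart_node(0)"
    if len(failed) >= 2:
        return "throttle(0.3)"
    return "no_op"


# Node 0 (the database) and one worker are both down.
failed = {0, 3}
print(oracle_buggy(failed))  # throttle(0.3): the database is never restarted
print(oracle_fixed(failed))  # restart_node(0)
```

The two orderings only disagree when the database is among several failed nodes, which is why the bug surfaces in multi-failure scenarios rather than in simple ones.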
---
## Failure Modes Documented
Five training configurations failed before convergence — each diagnosable from standard TRL metrics:
![Failure modes](fig6_failure_modes.png)
| Run | Failure | Signal |
|---|---|---|
| vLLM on A100-40GB | SM 8.0 segfault (`compilation_config` not set) | Crash at init |
| batch=4, gen=8 | CPU-bound rewards; 126 s/step, GPU idle | `samples/sec = 0.06` |
| max_comp=256 | `<think>` blocks truncated before `</think>` | `frac_reward_zero_std = 1.0` |
| reward_env×2 | 10:1 env-to-triage ratio recreated zero-variance | `zero_std → 1.0` at step 119 |
| oracle inverted | Learned `throttle` in DB-failure states | Low `triage/mean` |
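
Two of the rows above are diagnosed through the fraction of generation groups whose rewards have zero spread. The sketch below approximates what such a metric measures; it is not TRL's implementation, and the tolerance `tol` is an assumption.

```python
from statistics import pstdev


def frac_reward_zero_std(groups: list[list[float]], tol: float = 1e-8) -> float:
    """Fraction of generation groups whose rewards are all (nearly) equal."""
    if not groups:
        raise ValueError("need at least one group of rewards")
    flat = sum(1 for rewards in groups if pstdev(rewards) <= tol)
    return flat / len(groups)


# One group where every completion scored the same, one with real spread.
mixed = frac_reward_zero_std([[2.0, 2.0, 2.0, 2.0], [0.5, 1.0, 2.5, 3.0]])

# Every group is flat, e.g. all completions truncated before a valid action.
stalled = frac_reward_zero_std([[-3.0] * 4, [-3.0] * 4])

print(mixed, stalled)
```

A value of 1.0 means no group contributes a non-zero advantage, so the step carries no learning signal.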
---
## Reward Components
![Reward components](fig5_reward_components.png)
```
R(s,a) = R_fmt [-3,+3] + R_val [-2,+2] + R_env [-5,+5] + R_tri [-0.5,+1]
```
- **R_fmt**: XML scaffold compliance (`<reasoning>` + `<action>` tags)
- **R_val**: Syntactic kubectl parse success
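- **R_env**: 7-component physics reward (the `r_topo`, `r_shed`, `r_mem`, `r_fric`, `r_lat`, `r_up`, and `r_eff` terms of the clipped sum, covering topology, latency, memory, uptime, and budget)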
- **R_tri**: Oracle triage alignment (gentle guidance, not primary teacher)
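
The ranges above fix the bounds of the total reward, which the following sketch works out. The range table is copied from the formula; the validation helper and the example values are illustrative.

```python
# Component ranges as stated in the reward formula.
RANGES = {
    "fmt": (-3.0, 3.0),
    "val": (-2.0, 2.0),
    "env": (-5.0, 5.0),
    "tri": (-0.5, 1.0),
}


def total_reward(parts: dict[str, float]) -> float:
    """Sum the four components after checking each is inside its range."""
    for name, (lo, hi) in RANGES.items():
        if not lo <= parts[name] <= hi:
            raise ValueError(f"R_{name}={parts[name]} outside [{lo}, {hi}]")
    return sum(parts[name] for name in RANGES)


lowest = sum(lo for lo, _ in RANGES.values())
highest = sum(hi for _, hi in RANGES.values())
print(lowest, highest)

# Hypothetical step: perfect format and validity, moderate env reward,
# full agreement with the triage oracle.
print(total_reward({"fmt": 3.0, "val": 2.0, "env": 1.5, "tri": 1.0}))
```

The total therefore lies in [-10.5, +11], with the environment term contributing the widest range of the four components.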
---
## Quick Start
```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged BF16 checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "Naseer-010/Qwen3-8B-Finetuned-DIME",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Naseer-010/Qwen3-8B-Finetuned-DIME")

# System prompt: cluster description, triage priority list, and output format.
system_prompt = """You are an autonomous SRE agent managing an 8-node Kubernetes cluster.
Node-0 is the PostgreSQL database (SPOF). Nodes 1-7 are stateless workers.
TRIAGE PRIORITY (check in order):
1. OOM: if any node mem > 0.92 → kubectl delete pod node-<i>
2. DB RECOVERY: if node-0 in failed_nodes → kubectl delete pod node-0
3. SPLIT-BRAIN: if io_wait > 0.80 → kubectl throttle ingress --rate=0.5
4. HOT-SHARD: if one worker cpu > 0.90, others low → reroute traffic
5. RETRY STORM: if p99 > 100ms and rr > 150 → kubectl throttle ingress --rate=0.4
6. ZOMBIE NODE: if worker cpu near 0 → reroute away from it
7. BLACK SWAN: if 2+ nodes failed (DB alive) → kubectl throttle ingress --rate=0.3
8. DB STRESS: if node-0 cpu > 0.80 → kubectl throttle ingress --rate=0.7
9. SAFE SCALE: if avg worker cpu > 0.75 and budget > 20 → scale up
10. HEALTHY → no_op
Output format:
<reasoning>One sentence identifying which rule applies.</reasoning>
<action>{"command": "kubectl ..."}</action>"""

# Example observation: one entry per node (index 0 is the database).
obs = {
    "cpu_loads": [0.45, 0.82, 0.79, 0.88, 0.75, 0.81, 0.77, 0.73],
    "mem_utilizations": [0.41, 0.68, 0.71, 0.65, 0.62, 0.70, 0.66, 0.64],
    "queue_lengths": [12, 45, 41, 53, 38, 44, 40, 37],
    "failed_nodes": [],
    "latency_ms": 187.3,
    "p99_latency": 312.5,
    "request_rate": 1840.0,
    "io_wait": 0.12,
    "error_budget": 85,
    "step": 4,
    "task_hint": "System is under heavy traffic load.",
}

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": f"Current system state:\n{json.dumps(obs, indent=2)}\nWhat action should be taken?",
    },
]

# Render the chat template and tokenize the prompt.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Example output (sampling is enabled, so the exact wording will vary):
```
<think>
P99 latency is 312ms with request rate 1840 rps and no failed nodes.
Rule 5 (retry storm): p99 > 100ms and rr > 150 → throttle at 0.4.
</think>
<reasoning>Rule 5 applies: p99 latency 312ms exceeds threshold with high request rate 1840 rps — throttle ingress to shed load.</reasoning>
<action>{"command": "kubectl throttle ingress --rate=0.4"}</action>
```
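
To act on a completion like the one above, the command has to be pulled out of the `<action>` tag. The following is a minimal sketch using only the standard library; the function name `extract_command` and the choice to return `None` on any malformed output are assumptions, not part of the released code.

```python
import json
import re
from typing import Optional

# Capture the JSON object between <action> and </action>.
ACTION_RE = re.compile(r"<action>\s*(\{.*?\})\s*</action>", re.DOTALL)


def extract_command(completion: str) -> Optional[str]:
    """Return the "command" string from the <action> block, or None."""
    match = ACTION_RE.search(completion)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    command = payload.get("command") if isinstance(payload, dict) else None
    return command if isinstance(command, str) else None


sample = (
    "<reasoning>Rule 5 applies.</reasoning>\n"
    '<action>{"command": "kubectl throttle ingress --rate=0.4"}</action>'
)
print(extract_command(sample))
print(extract_command("<reasoning>Output cut off before the action"))
```

Returning `None` rather than raising lets the caller fall back to a safe default such as `no_op` when a completion is truncated or malformed.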
---
## Training Details
| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` (BF16) |
| Method | GRPO (TRL 0.24.0 + Unsloth + vLLM 0.6.3) |
| LoRA rank | 32, alpha=64, all projection layers |
| Trainable params | 1.05% (349 MB adapter) |
| Training steps | 300 |
| Batch size | 1 × 4 generations = 4 completions/step |
| Learning rate | 5e-6, cosine schedule |
| Max completion length | 1024 tokens |
| GPU | A100-SXM4-80GB |
| Wall-clock time | 44 minutes |
---
## Citation
```bibtex
@misc{dime2026,
title = {Fine-Tuning Language Models as Autonomous SREs via GRPO: The DIME Benchmark},
author = {Nithish Sriram and Naseer},
year = {2026},
url = {https://huggingface.co/Naseer-010/Qwen3-8B-Finetuned-DIME}
}
```
---
*Trained at SRM AP · Hackathon 2026*