---
base_model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
---

Title: Teaching an LLM to Triage Disasters: Building an RL Environment with OpenEnv

Tags: openenv reinforcement-learning disaster-response grpo qwen

# Teaching an LLM to Triage Disasters 🚨

## The Problem

During a natural disaster, Emergency Operations Centers (EOCs) are overwhelmed by thousands of frantic incident reports. A flooded neighborhood, a chemical plant fire, a hospital wing collapse — all arriving simultaneously. Human coordinators must instantly decide: which team? What priority? What action?

We built an AI agent that does exactly this.

---

## The Environment

We built **Disaster Response Coordination OpenEnv** — a multi-step RL environment where an AI agent acts as an Emergency Incident Commander.

**15 real-world scenarios** across 3 difficulty tiers, modeled after actual disasters:

- 🌊 2018 Kerala Floods → dam spillway overflow, communication blackouts
- ☠️ 2020 Vizag Gas Leak → chemical plant fire, toxic plume evacuation
- ⚡ 2012 North India Grid Failure → cold-chain medicine failures, hospital blackouts

### Action Space

For every incident ticket, the agent must complete a 4-step workflow:

`classify → set_priority → draft_reply → submit_ticket`
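
As a sketch, the 4-step workflow can be exercised against a minimal mock environment. The class and method names below are illustrative only, not the actual OpenEnv interface:

```python
# Illustrative mock of the 4-step ticket workflow; rewards the real env
# computes per step are omitted. Names are hypothetical, not the OpenEnv API.
class MockTicketEnv:
    """Accepts the four actions only in their required order."""
    STEPS = ["classify", "set_priority", "draft_reply", "submit_ticket"]

    def __init__(self):
        self.step_idx = 0
        self.actions = {}

    def step(self, action, value=None):
        expected = self.STEPS[self.step_idx]
        if action != expected:
            # Out-of-order action: the real env would score this poorly.
            return {"ok": False, "error": f"expected {expected!r}"}
        self.actions[action] = value
        self.step_idx += 1
        return {"ok": True, "done": self.step_idx == len(self.STEPS)}

env = MockTicketEnv()
results = [
    env.step("classify", "rescue"),
    env.step("set_priority", "urgent"),
    env.step("draft_reply", "Dispatching boats to Ward 4."),
    env.step("submit_ticket"),
]
print(results[-1])  # final step reports done=True
```

The ordering constraint is what makes the episode multi-step: a mistake early in the workflow degrades everything downstream.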

### Reward Function

`reward = 0.40 × team_routing + 0.30 × priority + 0.30 × reply_quality`

Dense, partial rewards at every step. No sparse end-of-episode signals.
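
The weighted sum above can be written as a plain function. One assumption here: each component score lies in [0, 1], which the formula's weights suggest but the post does not state explicitly:

```python
def ticket_reward(team_routing: float, priority: float, reply_quality: float) -> float:
    """Weighted reward from the three component scores, each assumed in [0, 1]."""
    for name, score in [("team_routing", team_routing),
                        ("priority", priority),
                        ("reply_quality", reply_quality)]:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {score}")
    return 0.40 * team_routing + 0.30 * priority + 0.30 * reply_quality

# Perfect routing and priority, mediocre reply text:
print(round(ticket_reward(1.0, 1.0, 0.5), 3))  # 0.85
```

Because routing carries the largest weight, sending the wrong team caps the reward at 0.60 even with a perfect priority and reply.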

### Difficulty Scaling

| Tier | Budget | Scenarios |
|------|--------|-----------|
| 🟢 Easy | 40 | Single-team, clear incidents |
| 🟡 Medium | 48 | Multi-agency, ambiguous |
| 🔴 Hard | 55 | Cascading mass-casualty + time pressure |

---

## Training with GRPO

We trained **Qwen2.5-7B-Instruct** using GRPO (Group Relative Policy Optimization) via TRL + Unsloth on a Google Colab T4 GPU.

**Setup:**

- Base model: `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- Algorithm: GRPOTrainer (TRL)
- LoRA: r=16, 4-bit quantization
- Epochs: 3 | Steps: 14
- Reward: Live environment feedback via HF Space API
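
A minimal sketch of how a setup like this is typically wired together with TRL's `GRPOTrainer`. This is not the project's actual training script: the dataset, learning rate, and reward function are placeholders, and the real run scored completions against the live environment instead of returning zeros.

```python
# Sketch only — placeholder dataset and reward; not the project's real script.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny stand-in dataset; the real prompts came from the environment's scenarios.
dataset = Dataset.from_dict({"prompt": ["Flooded ward, families on rooftops."]})

def reward_fn(completions, **kwargs):
    # Placeholder: the real reward queried the live HF Space per completion.
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="grpo-disaster-response",
    num_train_epochs=3,   # matches the post's "Epochs: 3"
    max_steps=14,         # matches the post's "Steps: 14"
    learning_rate=2e-5,   # assumed; not stated in the post
)

trainer = GRPOTrainer(
    model="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    reward_funcs=reward_fn,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```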

The reward function connected directly to our live HF Space — every training step sent real incident prompts to the environment and received real rewards back.
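
Env-in-the-loop scoring of that kind might look like the sketch below. The endpoint path and JSON payload shape are assumptions for illustration, not the Space's documented API:

```python
# Illustrative sketch of scoring a completion against a remote environment.
# The endpoint path and JSON shape are assumptions, not the Space's real API.
import json
from urllib import request

def score_via_env(prompt, completion, endpoint, urlopen=request.urlopen):
    """POST a (prompt, completion) pair and return the env's scalar reward."""
    payload = json.dumps({"prompt": prompt, "completion": completion}).encode("utf-8")
    req = request.Request(endpoint, data=payload,
                          headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return float(json.loads(resp.read())["reward"])
```

Injecting `urlopen` keeps the function testable without a live server; in training, each GRPO group of completions would be scored through a call like this.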

### Training Reward Curve

![Training Reward Curve](reward_curve.png)

---

## What We Discovered: Sparse Reward Collapse

The untrained base model immediately revealed why this environment is hard.

**Before training**, the model hallucinated invalid outputs:

- `team: "emergency_services"` ❌ (not a valid team)
- `team: "utility repair"` ❌
- `priority: "very-high"` ❌ (not a valid priority)
- `priority: "higher"` ❌

**After training**, the model learned valid action spaces:

- `team: "rescue"` ✅
- `priority: "urgent"` ✅
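
Validity here is just set membership. In the sketch below, only `"rescue"` and `"urgent"` are confirmed by the examples above; the other vocabulary entries are hypothetical stand-ins:

```python
# Hypothetical action-space check. Only "rescue" and "urgent" appear in the
# post; the remaining vocabulary entries are illustrative stand-ins.
VALID_TEAMS = {"rescue", "medical", "fire", "utilities"}
VALID_PRIORITIES = {"low", "medium", "high", "urgent"}

def is_valid_action(team: str, priority: str) -> bool:
    """True only when both fields come from the environment's vocabularies."""
    return team in VALID_TEAMS and priority in VALID_PRIORITIES

print(is_valid_action("rescue", "urgent"))              # True
print(is_valid_action("emergency_services", "urgent"))  # False: hallucinated team
```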

However, we observed **sparse reward collapse** — a known RL failure mode in which a small model (7B at 4-bit) struggles to optimize a multi-step workflow with interdependent rewards. In a sense this validates the environment: it is genuinely difficult enough to expose real RL failure modes, and overcoming them would likely require larger models or longer training runs.

---

## Baseline Results

| Agent | Easy | Medium | Hard | **Avg** |
|-------|------|--------|------|---------|
| Heuristic Baseline | 0.704 | 0.683 | 0.660 | **0.682** |
| GRPO Qwen2.5-7B | — | — | — | research ongoing |

The heuristic baseline **passed** all 3 difficulty tiers (score ≥ 0.6).

---

## The Dashboard

We built a military-style tactical command dashboard with:

- 🗺️ Live OpenStreetMap incident markers with radar pulse animations
- ⚡ ARIA — AI Incident Analyst (Gemini-powered, analyses any incident live)
- 📊 Real-time score tracking, threat level bar, team routing
- 🔔 Operations feed with meaningful event notifications

![Dashboard](dashboard.png)

---

## Links

| Resource | URL |
|----------|-----|
| 🚀 HF Space (Live Environment) | [joynnayvedya/disaster-response-openenv](https://huggingface.co/spaces/joynnayvedya/disaster-response-openenv) |
| 🧠 Trained Model | [joynnayvedya/disaster-response-trained](https://huggingface.co/joynnayvedya/disaster-response-trained) |
| 💻 GitHub | [letsjoyn/meta-scalar-hack](https://github.com/letsjoyn/meta-scalar-hack) |

---

## Try It Yourself

```bash
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .
python inference.py
```

---

*Built for the 2026 Meta & Scalar AI Hackathon — Grand Finale, Bangalore.*

# Uploaded finetuned model

- **Developed by:** joynnayvedya
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Qwen2.5-7B-Instruct-bnb-4bit

This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)