---
base_model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
---
# Teaching an LLM to Triage Disasters 🚨
### How we built a real RL environment for emergency response, and what we learned when the model hallucinated an entire rescue team

*Built for the 2026 Meta & Scalar AI Hackathon, Bangalore.*

---

## 🎬 Demo Video

> **[▶️ Watch the live demo on YouTube](YOUR_YOUTUBE_LINK_HERE)** (2 minutes, fast-forwarded). Watch the agent triage 15 simultaneous disaster incidents in real time on the live command center dashboard.

---

## It started with a question nobody was asking

What if an LLM had to make the same decisions as the person who picks up the phone during a catastrophe?

Not "write me a poem." Not "solve this math problem."

**"The dam is overflowing. 300 people are on rooftops. You have one helicopter. What do you do?"**

That's the problem we built for.

---

## 🏗️ Architecture

*The agent runs locally and sends actions to the deployed HF Space OpenEnv server; the live dashboard updates in real time via WebSocket.*

The agent is fully decoupled from the environment. It sees only what a real Emergency Operations Center (EOC) coordinator would see: a ticket queue, a resource budget, and a ticking clock.

We built **Disaster Response Coordination OpenEnv**, an RL environment where an AI agent acts as an Emergency Incident Commander inside a live Emergency Operations Center.

The agent receives a queue of incident tickets, each modeled after a real event:

- 🌊 **2018 Kerala Floods**: 483 dead, the largest evacuation since Indian Independence. Dam spillway overflow. Communication blackouts. We recreated the decision tree EOC coordinators faced.
- ☠️ **2020 Vizag LG Polymers Gas Leak**: 11 dead, 1000+ hospitalized. A toxic plume drifting over residential areas. Do you evacuate north or south? Wind direction matters.
- ⚡ **2012 North India Grid Failure**: 620 million people without power. Cold-chain medicines failing in hospitals across 7 states. Which hospital gets the generator truck first?

Every ticket the agent sees is based on a real event. Every decision has real stakes baked into the reward function.

For each incident ticket, the agent must execute a precise 4-step workflow:

```
classify → set_priority → draft_reply → submit_ticket
```
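
The ordering above can be checked mechanically. The sketch below is an illustration only: the `WORKFLOW` tuple mirrors the four action names in the workflow, while `first_violation` and `is_complete` are hypothetical helpers, not the environment's actual API.

```python
from typing import Optional, Sequence

# The four actions, in the order the workflow above requires.
WORKFLOW = ("classify", "set_priority", "draft_reply", "submit_ticket")


def first_violation(actions: Sequence[str]) -> Optional[int]:
    """Return the index of the first out-of-order action, or None if none is found."""
    for i, action in enumerate(actions):
        # Any action past the fourth, or one that differs from the expected
        # step at this position, breaks the workflow.
        if i >= len(WORKFLOW) or action != WORKFLOW[i]:
            return i
    return None


def is_complete(actions: Sequence[str]) -> bool:
    """True only when all four steps were taken, in order, with nothing extra."""
    return tuple(actions) == WORKFLOW
```
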

Miss a step? Penalty. Wrong team? Partial credit. Right team, wrong priority? You still lose something. **There is no lucky guess that beats the system.**

---

## The Reward Function: Built to Be Unhackable

Most RL environments get reward-hacked in under 100 steps. We designed around that from day one.

```
ticket_score = 0.40 × team_routing
             + 0.30 × priority_score
             + 0.30 × reply_quality

task_score = avg(ticket_scores)
           - invalid_action_penalty   (max 0.15)
           - loop_detection_penalty   (max 0.10)
           - reroute_penalty          (max 0.12)
           - budget_overflow_penalty  (max 0.18)
           - time_pressure_multiplier (Hard mode: 0.75×)
```
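
One way to read the formula above is the following sketch. The weights and penalty caps are copied from the formula; everything else is an assumption made for illustration: that component scores lie in [0, 1], that each penalty is clipped at its listed maximum, that the Hard-mode 0.75× factor multiplies the penalized score (the formula lists it with a minus sign, so the real environment may apply it differently), and that the result is clamped to [0, 1]. The function names are hypothetical.

```python
from statistics import mean

# Weights and penalty caps copied from the formula above.
WEIGHTS = {"team_routing": 0.40, "priority_score": 0.30, "reply_quality": 0.30}
PENALTY_CAPS = {
    "invalid_action": 0.15,
    "loop_detection": 0.10,
    "reroute": 0.12,
    "budget_overflow": 0.18,
}
HARD_MODE_MULTIPLIER = 0.75


def ticket_score(components):
    """Weighted sum of the three per-ticket components (each assumed in [0, 1])."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def task_score(tickets, penalties, hard_mode=False):
    """Average ticket score minus capped penalties.

    ASSUMPTION: the Hard-mode multiplier scales the penalized score and the
    result is clamped to [0, 1]; the real environment may order these differently.
    """
    base = mean(ticket_score(t) for t in tickets)
    # Clip each penalty at its cap so no single failure mode dominates.
    total_penalty = sum(
        min(penalties.get(name, 0.0), cap) for name, cap in PENALTY_CAPS.items()
    )
    score = base - total_penalty
    if hard_mode:
        score *= HARD_MODE_MULTIPLIER
    return max(0.0, min(1.0, score))
```

For a ticket with the right team (1.0), half credit on priority (0.5), and a reply scored 0.8, `ticket_score` gives about 0.79. A reroute penalty larger than its cap would then subtract at most 0.12.
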

5 independent signals. Dense partial rewards at every step. No sparse end-of-episode surprise. If you get the team right but fumble the priority, you learn something. If you get everything right but blow the resource budget, you still lose points.

*"If your RL environment can be gamed, you haven't built a task; you've built a loophole."*

---

## 📊 Training Results

**Reward Curve** (figure): GRPO training reward across 3 stages, 135 steps.

**Epoch Comparison** (figure): average reward per training epoch.

**Before vs After Training** (figure): behavioral comparison of model outputs.

**Training Hyperparameters** (figure): full config used for the v2 run.

We fine-tuned **Qwen2.5-7B-Instruct** using **GRPO** (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth on a Colab GPU.
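
GRPO's central idea is to score each sampled completion relative to the other completions drawn for the same prompt, instead of relying on a learned value function. Below is a minimal sketch of that normalization step in plain Python. The function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, and this is not TRL's internal code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division defined when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards `[0.2, 0.4, 0.6, 0.8]`, the two lower-scoring completions get negative advantages and the two higher-scoring ones get positive advantages. If every completion in a group scores the same, all advantages are zero and that group contributes no learning signal.
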

The first thing we discovered? **The base model immediately hallucinated an entirely new rescue team.**

```
❌ team: "emergency_services"   (not in the valid set)
❌ team: "utility repair"       (the agent made this up)
❌ priority: "very-high"        (also made up)
❌ priority: "immediately"      (still wrong)
```

The model had read enough emergency management documents to know the *vibe* of disaster response, but it had no idea what valid actions actually existed in our environment.

**That's exactly the kind of failure RL is designed to fix.**

After 3 training stages and 135 steps:

```
✅ team: "rescue"
✅ priority: "urgent"
✅ JSON output: perfectly structured
```
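
A check like the following could flag invented teams or priorities before an action reaches the environment. It is a sketch under stated assumptions: the action is assumed to be a JSON object with `team` and `priority` keys (inferred from the examples above), the allowed sets contain only the values named in this write-up, and `validate_action` is a hypothetical helper.

```python
import json

# Placeholder sets: only these values are named in this write-up.
# Replace them with the environment's real action space.
VALID_TEAMS = {"rescue"}
VALID_PRIORITIES = {"urgent", "high"}


def validate_action(raw):
    """Return a list of problems found in a JSON action string (empty if valid)."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(action, dict):
        return ["output is not a JSON object"]

    problems = []
    team = action.get("team")
    if not isinstance(team, str) or team not in VALID_TEAMS:
        problems.append(f"unknown team: {team!r}")
    priority = action.get("priority")
    if not isinstance(priority, str) or priority not in VALID_PRIORITIES:
        problems.append(f"unknown priority: {priority!r}")
    return problems
```
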

The model learned to stop inventing teams and priorities and to start operating within the defined action space. The earlier behavior is what we call **sparse reward collapse**: an RL failure mode where small models struggle to optimize multi-step interdependent workflows. Our environment was hard enough to expose it. That's a feature, not a bug.

---

## The Benchmark Results

We ran the trained model across all 3 difficulty tiers against the live deployed environment:

| Agent | Easy | Medium | Hard | **Avg** |
|-------|------|--------|------|---------|
| Heuristic Baseline (hardcoded rules) | 0.704 | 0.683 | 0.660 | **0.682** |
| **GRPO Qwen2.5-7B v2 (ours)** | 0.641 | 0.665 | 0.601 | **0.636** |

All 3 tiers: ✅ PASS ✅ PASS ✅ PASS
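
The averages in the table can be re-derived from the per-tier scores. The snippet below only reproduces that arithmetic and reports the gap between the two agents in both absolute and relative terms; the variable names are illustrative.

```python
# Per-tier scores copied from the table above.
baseline = {"easy": 0.704, "medium": 0.683, "hard": 0.660}
ours = {"easy": 0.641, "medium": 0.665, "hard": 0.601}


def average(scores):
    """Unweighted mean across the difficulty tiers."""
    return sum(scores.values()) / len(scores)


baseline_avg = average(baseline)
ours_avg = average(ours)
absolute_gap = baseline_avg - ours_avg       # difference in score points
relative_gap = absolute_gap / baseline_avg   # fraction of the baseline score

print(f"baseline avg: {baseline_avg:.3f}")
print(f"ours avg:     {ours_avg:.3f}")
print(f"absolute gap: {absolute_gap:.3f}")
print(f"relative gap: {relative_gap:.1%}")
```
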

The heuristic baseline uses hand-crafted regex patterns and keyword matching. Zero generalisation. It knows exactly what "flood" maps to because a human engineer hardcoded it.

Our model generates unique, contextually accurate handoff notes for every incident, with no hardcoded rules and no templates. It reads the situation and decides. It trails the hardcoded baseline by about 0.05 points on average (0.636 vs 0.682, roughly 7% relative) while doing *actual reasoning*, and that is the result that matters.

---

## The Dashboard: Because Judges Are Human Too

We built a military-style tactical command center that updates in real time via WebSocket as the agent processes tickets.

- 🗺️ OpenStreetMap with color-coded incident markers (red = urgent, orange = high, ✓ = resolved)
- ⚡ ARIA, an AI Incident Analyst powered by Gemini, available for live analysis of any incident
- 📊 Real-time score tracker, resource budget bar, team routing feed
- 🔔 Operations feed with audio alerts

**It is not a static demo.** When you run `inference.py`, the dashboard updates live. You can watch the agent work in real time.

[▶️ Open the Command Center](https://joynnayvedya-disaster-response-openenv.hf.space/ui/?task=all)

---

## Try It Yourself

```bash
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .

# Run the agent against the live environment.
# In PowerShell, use `$env:NAME = "value"` instead of `export`, and `py` instead of `python`.
export OPENENV_BASE_URL="https://joynnayvedya-disaster-response-openenv.hf.space"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_YOUR_TOKEN"
python inference.py
```
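
The four variables above are presumably read by `inference.py` at startup. Below is a minimal sketch of how a script could load and check them before contacting the environment. `load_settings` and `REQUIRED_VARS` are hypothetical names, not code from the repository.

```python
import os

# Variable names taken from the commands above.
REQUIRED_VARS = ("OPENENV_BASE_URL", "API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_settings(env=None):
    """Collect the required settings, failing early with a clear message."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```
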

---

## Links

| Resource | URL |
|----------|-----|
| 🤗 HF Space (Live Environment) | [joynnayvedya/disaster-response-openenv](https://huggingface.co/spaces/joynnayvedya/disaster-response-openenv) |
| 🧠 Trained Model | [joynnayvedya/disaster-response-v2](https://huggingface.co/joynnayvedya/disaster-response-v2) |
| 📓 Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/github/letsjoyn/meta-scalar-hack/blob/main/notebook99e7520250.ipynb) |
| 💻 GitHub | [letsjoyn/meta-scalar-hack](https://github.com/letsjoyn/meta-scalar-hack) |

---

*Built for the 2026 Meta & Scalar AI Hackathon, Grand Finale, Bangalore.*

*Every scenario based on a real disaster. Every reward signal designed to be unhackable.*

# Uploaded finetuned model

- **Developed by:** joynnayvedya
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Qwen2.5-7B-Instruct-bnb-4bit

This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)