Model: joynnayvedya/disaster-response-trained (Source: Original Platform)

| base_model | tags | license | language |
|---|---|---|---|
| unsloth/Qwen2.5-7B-Instruct-bnb-4bit | openenv, reinforcement-learning, disaster-response, grpo, qwen | apache-2.0 | |
Title: Teaching an LLM to Triage Disasters: Building an RL Environment with OpenEnv
Tags: openenv, reinforcement-learning, disaster-response, grpo, qwen
# Teaching an LLM to Triage Disasters 🚨
## The Problem
During a natural disaster, Emergency Operations Centers (EOCs) are overwhelmed by thousands of frantic incident reports. A flooded neighborhood, a chemical plant fire, a hospital wing collapse — all arriving simultaneously. Human coordinators must instantly decide: which team? what priority? what action?
We built an AI agent that does exactly this.
## The Environment
We built Disaster Response Coordination OpenEnv — a multi-step RL environment where an AI agent acts as an Emergency Incident Commander.
It includes 15 real-world scenarios across 3 difficulty tiers, modeled after actual disasters:
- 🌊 2018 Kerala Floods → dam spillway overflow, communication blackouts
- ☠️ 2020 Vizag Gas Leak → chemical plant fire, toxic plume evacuation
- ⚡ 2012 North India Grid Failure → cold-chain medicine failures, hospital blackouts
## Action Space
For every incident ticket, the agent must complete a 4-step workflow: `classify → set_priority → draft_reply → submit_ticket`.
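To make the workflow concrete, here is a minimal sketch with a stub environment. Note that `DummyEnv`, the `env.step()` signature, and the action dictionaries are assumptions for illustration, not the real OpenEnv API:

```python
# Sketch of one ticket moving through the 4-step workflow; env.step() and the
# action dictionaries are hypothetical, not the environment's actual interface.

class DummyEnv:
    """Stand-in environment returning a flat partial reward per step."""
    def step(self, action):
        done = action["action"] == "submit_ticket"
        return {}, 0.25, done  # (observation, reward, done)

def triage_ticket(env):
    """Run classify → set_priority → draft_reply → submit_ticket once."""
    actions = [
        {"action": "classify", "team": "rescue"},
        {"action": "set_priority", "priority": "urgent"},
        {"action": "draft_reply", "text": "Rescue team dispatched."},
        {"action": "submit_ticket"},
    ]
    total = 0.0
    for action in actions:
        obs, reward, done = env.step(action)  # dense reward at every step
        total += reward
    return total

print(triage_ticket(DummyEnv()))  # 1.0
```

The key point is that every one of the four steps returns a partial reward, so a policy gets a learning signal even when it fails later in the workflow.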
## Reward Function
`reward = 0.40 × team_routing + 0.30 × priority + 0.30 × reply_quality`
Dense, partial rewards at every step. No sparse end-of-episode signals.
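As a sketch, the stated weighting reduces to a simple weighted sum; the three component scores are assumed to come from the environment's own graders:

```python
# Illustrative implementation of the reward weighting above; the component
# scores (each assumed to be in [0, 1]) would come from the environment.

def step_reward(team_routing: float, priority: float, reply_quality: float) -> float:
    """Weighted sum of the three partial scores."""
    return 0.40 * team_routing + 0.30 * priority + 0.30 * reply_quality
```

A perfect episode (all three components at 1.0) scores 1.0, while correct routing alone is worth 0.4.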
## Difficulty Scaling
| Tier | Budget | Scenarios |
|---|---|---|
| 🟢 Easy | 40 | Single-team, clear incidents |
| 🟡 Medium | 48 | Multi-agency, ambiguous |
| 🔴 Hard | 55 | Cascading mass-casualty + time pressure |
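The tiers in the table could be captured in a small configuration mapping; the key names here, and the reading of "Budget" as a per-episode budget, are assumptions for illustration:

```python
# Hypothetical tier configuration mirroring the table above; field names and
# the exact meaning of "budget" are assumptions, not the environment's schema.
TIERS = {
    "easy":   {"budget": 40, "notes": "single-team, clear incidents"},
    "medium": {"budget": 48, "notes": "multi-agency, ambiguous"},
    "hard":   {"budget": 55, "notes": "cascading mass-casualty + time pressure"},
}
```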
## Training with GRPO
We trained Qwen2.5-7B-Instruct using GRPO (Group Relative Policy Optimization) via TRL + Unsloth on a Google Colab T4 GPU.
Setup:
- Base model: `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- Algorithm: GRPOTrainer (TRL)
- LoRA: r=16, 4-bit quantization
- Epochs: 3 | Steps: 14
- Reward: live environment feedback via the HF Space API
The reward function connected directly to our live HF Space — every training step sent real incident prompts to the environment and received real rewards back.
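A sketch of how a TRL-style reward function can wrap a remote environment call. TRL's `GRPOTrainer` accepts reward functions that take batches of prompts and completions and return one float per completion; `score_ticket` below is a hypothetical helper standing in for the actual HTTP call to the Space:

```python
# Build a reward function in the shape trl.GRPOTrainer expects: lists of
# prompts/completions in, one float per completion out. score_ticket is a
# hypothetical helper that would POST to the live HF Space and return a reward.

def make_env_reward(score_ticket):
    def env_reward(prompts, completions, **kwargs):
        return [score_ticket(p, c) for p, c in zip(prompts, completions)]
    return env_reward

# Usage with a stub scorer in place of the real Space call:
reward_fn = make_env_reward(lambda prompt, completion: 0.7)
print(reward_fn(["ticket-1", "ticket-2"], ["reply-1", "reply-2"]))  # [0.7, 0.7]
```

Keeping the scorer injectable makes it easy to swap the live Space call for a local stub when testing the training loop offline.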
## Training Reward Curve
## What We Discovered: Sparse Reward Collapse
The untrained base model immediately revealed why this environment is hard:
Before training, the model hallucinated invalid outputs:
- `team: "emergency_services"` ❌ (not a valid team)
- `team: "utility repair"` ❌
- `priority: "very-high"` ❌ (not a valid priority)
- `priority: "higher"` ❌

After training, the model learned the valid action spaces:
- `team: "rescue"` ✅
- `priority: "urgent"` ✅
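A toy validator illustrates what "learning the action space" means here. The full team and priority vocabularies are not listed in this post, so the sets below contain only the values confirmed valid above:

```python
# Toy action-space validator; only the values shown valid in this post are
# included, so these sets are deliberately partial, not the full spec.

VALID_TEAMS = {"rescue"}          # confirmed valid above
VALID_PRIORITIES = {"urgent"}     # confirmed valid above

def is_valid(team: str, priority: str) -> bool:
    return team in VALID_TEAMS and priority in VALID_PRIORITIES

print(is_valid("rescue", "urgent"))                  # True
print(is_valid("emergency_services", "very-high"))   # False
```

Before training, the model's free-form guesses fail this membership check; after training, its outputs land inside the fixed vocabularies.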
However, we observed sparse reward collapse — a known RL failure mode in which a small model (7B at 4-bit) struggles to optimize a multi-step workflow with interdependent rewards. This suggests the environment is genuinely difficult: it exposes real RL failure modes that larger models or longer training runs would be needed to overcome.
## Baseline Results
| Agent | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| Heuristic Baseline | 0.704 | 0.683 | 0.660 | 0.682 |
| GRPO Qwen2.5-7B | — | — | — | research ongoing |
The heuristic baseline passed all 3 difficulty tiers (score ≥ 0.6).
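For intuition, a heuristic baseline of this kind is typically a keyword router. The actual baseline's rules are not published here, so the keywords, team names, and fallback below are invented for the sketch:

```python
# Illustrative keyword-routing heuristic of the kind a baseline agent might
# use; all keywords, team names, and the fallback are assumptions.

KEYWORD_TO_TEAM = [
    ("fire", "fire"),
    ("flood", "rescue"),
    ("collapse", "rescue"),
    ("blackout", "utilities"),
]

def route(report: str) -> str:
    text = report.lower()
    for keyword, team in KEYWORD_TO_TEAM:
        if keyword in text:
            return team
    return "rescue"  # default when no keyword matches

print(route("Chemical plant fire, toxic plume"))       # fire
print(route("Hospital blackout, cold-chain failing"))  # utilities
```

Such rules handle the 🟢 Easy tier's clear, single-team incidents well, which is consistent with the baseline's scores degrading from Easy to Hard.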
## The Dashboard
We built a military-style tactical command dashboard with:
- 🗺️ Live OpenStreetMap incident markers with radar pulse animations
- ⚡ ARIA — AI Incident Analyst (Gemini-powered, analyses any incident live)
- 📊 Real-time score tracking, threat level bar, team routing
- 🔔 Operations feed with meaningful event notifications
## Links
| Resource | URL |
|---|---|
| 🚀 HF Space (Live Environment) | joynnayvedya/disaster-response-openenv |
| 🧠 Trained Model | joynnayvedya/disaster-response-trained |
| 💻 GitHub | letsjoyn/meta-scalar-hack |
## Try It Yourself

```bash
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .
python inference.py
```
Built for the 2026 Meta & Scalar AI Hackathon — Grand Finale, Bangalore.
## Uploaded Finetuned Model

- Developed by: joynnayvedya
- License: apache-2.0
- Finetuned from model: `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`

This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.


