---
base_model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
---

Title: Teaching an LLM to Triage Disasters: Building an RL Environment with OpenEnv
Tags: openenv, reinforcement-learning, disaster-response, grpo, qwen

# Teaching an LLM to Triage Disasters 🚨

## The Problem

During a natural disaster, Emergency Operations Centers (EOCs) are overwhelmed by thousands of frantic incident reports. A flooded neighborhood, a chemical plant fire, a hospital wing collapse — all arriving simultaneously. Human coordinators must instantly decide: which team? what priority? what action?

We built an AI agent that does exactly this.


## The Environment

We built Disaster Response Coordination OpenEnv — a multi-step RL environment where an AI agent acts as an Emergency Incident Commander.

15 real-world scenarios across 3 difficulty tiers, modeled after actual disasters:

  • 🌊 2018 Kerala Floods → dam spillway overflow, communication blackouts
  • ☠️ 2020 Vizag Gas Leak → chemical plant fire, toxic plume evacuation
  • 2012 North India Grid Failure → cold-chain medicine failures, hospital blackouts

## Action Space

For every incident ticket, the agent must complete a 4-step workflow: `classify → set_priority → draft_reply → submit_ticket`
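The workflow above can be sketched as a per-ticket episode loop. This is an illustrative stand-in, not the environment's actual API: `env_step` and `agent` are hypothetical callables, and the real OpenEnv action schema may differ.

```python
# Hypothetical sketch of the 4-step ticket workflow; the real
# environment's step/action API may look different.
WORKFLOW = ["classify", "set_priority", "draft_reply", "submit_ticket"]

def run_ticket(env_step, agent):
    """Walk one incident ticket through the fixed 4-step workflow.

    env_step(action) -> (observation, reward) stands in for the
    environment's step function; agent(step_name, obs) returns the
    agent's output for that step.
    """
    obs, total_reward = {"ticket": "new incident"}, 0.0
    for step_name in WORKFLOW:
        action = {"step": step_name, "value": agent(step_name, obs)}
        obs, reward = env_step(action)
        total_reward += reward  # dense, per-step reward accumulation
    return total_reward
```

Because every step returns a reward, the agent gets a learning signal even when it fails a later step of the same ticket.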

## Reward Function

`reward = 0.40 × team_routing + 0.30 × priority + 0.30 × reply_quality`

Dense, partial rewards at every step. No sparse end-of-episode signals.
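The weighted sum above translates directly into code. A minimal sketch, assuming each component score is normalized to [0, 1]:

```python
def ticket_reward(team_routing: float, priority: float, reply_quality: float) -> float:
    """Weighted per-ticket reward, per the formula above.

    Each component is a partial score in [0, 1], so the total also
    lies in [0, 1] and partial credit flows at every step.
    """
    return 0.40 * team_routing + 0.30 * priority + 0.30 * reply_quality
```

Team routing carries the largest weight, so the agent is pushed hardest toward sending the right team before fine-tuning priority and reply quality.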

## Difficulty Scaling

| Tier | Budget | Scenarios |
|---|---|---|
| 🟢 Easy | 40 | Single-team, clear incidents |
| 🟡 Medium | 48 | Multi-agency, ambiguous |
| 🔴 Hard | 55 | Cascading mass-casualty + time pressure |

## Training with GRPO

We trained Qwen2.5-7B-Instruct using GRPO (Group Relative Policy Optimization) via TRL + Unsloth on a Google Colab T4 GPU.

Setup:

  • Base model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
  • Algorithm: GRPOTrainer (TRL)
  • LoRA: r=16, 4-bit quantization
  • Epochs: 3 | Steps: 14
  • Reward: Live environment feedback via HF Space API

The reward function connected directly to our live HF Space — every training step sent real incident prompts to the environment and received real rewards back.
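A reward function in this style can be sketched as follows. The endpoint path and payload shape here are illustrative assumptions, not the project's actual API; the `(prompts, completions, **kwargs) -> list[float]` shape matches what TRL's `GRPOTrainer` expects from a custom reward function, and `post` is injectable so the function can be exercised without network access.

```python
import json
from urllib import request

# Hypothetical scoring endpoint; the Space's real route may differ.
SPACE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space/score"

def space_reward(prompts, completions, post=None, **kwargs):
    """TRL-style reward function backed by the live HF Space.

    Each completion is POSTed to the environment, which replies with
    the scalar reward for that rollout.
    """
    if post is None:
        def post(payload):
            req = request.Request(
                SPACE_URL,
                data=json.dumps(payload).encode(),
                headers={"Content-Type": "application/json"},
            )
            with request.urlopen(req) as resp:
                return json.load(resp)
    rewards = []
    for prompt, completion in zip(prompts, completions):
        result = post({"prompt": prompt, "completion": completion})
        rewards.append(float(result["reward"]))
    return rewards
```

One design consequence: every optimizer step pays a round-trip to the Space per completion, which keeps rewards live but makes network latency part of the training loop.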

## Training Reward Curve

*(figure: training reward curve)*


## What We Discovered: Sparse Reward Collapse

The untrained base model immediately revealed why this environment is hard:

Before training, the model hallucinated invalid outputs:

- `team: "emergency_services"` (not a valid team)
- `team: "utility repair"`
- `priority: "very-high"` (not a valid priority)
- `priority: "higher"`

After training, the model learned the valid action space:

- `team: "rescue"`
- `priority: "urgent"`

However, we also observed sparse reward collapse, a known RL failure mode in which a small model (7B at 4-bit) struggles to optimize a multi-step workflow with interdependent rewards. In a sense, this validates the environment: it is genuinely hard enough to expose real RL failure modes that larger models or longer training runs would be needed to overcome.
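The shift from hallucinated to valid outputs comes down to schema checking. A minimal sketch of that kind of action-space validation, using illustrative team and priority vocabularies (only `"rescue"` and `"urgent"` are confirmed by the examples above; the rest are assumptions):

```python
# Illustrative valid action sets; the environment's real team and
# priority vocabularies may differ.
VALID_TEAMS = {"rescue", "medical", "fire", "utilities", "logistics"}
VALID_PRIORITIES = {"low", "medium", "high", "urgent"}

def validate_action(team: str, priority: str) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    if team not in VALID_TEAMS:
        errors.append(f"invalid team: {team!r}")
    if priority not in VALID_PRIORITIES:
        errors.append(f"invalid priority: {priority!r}")
    return errors
```

Under this check, the base model's `"emergency_services"` / `"very-high"` outputs would be flagged, while the trained model's `"rescue"` / `"urgent"` pass cleanly.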


## Baseline Results

| Agent | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| Heuristic Baseline | 0.704 | 0.683 | 0.660 | 0.682 |
| GRPO Qwen2.5-7B | research ongoing | | | |

All 3 difficulty tiers passed (score ≥ 0.6).


## The Dashboard

We built a military-style tactical command dashboard with:

  • 🗺️ Live OpenStreetMap incident markers with radar pulse animations
  • ARIA — AI Incident Analyst (Gemini-powered, analyses any incident live)
  • 📊 Real-time score tracking, threat level bar, team routing
  • 🔔 Operations feed with meaningful event notifications

*(figure: dashboard)*


## Resources

| Resource | Link |
|---|---|
| 🚀 HF Space (Live Environment) | joynnayvedya/disaster-response-openenv |
| 🧠 Trained Model | joynnayvedya/disaster-response-trained |
| 💻 GitHub | letsjoyn/meta-scalar-hack |

## Try It Yourself

```bash
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .
python inference.py
```

Built for the 2026 Meta & Scalar AI Hackathon — Grand Finale, Bangalore.

## Uploaded finetuned model

  • Developed by: joynnayvedya
  • License: apache-2.0
  • Finetuned from model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit

This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

## Description

Model synced from source: joynnayvedya/disaster-response-trained