---
base_model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
---

Title: Teaching an LLM to Triage Disasters: Building an RL Environment with OpenEnv
Tags: openenv reinforcement-learning disaster-response grpo qwen

# Teaching an LLM to Triage Disasters 🚨

## The Problem

During a natural disaster, Emergency Operations Centers (EOCs) are overwhelmed by thousands of frantic incident reports. A flooded neighborhood, a chemical plant fire, a hospital wing collapse, all arriving simultaneously. Human coordinators must instantly decide: which team? What priority? What action?

We built an AI agent that does exactly this.

---

## The Environment

We built **Disaster Response Coordination OpenEnv**, a multi-step RL environment in which an AI agent acts as an Emergency Incident Commander.

**15 real-world scenarios** across 3 difficulty tiers, modeled after actual disasters:

- 🌊 2018 Kerala Floods → dam spillway overflow, communication blackouts
- ☠️ 2020 Vizag Gas Leak → chemical plant fire, toxic plume evacuation
- ⚡ 2012 North India Grid Failure → cold-chain medicine failures, hospital blackouts

### Action Space

For every incident ticket, the agent must complete a 4-step workflow (a code sketch of one full episode appears after the baseline results below):

```
classify → set_priority → draft_reply → submit_ticket
```

### Reward Function

```
reward = 0.40 × team_routing + 0.30 × priority + 0.30 × reply_quality
```

Dense, partial rewards at every step. No sparse end-of-episode signals.

### Difficulty Scaling

| Tier | Budget | Scenarios |
|------|--------|-----------|
| 🟢 Easy | 40 | Single-team, clear incidents |
| 🟡 Medium | 48 | Multi-agency, ambiguous |
| 🔴 Hard | 55 | Cascading mass-casualty + time pressure |

---

## Training with GRPO

We trained **Qwen2.5-7B-Instruct** using GRPO (Group Relative Policy Optimization) via TRL + Unsloth on a Google Colab T4 GPU.

**Setup:**

- Base model: `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- Algorithm: GRPOTrainer (TRL)
- LoRA: r=16, 4-bit quantization
- Epochs: 3 | Steps: 14
- Reward: live environment feedback via the HF Space API

The reward function connected directly to our live HF Space: every training step sent real incident prompts to the environment and received real rewards back.

### Training Reward Curve

![reward_curve](reward_curve.png)

---

## What We Discovered: Sparse Reward Collapse

The untrained base model immediately revealed why this environment is hard.

**Before training**, the model hallucinated invalid outputs:

```
team: "emergency_services"   ❌ (not a valid team)
team: "utility repair"       ❌
priority: "very-high"        ❌ (not a valid priority)
priority: "higher"           ❌
```

**After training**, the model produced actions from the valid action space:

```
team: "rescue"       ✅
priority: "urgent"   ✅
```

However, we observed **sparse reward collapse**, a known RL failure mode in which a small model (here, 7B at 4-bit) struggles to optimize across a multi-step workflow with interdependent rewards. This validates the environment's quality: it is genuinely difficult enough to expose real RL failure modes that larger models or longer training runs would be needed to overcome.

---

## Baseline Results

| Agent | Easy | Medium | Hard | **Avg** |
|-------|------|--------|------|---------|
| Heuristic Baseline | 0.704 | 0.683 | 0.660 | **0.682** |
| GRPO Qwen2.5-7B | – | – | – | research ongoing |

All 3 difficulty tiers **passed** (score ≥ 0.6).
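---

## Code Sketches

### Walking one ticket through the environment

To make the 4-step workflow concrete, here is a minimal sketch of one episode against the live Space. The `/reset` and `/step` endpoint paths, the payload fields, and the observation keys are assumptions for illustration; they are not the environment's documented API.

```python
import requests

# Assumed base URL of the live HF Space; real endpoint paths and
# payload schemas may differ from this sketch.
BASE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space"

def run_episode() -> float:
    """Walk one incident ticket through the 4-step workflow."""
    # Start a new episode and receive the first incident report.
    obs = requests.post(f"{BASE_URL}/reset", timeout=30).json()
    print("Incident:", obs.get("report"))  # assumed observation field

    # One action per workflow step; the environment scores each step,
    # so partial credit accrues even if a later step is wrong.
    actions = [
        {"tool": "classify",      "args": {"team": "rescue"}},
        {"tool": "set_priority",  "args": {"priority": "urgent"}},
        {"tool": "draft_reply",   "args": {"text": "Rescue team dispatched to the spillway."}},
        {"tool": "submit_ticket", "args": {}},
    ]
    total_reward = 0.0
    for action in actions:
        result = requests.post(f"{BASE_URL}/step", json=action, timeout=30).json()
        total_reward += result["reward"]  # dense, per-step reward
        if result.get("done"):
            break
    return total_reward

if __name__ == "__main__":
    print("Episode reward:", run_episode())
```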
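### Computing the reward

The weighted reward defined earlier is straightforward to reproduce. The sketch below assumes each component score has already been normalized to [0, 1]; the function and argument names are illustrative, not the environment's internals.

```python
def ticket_reward(team_score: float, priority_score: float, reply_score: float) -> float:
    """Weighted per-ticket reward; each component is assumed to lie in [0, 1]."""
    return 0.40 * team_score + 0.30 * priority_score + 0.30 * reply_score

# Example: correct team and priority, mediocre reply text.
print(ticket_reward(1.0, 1.0, 0.5))  # 0.40 + 0.30 + 0.15 = 0.85
```

Note that correct routing and priority alone already earn 0.70, which is how the per-step partial credit keeps the learning signal dense.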
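### GRPO training loop

Below is a condensed sketch of the setup listed in the training section: Unsloth's 4-bit loading plus TRL's `GRPOTrainer`. The `/score` endpoint, the example prompts, and the hyperparameters not stated in this card (learning rate, generations per prompt, sequence lengths) are assumptions, not the actual training script.

```python
import requests
from datasets import Dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# 4-bit base model with a rank-16 LoRA adapter, as in the setup above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical scoring endpoint on the live Space (illustrative only).
BASE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space"

def env_reward(prompts, completions, **kwargs):
    """TRL reward function: returns one float per completion, fetched from the env."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        resp = requests.post(
            f"{BASE_URL}/score",
            json={"prompt": prompt, "completion": completion},
            timeout=60,
        )
        rewards.append(float(resp.json()["reward"]))
    return rewards

# Placeholder prompts; the real run used incident tickets from the environment.
dataset = Dataset.from_dict({"prompt": [
    "Incident: dam spillway overflow, power down. Classify the ticket.",
    "Incident: toxic plume near chemical plant. Classify the ticket.",
]})

config = GRPOConfig(
    output_dir="grpo-disaster",
    num_train_epochs=3,          # matches the card
    num_generations=4,           # assumed group size
    max_completion_length=256,   # assumed
    learning_rate=5e-6,          # assumed
)
trainer = GRPOTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    reward_funcs=[env_reward],
    train_dataset=dataset,
)
trainer.train()
```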
---

## The Dashboard

We built a military-style tactical command dashboard with:

- 🗺️ Live OpenStreetMap incident markers with radar pulse animations
- ⚡ ARIA, an AI Incident Analyst (Gemini-powered; analyses any incident live)
- 📊 Real-time score tracking, threat-level bar, team routing
- 🔔 Operations feed with meaningful event notifications

![dashboard](dashboard.png)

---

## Links

| Resource | URL |
|----------|-----|
| 🚀 HF Space (Live Environment) | [joynnayvedya/disaster-response-openenv](https://huggingface.co/spaces/joynnayvedya/disaster-response-openenv) |
| 🧠 Trained Model | [joynnayvedya/disaster-response-trained](https://huggingface.co/joynnayvedya/disaster-response-trained) |
| 💻 GitHub | [letsjoyn/meta-scalar-hack](https://github.com/letsjoyn/meta-scalar-hack) |

---

## Try It Yourself

```bash
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .
python inference.py
```

---

*Built for the 2026 Meta & Scalar AI Hackathon, Grand Finale, Bangalore.*

# Uploaded finetuned model

- **Developed by:** joynnayvedya
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Qwen2.5-7B-Instruct-bnb-4bit

This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
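For a quick smoke test without cloning the repo, the snippet below loads the uploaded checkpoint with Unsloth and generates one reply. It assumes the repo hosts a LoRA adapter on the 4-bit base named above; the prompt text is illustrative.

```python
from unsloth import FastLanguageModel

# Load the finetuned checkpoint (assumed to be a LoRA adapter on the
# 4-bit Qwen2.5-7B base named in this card).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="joynnayvedya/disaster-response-trained",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast generation path

messages = [{"role": "user",
             "content": "Flood reported near the dam spillway; power is out. Triage this ticket."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```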