---
base_model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
---
Title: Teaching an LLM to Triage Disasters: Building an RL Environment with OpenEnv
Tags: openenv reinforcement-learning disaster-response grpo qwen
# Teaching an LLM to Triage Disasters 🚨
## The Problem
During a natural disaster, Emergency Operations Centers (EOCs) are overwhelmed
by thousands of frantic incident reports. A flooded neighborhood, a chemical
plant fire, a hospital wing collapse — all arriving simultaneously. Human
coordinators must instantly decide: which team? what priority? what action?
We built an AI agent that does exactly this.
---
## The Environment
We built **Disaster Response Coordination OpenEnv** — a multi-step RL environment
where an AI agent acts as an Emergency Incident Commander.
**15 real-world scenarios** across 3 difficulty tiers, modeled after actual disasters:
- 🌊 2018 Kerala Floods → dam spillway overflow, communication blackouts
- ☠️ 2020 Vizag Gas Leak → chemical plant fire, toxic plume evacuation
- ⚡ 2012 North India Grid Failure → cold-chain medicine failures, hospital blackouts
### Action Space
For every incident ticket, the agent must complete a 4-step workflow:
```
classify → set_priority → draft_reply → submit_ticket
```
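For illustration, here is a minimal sketch of how one episode might be driven. `DisasterResponseEnv` and its gym-style `reset()`/`step()` interface are assumed names for this sketch, not the published OpenEnv client API:

```python
# Hypothetical episode loop. DisasterResponseEnv and the gym-style
# reset()/step() tuple are assumptions for illustration; the real
# OpenEnv client API may differ.
from typing import Any

def run_episode(env: Any) -> float:
    """Drive the 4-step workflow for one incident ticket."""
    ticket = env.reset()  # incident report text
    steps = [
        {"action": "classify", "team": "rescue"},
        {"action": "set_priority", "priority": "urgent"},
        {"action": "draft_reply", "text": "Boats dispatched; shelter opened."},
        {"action": "submit_ticket"},
    ]
    total = 0.0
    for step in steps:
        ticket, reward, done, info = env.step(step)
        total += reward  # dense partial reward at every step
    return total
```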
### Reward Function
```
reward = 0.40 × team_routing + 0.30 × priority + 0.30 × reply_quality
```
Dense, partial rewards at every step. No sparse end-of-episode signals.
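As a worked example of the weighting, here is a minimal sketch; the component scores (each in [0, 1]) are assumed to come from the environment's own graders:

```python
# Minimal sketch of the per-ticket reward, directly from the weights above.
# Component scores in [0, 1] would come from the environment's grading;
# here they are just function arguments.

def ticket_reward(team_routing: float, priority: float, reply_quality: float) -> float:
    """Weighted sum: 0.40 * routing + 0.30 * priority + 0.30 * reply quality."""
    return 0.40 * team_routing + 0.30 * priority + 0.30 * reply_quality

# Example: correct team, correct priority, mediocre reply.
print(ticket_reward(1.0, 1.0, 0.5))  # ≈ 0.85
```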
### Difficulty Scaling
| Tier | Budget | Scenarios |
|------|--------|-----------|
| 🟢 Easy | 40 | Single-team, clear incidents |
| 🟡 Medium | 48 | Multi-agency, ambiguous |
| 🔴 Hard | 55 | Cascading mass-casualty + time pressure |
---
## Training with GRPO
We trained **Qwen2.5-7B-Instruct** using GRPO (Group Relative Policy Optimization)
via TRL + Unsloth on a Google Colab T4 GPU.
**Setup:**
- Base model: `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- Algorithm: GRPOTrainer (TRL)
- LoRA: r=16, 4-bit quantization
- Epochs: 3 | Steps: 14
- Reward: Live environment feedback via HF Space API
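The snippet below is a condensed sketch of this setup using Unsloth and TRL's `GRPOTrainer`. `incident_dataset`, `score_with_environment` (sketched in the next snippet), the `target_modules` list, and the `GRPOConfig` arguments are placeholders; the exact values we used may differ:

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the 4-bit base model and attach a LoRA adapter (r=16).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
)

def env_reward(prompts, completions, **kwargs):
    # One score per completion, fetched from the live environment.
    return [score_with_environment(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="grpo-disaster", num_train_epochs=3),
    train_dataset=incident_dataset,  # placeholder: prompts built from incident tickets
)
trainer.train()
```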
The reward function connected directly to our live HF Space — every training
step sent real incident prompts to the environment and received real rewards back.
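A minimal sketch of what that bridge might look like; the `/score` route and the JSON shapes are assumptions, not the Space's documented API:

```python
import requests

SPACE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space"

def score_with_environment(prompt: str, completion: str) -> float:
    """POST one (prompt, completion) pair to the Space and read back a reward.

    The /score route and response schema are hypothetical; consult the
    Space's own API docs for the real endpoint.
    """
    resp = requests.post(
        f"{SPACE_URL}/score",
        json={"prompt": prompt, "completion": completion},
        timeout=30,
    )
    resp.raise_for_status()
    return float(resp.json()["reward"])
```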
### Training Reward Curve
![reward_curve](reward_curve.png)
---
## What We Discovered: Sparse Reward Collapse
The untrained base model immediately revealed why this environment is hard:
**Before training**, the model hallucinated invalid outputs:
team: "emergency_services" ❌ (not a valid team)
team: "utility repair" ❌
priority: "very-high" ❌ (not a valid priority)
priority: "higher" ❌
**After training**, the model learned valid action spaces:
team: "rescue" ✅
priority: "urgent" ✅
However, we also observed **sparse reward collapse**, a known RL failure mode in which a small model (7B at 4-bit) struggles to optimize across a multi-step workflow with interdependent rewards. In a sense, this validates the environment's quality: it is genuinely difficult enough to expose real RL failure modes, which larger models or longer training runs would be needed to overcome.
---
## Baseline Results
| Agent | Easy | Medium | Hard | **Avg** |
|-------|------|--------|------|---------|
| Heuristic Baseline | 0.704 | 0.683 | 0.660 | **0.682** |
| GRPO Qwen2.5-7B | — | — | — | research ongoing |
The heuristic baseline **passed** all 3 difficulty tiers (score ≥ 0.6).
---
## The Dashboard
We built a military-style tactical command dashboard with:
- 🗺️ Live OpenStreetMap incident markers with radar pulse animations
- ⚡ ARIA — AI Incident Analyst (Gemini-powered, analyzes any incident live)
- 📊 Real-time score tracking, threat level bar, team routing
- 🔔 Operations feed with meaningful event notifications
![dashboard](dashboard.png)
---
## Links
| Resource | URL |
|----------|-----|
| 🚀 HF Space (Live Environment) | [joynnayvedya/disaster-response-openenv](https://huggingface.co/spaces/joynnayvedya/disaster-response-openenv) |
| 🧠 Trained Model | [joynnayvedya/disaster-response-trained](https://huggingface.co/joynnayvedya/disaster-response-trained) |
| 💻 GitHub | [letsjoyn/meta-scalar-hack](https://github.com/letsjoyn/meta-scalar-hack) |
---
## Try It Yourself
```bash
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .
python inference.py
```
---
*Built for the 2026 Meta & Scalar AI Hackathon — Grand Finale, Bangalore.*
# Uploaded finetuned model
- **Developed by:** joynnayvedya
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Qwen2.5-7B-Instruct-bnb-4bit
This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)