---
base_model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
---

Title: Teaching an LLM to Triage Disasters: Building an RL Environment with OpenEnv
Tags: openenv reinforcement-learning disaster-response grpo qwen

# Teaching an LLM to Triage Disasters 🚨

## The Problem

During a natural disaster, Emergency Operations Centers (EOCs) are overwhelmed
by thousands of frantic incident reports. A flooded neighborhood, a chemical
plant fire, a hospital wing collapse: all arriving simultaneously. Human
coordinators must instantly decide: which team? What priority? What action?

We built an AI agent that does exactly this.

---

## The Environment

We built **Disaster Response Coordination OpenEnv**, a multi-step RL environment
where an AI agent acts as an Emergency Incident Commander.

**15 real-world scenarios** across 3 difficulty tiers, modeled after actual disasters:

- 🌊 2018 Kerala Floods → dam spillway overflow, communication blackouts
- ☠️ 2020 Vizag Gas Leak → chemical plant fire, toxic plume evacuation
- ⚡ 2012 North India Grid Failure → cold-chain medicine failures, hospital blackouts

### Action Space

For every incident ticket, the agent must complete a 4-step workflow:

`classify → set_priority → draft_reply → submit_ticket`

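To make the loop concrete, here is a minimal sketch of one episode. The client class, constructor arguments, and gym-style return tuple are assumptions for illustration, not the repo's exact API:

```python
# Hypothetical walkthrough of the 4-step workflow; names are illustrative.
from disaster_env import DisasterResponseEnv  # assumed client module

env = DisasterResponseEnv(difficulty="easy")
obs = env.reset()  # obs carries the incident ticket text

# Step 1: route the ticket to a response team.
obs, r1, done, info = env.step({"action": "classify", "team": "rescue"})
# Step 2: assign a priority level.
obs, r2, done, info = env.step({"action": "set_priority", "priority": "urgent"})
# Step 3: draft the reply sent back to the reporter.
obs, r3, done, info = env.step({"action": "draft_reply",
                                "text": "Rescue team dispatched; move to higher ground."})
# Step 4: submit, finalizing the ticket and its weighted score.
obs, r4, done, info = env.step({"action": "submit_ticket"})

print("episode reward:", r1 + r2 + r3 + r4)
```
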
### Reward Function

`reward = 0.40 × team_routing + 0.30 × priority + 0.30 × reply_quality`

Dense, partial rewards at every step. No sparse end-of-episode signals.

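As code, the weighting is a plain weighted sum. The sketch below assumes each component has already been normalized to [0, 1]; the function name is ours, not the repo's:

```python
def ticket_reward(team_routing: float, priority: float, reply_quality: float) -> float:
    """Weighted ticket score; each component is assumed to lie in [0, 1]."""
    return 0.40 * team_routing + 0.30 * priority + 0.30 * reply_quality

# Partial credit: a perfect route and priority with a mediocre reply still scores well.
print(ticket_reward(1.0, 1.0, 0.5))  # 0.85
```
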
### Difficulty Scaling

| Tier | Budget | Scenarios |
|------|--------|-----------|
| 🟢 Easy | 40 | Single-team, clear incidents |
| 🟡 Medium | 48 | Multi-agency, ambiguous |
| 🔴 Hard | 55 | Cascading mass-casualty + time pressure |

---

## Training with GRPO

We trained **Qwen2.5-7B-Instruct** using GRPO (Group Relative Policy Optimization)
via TRL + Unsloth on a Google Colab T4 GPU.

**Setup:**
- Base model: `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- Algorithm: GRPOTrainer (TRL)
- LoRA: r=16, 4-bit quantization (sketched below)
- Epochs: 3 | Steps: 14
- Reward: live environment feedback via the HF Space API

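As a rough sketch of how that setup comes together with Unsloth (the sequence length and target-module list here are assumptions, not values confirmed by our notebook):

```python
from unsloth import FastLanguageModel

# Load the 4-bit quantized base model named in the Setup list.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,  # assumed; long enough for incident tickets
    load_in_4bit=True,
)

# Attach LoRA adapters (r=16) so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed defaults
    lora_alpha=16,
)
```
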
The reward function connected directly to our live HF Space: every training
step sent real incident prompts to the environment and received real rewards back.

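Wiring-wise, TRL's `GRPOTrainer` accepts a custom reward function, so each group of sampled completions can be scored by the Space. The endpoint route, payload shape, and stand-in dataset below are assumptions; the actual training notebook lives in the repo:

```python
import requests
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

SPACE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space"

def env_reward(prompts, completions, **kwargs):
    """Score each sampled completion by asking the live environment."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        resp = requests.post(f"{SPACE_URL}/score",  # hypothetical route
                             json={"incident": prompt, "response": completion})
        rewards.append(float(resp.json().get("reward", 0.0)))
    return rewards

# One-row stand-in dataset; real prompts come from the 15 scenarios.
train_dataset = Dataset.from_dict(
    {"prompt": ["Flood reported near the dam spillway; power is down."]})

config = GRPOConfig(output_dir="grpo-disaster",
                    num_generations=4,              # group size for relative advantages
                    per_device_train_batch_size=4,  # must be divisible by num_generations
                    max_steps=14)

trainer = GRPOTrainer(model=model,  # the LoRA model prepared above
                      args=config,
                      reward_funcs=env_reward,
                      train_dataset=train_dataset,
                      processing_class=tokenizer)
trainer.train()
```
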
### Training Reward Curve

![Training reward curve](training_rewards.png)

---

## What We Discovered: Sparse Reward Collapse

The untrained base model immediately revealed why this environment is hard.

**Before training**, the model hallucinated invalid outputs:

- `team: "emergency_services"` ❌ (not a valid team)
- `team: "utility repair"` ❌
- `priority: "very-high"` ❌ (not a valid priority)
- `priority: "higher"` ❌

**After training**, the model learned the valid action space:

- `team: "rescue"` ✅
- `priority: "urgent"` ✅

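One way to see why those outputs score zero: the environment can only credit actions drawn from fixed vocabularies. In the sketch below, only `rescue` and `urgent` are confirmed by the outputs above; the other entries are placeholder assumptions:

```python
# Hypothetical action vocabularies; only "rescue" and "urgent" are confirmed above.
VALID_TEAMS = {"rescue", "medical", "fire", "utilities", "logistics"}
VALID_PRIORITIES = {"low", "medium", "high", "urgent"}

def routing_credit(team: str) -> float:
    # Out-of-vocabulary labels like "emergency_services" earn nothing,
    # which is exactly where the untrained model kept losing reward.
    return 1.0 if team in VALID_TEAMS else 0.0

print(routing_credit("emergency_services"))  # 0.0 (hallucinated label)
print(routing_credit("rescue"))              # 1.0
```
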
However, we observed **sparse reward collapse**, a known RL failure mode in which
a small model (7B at 4-bit) struggles to optimize across a multi-step workflow
with interdependent rewards. In a sense this validates the environment: it is
genuinely difficult enough to expose real RL failure modes that larger models
or longer training runs would be needed to overcome.

---

## Baseline Results

| Agent | Easy | Medium | Hard | **Avg** |
|-------|------|--------|------|---------|
| Heuristic Baseline | 0.704 | 0.683 | 0.660 | **0.682** |
| GRPO Qwen2.5-7B | — | — | — | research ongoing |

The heuristic baseline **passed** all 3 difficulty tiers (score ≥ 0.6).

---

## The Dashboard

We built a military-style tactical command dashboard with:

- 🗺️ Live OpenStreetMap incident markers with radar pulse animations
- ⚡ ARIA, an AI Incident Analyst (Gemini-powered, analyzes any incident live)
- 📊 Real-time score tracking, threat-level bar, team routing
- 🔔 Operations feed with meaningful event notifications

![Dashboard](dashboard.png)

---

## Links

| Resource | URL |
|----------|-----|
| 🚀 HF Space (Live Environment) | [joynnayvedya/disaster-response-openenv](https://huggingface.co/spaces/joynnayvedya/disaster-response-openenv) |
| 🧠 Trained Model | [joynnayvedya/disaster-response-trained](https://huggingface.co/joynnayvedya/disaster-response-trained) |
| 💻 GitHub | [letsjoyn/meta-scalar-hack](https://github.com/letsjoyn/meta-scalar-hack) |

---

## Try It Yourself

```bash
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .
python inference.py
```

---

*Built for the 2026 Meta & Scalar AI Hackathon (Grand Finale, Bangalore).*

# Uploaded finetuned model

- **Developed by:** joynnayvedya
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Qwen2.5-7B-Instruct-bnb-4bit

This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
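
For quick experimentation, the trained adapter can be loaded the same way the base model was. This is a minimal sketch assuming the repo id above and the model's default chat template, not our exact `inference.py`:

```python
from unsloth import FastLanguageModel

# Load the published fine-tune in 4-bit for inference.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="joynnayvedya/disaster-response-trained",
    max_seq_length=2048,  # assumed
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast generation path

messages = [{"role": "user",
             "content": "Incident: chemical plume drifting toward a school. "
                        "Classify and prioritize."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt",
                                       add_generation_prompt=True).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0]))
```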