---
license: apache-2.0
base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
tags:
- qwen2
- unsloth
- trl
- grpo
- rl-training
- hallucination-detection
- multi-agent
- text-generation
language:
- en
---

# PropagationShield-v1-GRPO

**The first LLM fine-tuned to detect and resist hallucinations injected by
upstream agents in a multi-agent pipeline.**

## The Problem

When AI agents work in pipelines, one hallucination upstream poisons every
agent downstream. If no agent questions a fabricated lab value, a misquoted
guideline, or a made-up statistic, it flows through to the final output
as confident, wrong information.

No existing training method addresses this. Until now.

## What This Model Does

This model was trained with **PropagationShield**, an RL environment built
on OpenEnv that:
1. Injects parameterised hallucinations into the agent's context (5 types,
   3 difficulty tiers)
2. Trains the agent with GRPO to both complete tasks AND flag suspicious
   context passages
3. Uses 4 independent reward functions: task accuracy, detection F1, format
   compliance, and an anti-propagation penalty

Given a task and its context passages, the model outputs:
```json
{
  "answer": "<task answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
```

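Because downstream agents consume this JSON, it can help to validate a response before trusting it. The following is a minimal sketch, not part of the released code: `parse_shield_output` is a hypothetical helper, and the checks (zero-based non-negative `passage_index`, string `answer`, `confidence` within 0.0 to 1.0) are assumptions read off the example above.

```python
import json


def parse_shield_output(raw: str) -> dict:
    """Parse one model response and raise ValueError if it breaks the schema.

    Hypothetical helper: the field rules below are inferred from the example
    output in this card, not taken from the training code.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if not isinstance(data.get("answer"), str):
        raise ValueError("'answer' must be a string")

    flags = data.get("suspicion_flags")
    if not isinstance(flags, list):
        raise ValueError("'suspicion_flags' must be a list")

    for flag in flags:
        if not isinstance(flag, dict):
            raise ValueError("each suspicion flag must be an object")
        index = flag.get("passage_index")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(index, bool) or not isinstance(index, int) or index < 0:
            raise ValueError("'passage_index' must be a non-negative integer")
        if not isinstance(flag.get("reason"), str):
            raise ValueError("'reason' must be a string")
        confidence = flag.get("confidence")
        if (
            isinstance(confidence, bool)
            or not isinstance(confidence, (int, float))
            or not 0.0 <= confidence <= 1.0
        ):
            raise ValueError("'confidence' must be a number in [0.0, 1.0]")

    return data
```

A caller might wrap this in a `try`/`except ValueError` and treat a malformed response as untrusted rather than forwarding it.
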
## Training Details

| Detail | Value |
|--------|-------|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |

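This card does not spell out how the detection reward is computed. As an illustration only, a detection F1 over passage indices could look like the sketch below. `detection_f1` is a hypothetical name, and scoring the "nothing injected, nothing flagged" case as 1.0 is an assumption, not something stated by the authors.

```python
def detection_f1(flagged: set[int], injected: set[int]) -> float:
    """F1 between flagged passage indices and the truly injected ones.

    Illustrative sketch; the environment's real R_detect may differ.
    """
    if not flagged and not injected:
        # Assumption: a clean context with no flags counts as a perfect score.
        return 1.0

    true_positives = len(flagged & injected)
    if true_positives == 0:
        # Either everything flagged was wrong or every injection was missed.
        return 0.0

    precision = true_positives / len(flagged)
    recall = true_positives / len(injected)
    return 2 * precision * recall / (precision + recall)
```

For example, flagging passages `{1, 2}` when only passage `2` was injected gives a precision of 0.5 and a recall of 1.0, so the score sits between the two rather than rewarding over-flagging.
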
## Results

| Metric | Before Training | After Training |
|--------|----------------|----------------|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pragunk/PropagationShield")
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a
safety-critical multi-agent pipeline. Some context passages may contain
deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<your task answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3.",  # injected hallucination
]

# Number each passage so the model can refer to it by index.
passages = "\n".join(f"[{i}] {passage}" for i, passage in enumerate(context))
user_message = f"Query: What was Q3 revenue?\n\nContext:\n{passages}"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message},
]

# add_generation_prompt=True appends the assistant turn marker so the model
# starts a new reply instead of continuing the user message.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

# max_new_tokens=256 is an arbitrary cap; raise it for longer answers.
output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```

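One way a pipeline might act on the flags is to drop suspicious passages before handing the context to the next agent. The sketch below is an assumption about downstream use, not part of the released code: `drop_flagged` and its `threshold` default of 0.5 are hypothetical.

```python
def drop_flagged(
    context: list[str], flags: list[dict], threshold: float = 0.5
) -> list[str]:
    """Return the context without passages flagged at or above `threshold`.

    Hypothetical helper: the threshold value is illustrative, not tuned.
    """
    suspicious = {
        flag["passage_index"] for flag in flags if flag["confidence"] >= threshold
    }
    return [
        passage for index, passage in enumerate(context) if index not in suspicious
    ]
```

With the example above, a flag on passage `2` at confidence 0.87 would leave only the first two passages for the downstream agent. A stricter threshold keeps more context at the cost of letting low-confidence flags through.
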
## Demo Application

PropagationShield powers **HealthGuard**, an AI clinical triage assistant
that demonstrates hallucination containment in a hospital pipeline setting.

## Links

- 📓 Training Notebook: [Colab Notebook](#)
- 🏥 Demo: [HealthGuard Space](#)
- 💻 Code: [GitHub](#)

## Citation

Trained at Meta x OpenEnv Hackathon, April 2026.