Initialize project; model provided by the ModelHub XC community
Model: pragunk/PropagationShield Source: Original Platform
132
README.md
Normal file
@@ -0,0 +1,132 @@
---
license: apache-2.0
base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
tags:
- qwen2
- unsloth
- trl
- grpo
- rl-training
- hallucination-detection
- multi-agent
- text-generation
language:
- en
---

# PropagationShield-v1-GRPO

**The first LLM fine-tuned to detect and resist hallucinations injected by
upstream agents in a multi-agent pipeline.**

## The Problem

When AI agents work in pipelines, one hallucination upstream poisons every
agent downstream. If no agent questions a fabricated lab value, a misquoted
guideline, or a made-up statistic, it flows through to the final output
as confident, wrong information.

No existing training method addresses this. Until now.

## What This Model Does

This model was trained with **PropagationShield**, an RL environment built
on OpenEnv that:
1. Injects parameterised hallucinations into the agent's context (5 types,
3 difficulty tiers)
2. Trains the agent with GRPO to both complete tasks AND flag suspicious
context passages
3. Uses 4 independent reward functions: task accuracy, detection F1, format
compliance, and an anti-propagation penalty
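
As an illustration of the detection F1 reward named in step 3, here is a minimal sketch that scores the set of flagged passage indices against the set of injected ones. The function name `detection_f1`, its signature, and the handling of a clean context (no injections, no flags) are assumptions for illustration; the environment's actual reward code is not shown in this README.

```python
def detection_f1(flagged_indices, injected_indices):
    """F1 between the passages the model flagged and the passages that were injected.

    Hypothetical helper: the real PropagationShield reward may differ.
    """
    flagged = set(flagged_indices)
    injected = set(injected_indices)
    if not flagged and not injected:
        # Assumption: a clean context with no flags counts as a perfect score.
        return 1.0
    true_positives = len(flagged & injected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(injected)
    return 2 * precision * recall / (precision + recall)
```

Flagging every passage maximises recall but lowers precision, so an F1-style reward discourages blanket flagging.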

Given any task + context, this model outputs:
```json
{
  "answer": "<task answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
```
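
Downstream code will usually want to validate this structure before trusting it. The sketch below checks the fields shown in the JSON above using only the standard library. The function name `parse_model_output` and the exact validation rules (zero-based indices within the context length, confidence in [0, 1]) are assumptions based on the examples in this README, not a published schema.

```python
import json


def parse_model_output(text, num_passages):
    """Validate the model's JSON reply and return (answer, suspicion_flags).

    Hypothetical helper. Raises ValueError on malformed output
    (json.JSONDecodeError is a subclass of ValueError).
    """
    data = json.loads(text)
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing string field 'answer'")
    flags = data.get("suspicion_flags")
    if not isinstance(flags, list):
        raise ValueError("'suspicion_flags' must be a list")
    for flag in flags:
        if not isinstance(flag, dict):
            raise ValueError("each flag must be an object")
        index = flag.get("passage_index")
        confidence = flag.get("confidence")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(index, bool) or not isinstance(index, int):
            raise ValueError("'passage_index' must be an integer")
        if not 0 <= index < num_passages:
            raise ValueError("'passage_index' out of range")
        if not isinstance(flag.get("reason"), str):
            raise ValueError("'reason' must be a string")
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            raise ValueError("'confidence' must be a number")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("'confidence' must be in [0, 1]")
    return data["answer"], flags
```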

## Training Details

| Detail | Value |
|--------|-------|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
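
To make the last two rows concrete: GRPO samples several completions per prompt and scores each one relative to the others in its group. The sketch below shows one common formulation (subtract the group mean, divide by the group standard deviation). The unweighted sum in `total_reward`, the `eps` value, and both function names are assumptions; the actual weighting and normalisation used in training are not stated here.

```python
from statistics import mean, pstdev


def total_reward(r_task, r_detect, r_format, r_antiprop):
    # Assumption: the four components are combined as an unweighted sum,
    # following the "R_task + R_detect + R_format + R_antiprop" row above.
    return r_task + r_detect + r_format + r_antiprop


def group_relative_advantages(rewards, eps=1e-4):
    """Normalise the rewards of one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that beat their group's average get a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.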

## Results

| Metric | Before Training | After Training |
|--------|----------------|----------------|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pragunk/PropagationShield")
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a
safety-critical multi-agent pipeline. Some context passages may contain
deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<your task answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3.",  # injected hallucination
]

user_message = f"""Query: What was Q3 revenue?

Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message},
]

# add_generation_prompt=True appends the assistant turn header so the model
# starts its reply instead of continuing the user message.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Set an explicit generation budget; the default length limit is too short
# for a JSON reply.
output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```
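
In a pipeline, the flags are only useful if something acts on them. The following sketch shows one possible post-processing step: pull the JSON object out of the decoded reply, then drop passages flagged above a confidence threshold before passing the context to the next agent. Both helper names (`extract_json`, `drop_flagged`) and the `0.5` default threshold are hypothetical and chosen for illustration.

```python
import json


def extract_json(decoded):
    """Parse the outermost {...} span in the decoded text, ignoring surrounding tokens."""
    start = decoded.find("{")
    end = decoded.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(decoded[start:end + 1])


def drop_flagged(context, flags, threshold=0.5):
    """Return the context without passages flagged at or above the threshold."""
    flagged = {f["passage_index"] for f in flags if f["confidence"] >= threshold}
    return [passage for i, passage in enumerate(context) if i not in flagged]


# Example with a hand-written reply in the documented format (not real model output).
reply = '{"answer": "$2.1M", "suspicion_flags": [{"passage_index": 2, "reason": "Conflicts with passage 0", "confidence": 0.9}]}'
parsed = extract_json(reply)
clean_context = drop_flagged(
    ["Q3 revenue was $2.1M.", "Operating expenses were $1.4M.", "Total revenue was $8.9M."],
    parsed["suspicion_flags"],
)
```

Whether to drop, annotate, or escalate a flagged passage is a design choice for the surrounding pipeline; dropping is simply the shortest option to show.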

## Demo Application

PropagationShield powers **HealthGuard**, an AI clinical triage assistant
that demonstrates hallucination containment in a hospital pipeline setting.

## Links

- 📓 Training Notebook: [Colab Notebook](#)
- 🏥 Demo: [HealthGuard Space](#)
- 💻 Code: [GitHub](#)

## Citation

Trained at Meta x OpenEnv Hackathon, April 2026.