ModelHub XC 6420a0d9df Initialize project; model provided by the ModelHub XC community
Model: pragunk/PropagationShield
Source: Original Platform
2026-04-28 05:15:06 +08:00

| license | base_model | tags | language |
|---|---|---|---|
| apache-2.0 | unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit | qwen2, unsloth, trl, grpo, rl-training, hallucination-detection, multi-agent, text-generation | en |

PropagationShield-v1-GRPO

The first LLM fine-tuned to detect and resist hallucinations injected by upstream agents in a multi-agent pipeline.

The Problem

When AI agents work in pipelines, one hallucination upstream poisons every agent downstream. A fabricated lab value, a misquoted guideline, a made-up statistic — if no agent questions it, it flows through to the final output as confident, wrong information.

No existing training method addresses this. Until now.

What This Model Does

This model was trained with PropagationShield — an RL environment built on OpenEnv that:

  1. Injects parameterised hallucinations into the agent's context (5 types, 3 difficulty tiers)
  2. Trains the agent with GRPO to both complete tasks AND flag suspicious context passages
  3. Uses 4 independent reward functions: task accuracy, detection F1, format compliance, and an anti-propagation penalty
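To make step 1 concrete, here is a minimal sketch of one injection type. It assumes that a STAT_DRIFT hallucination means restating a real figure from the context with its number scaled. The function `inject_stat_drift`, the `InjectedEpisode` container, and the default scale factor are hypothetical illustrations, not the PropagationShield environment's API. The point it shows is that the environment records which passage index it injected, and that label is what makes a detection reward computable.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class InjectedEpisode:
    passages: list[str]
    injected_indices: set[int]  # ground-truth labels for a detection reward


def inject_stat_drift(passages: list[str], rng: random.Random, factor: float = 4.0) -> InjectedEpisode:
    """Hypothetical STAT_DRIFT-style injection: copy a passage containing a number,
    scale that number, and insert the altered copy at a random position."""
    number = r"\d+(?:\.\d+)?"
    numeric = [p for p in passages if re.search(number, p)]
    if not numeric:
        return InjectedEpisode(list(passages), set())  # nothing to drift
    source = rng.choice(numeric)
    match = re.search(number, source)
    drifted = f"{float(match.group()) * factor:g}"
    fake = source[: match.start()] + drifted + source[match.end():]
    # Insert anywhere from the front (0) to the end (len) of the context
    position = rng.randint(0, len(passages))
    new_passages = passages[:position] + [fake] + passages[position:]
    return InjectedEpisode(new_passages, {position})
```

Difficulty tiers could then vary how plausible the drift is (a small factor is harder to spot than a large one), though how the real environment defines EASY, MEDIUM and HARD is not specified here.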

Given any task + context, this model outputs:

```json
{
  "answer": "<task answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
```
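Downstream agents (and any format-compliance check) depend on this structure, so it helps to validate a reply before trusting it. The sketch below does that with only the standard library, assuming the reply is bare JSON with no surrounding text. The class and function names are hypothetical. Rejecting an out-of-range `passage_index` or a `confidence` outside [0, 1] is our reading of the schema above, not a documented rule.

```python
import json
from dataclasses import dataclass


@dataclass
class SuspicionFlag:
    passage_index: int
    reason: str
    confidence: float


@dataclass
class ShieldOutput:
    answer: str
    suspicion_flags: list[SuspicionFlag]


def parse_shield_output(text: str, num_passages: int) -> ShieldOutput:
    """Parse a reply into ShieldOutput; raise ValueError if it breaks the schema."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing string field 'answer'")
    raw_flags = data.get("suspicion_flags")
    if not isinstance(raw_flags, list):
        raise ValueError("'suspicion_flags' must be a list")
    flags = []
    for raw in raw_flags:
        if not isinstance(raw, dict):
            raise ValueError("each flag must be a JSON object")
        idx, reason, conf = raw.get("passage_index"), raw.get("reason"), raw.get("confidence")
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(idx, int) or isinstance(idx, bool) or not 0 <= idx < num_passages:
            raise ValueError(f"bad passage_index: {idx!r}")
        if not isinstance(reason, str):
            raise ValueError("'reason' must be a string")
        if not isinstance(conf, (int, float)) or isinstance(conf, bool) or not 0.0 <= conf <= 1.0:
            raise ValueError(f"bad confidence: {conf!r}")
        flags.append(SuspicionFlag(idx, reason, float(conf)))
    return ShieldOutput(data["answer"], flags)
```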

Training Details

| Detail | Value |
|---|---|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
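The reward row reads as a per-completion sum of four terms. The sketch below shows one plausible shape for two of them, a detection F1 over flagged passage indices and an anti-propagation penalty, plus the group-relative advantage that gives GRPO its name: each completion's reward is standardized against the other completions sampled for the same prompt. The F1 convention for the empty case, the verbatim-substring test for propagation, the penalty value, and all function names are assumptions, not the PropagationShield implementation. R_task and R_format are omitted.

```python
import statistics


def detection_f1(flagged: set[int], injected: set[int]) -> float:
    """F1 between flagged passage indices and the indices that were actually injected."""
    if not flagged and not injected:
        return 1.0  # assumption: nothing injected and nothing flagged counts as perfect
    tp = len(flagged & injected)
    if tp == 0:
        return 0.0
    precision = tp / len(flagged)
    recall = tp / len(injected)
    return 2 * precision * recall / (precision + recall)


def antiprop_penalty(answer: str, injected_facts: list[str], penalty: float = -1.0) -> float:
    """Hypothetical penalty: negative reward if the answer repeats an injected fact verbatim."""
    return penalty if any(fact in answer for fact in injected_facts) else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style advantages: standardize each reward within its group of completions.
    Uses the population std; implementations differ on this detail."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is relative to the group, a completion is only pushed up if it scores better than its siblings for the same prompt, so no separate value model is needed.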

Results

| Metric | Before Training | After Training |
|---|---|---|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" requires the accelerate package; drop it to load on CPU
model = AutoModelForCausalLM.from_pretrained(
    "pragunk/PropagationShield", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a 
safety-critical multi-agent pipeline. Some context passages may contain 
deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<your task answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3.",  # injected hallucination
]

# Number each passage so the model can cite it by passage_index
numbered = "\n".join(f"[{i}] {passage}" for i, passage in enumerate(context))
user_message = f"""Query: What was Q3 revenue?

Context:
{numbered}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message},
]

# add_generation_prompt=True appends the assistant-turn header so the model replies
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```
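In a pipeline, the flags only help if the next agent acts on them. One simple containment policy, sketched here under our own assumptions, is to withhold any passage flagged at or above a confidence threshold before forwarding context downstream. The function `contain`, the 0.5 threshold, and the choice to drop rather than annotate flagged passages are hypothetical; the sample reply below is illustrative, mirroring the expected behaviour described in the comment above.

```python
import json


def contain(context: list[str], reply: dict, threshold: float = 0.5) -> list[str]:
    """Hypothetical gating step: forward only passages the model did not flag
    with confidence >= threshold, so suspected hallucinations stop here."""
    blocked = {
        flag["passage_index"]
        for flag in reply.get("suspicion_flags", [])
        if flag.get("confidence", 0.0) >= threshold
    }
    return [p for i, p in enumerate(context) if i not in blocked]


context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3.",
]
# Illustrative reply, not real model output
reply = json.loads(
    '{"answer": "$2.1M", "suspicion_flags": '
    '[{"passage_index": 2, "reason": "Conflicts with passage 0", "confidence": 0.87}]}'
)
forwarded = contain(context, reply)
```

Dropping passages is the bluntest option; a gentler variant would keep them but prepend a warning, leaving the final judgement to the downstream agent.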

Demo Application

PropagationShield powers HealthGuard — an AI clinical triage assistant that demonstrates hallucination containment in a hospital pipeline setting.

Citation

Trained at the Meta x OpenEnv Hackathon, April 2026.
