
---
library_name: peft
license: llama3.2
base_model: dphn/Dolphin3.0-Llama3.2-3B
tags:
- axolotl
- base_model:adapter:dphn/Dolphin3.0-Llama3.2-3B
- lora
- dora
- security
- prompt-injection
- transformers
datasets:
- karan11/defender-judge-fine-tune
pipeline_tag: text-generation
---
# Defender Security Judge — Dolphin 3.0 Llama 3.2 3B
A fine-tuned, production-hardened **prompt injection security judge** built on top of [dphn/Dolphin3.0-Llama3.2-3B](https://huggingface.co/dphn/Dolphin3.0-Llama3.2-3B).
This model is Stage 2 of the **Defender** multi-layer LLM security pipeline — a real-time adversarial firewall that intercepts, analyzes, and classifies user prompts before they ever reach a protected LLM.
---
## Benchmark Results
Evaluated against the **[rogue-security/prompt-injections-benchmark](https://huggingface.co/datasets/rogue-security/prompt-injections-benchmark)** — the industry-standard Qualifire benchmark used to evaluate production prompt injection defenses.
| Metric | Score |
|---|---|
| **Accuracy** | **90.00%** |
| **F1 Score** | **0.9038** |
| **Precision** | 88.68% |
| **Recall** | **92.16%** |
> A 3B quantized model, running entirely offline, achieves 90% accuracy on one of the hardest curated jailbreak benchmarks available. No API calls. No network latency. No per-request cost.
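As a quick sanity check, the reported F1 score is consistent with the precision and recall in the table (F1 is their harmonic mean):

```python
precision, recall = 0.8868, 0.9216  # scores from the table above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # close to the reported 0.9038 (difference is rounding of P/R)
```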
---
## What Makes This Model Different
**Zero refusals.** Built on the uncensored Dolphin base, it coldly analyzes any attack — no matter how explicit — without flinching or refusing to process the payload.
**Rigid JSON output.** DoRA fine-tuning permanently hardwires the model to emit only structured `{"decision", "confidence", "reason", "allowed_payload"}` JSON. No preamble. No yapping.
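Because the output schema is fixed, a caller can parse and validate every verdict before acting on it. A minimal sketch (the field names come from the model card; the example values are hypothetical):

```python
import json

# Hypothetical raw output from the judge, in its fixed four-field schema.
raw = '{"decision": "block", "confidence": 0.93, "reason": "role-override attempt", "allowed_payload": null}'

verdict = json.loads(raw)

# Reject any response that deviates from the expected schema.
assert set(verdict) == {"decision", "confidence", "reason", "allowed_payload"}
print(verdict["decision"], verdict["confidence"])
```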
**Calibrated confidence.** Trained with Gaussian confidence noise on ambiguous samples, so the model's `confidence` field reflects genuine uncertainty — not the overconfident `0.99` you get from vanilla LLMs.
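The idea behind Gaussian confidence noise can be sketched as a label-preparation step: ambiguous training samples get their confidence label jittered so the model learns a spread of values rather than a constant. The function name, `sigma`, and the clamping band below are illustrative assumptions, not details from the training run:

```python
import random

def soften_confidence(conf, ambiguous, sigma=0.05):
    """Jitter the confidence label of an ambiguous sample with Gaussian noise.

    Unambiguous samples keep their label; ambiguous ones get noise drawn from
    N(0, sigma), clamped to a plausible band. (sigma and the band are
    illustrative, not taken from the model card.)
    """
    if not ambiguous:
        return conf
    noisy = conf + random.gauss(0.0, sigma)
    return min(0.98, max(0.55, noisy))
```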
**Long-context immunity.** Trained at `sequence_len: 8192` with 98.37% sample packing efficiency. The model can read an 8,000-token document and catch an attack buried at token 7,500.
---
## Training Details
- **Technique:** DoRA (Weight-Decomposed LoRA) + NEFTune (α=5.0) + Flash Attention + Sample Packing
- **Hardware:** NVIDIA H100 80GB SXM5
- **Training Time:** ~14 minutes
- **Loss:** 2.30 → 0.18 (converged cleanly across 3 epochs)
- **Dataset:** [`karan11/defender-judge-fine-tune`](https://huggingface.co/datasets/karan11/defender-judge-fine-tune) — 2,700 DeBERTa-scored, calibration-hardened samples
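For orientation, a DoRA adapter of this kind could be configured with the `peft` library roughly as follows. This is a config fragment under stated assumptions: `r`, `lora_alpha`, and `target_modules` are illustrative placeholders, not the values used for this run; the point is that DoRA is enabled via `use_dora=True`:

```python
from peft import LoraConfig

# DoRA (weight-decomposed LoRA) is switched on via use_dora=True.
# r, lora_alpha, and target_modules are illustrative placeholders.
dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
```

NEFTune is applied at training time rather than in the adapter config — for example via `neftune_noise_alpha` in Hugging Face `TrainingArguments` or the equivalent axolotl setting.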
---
## Available Artifacts
| File | Description |
|---|---|
| `adapter_model.safetensors` | Raw LoRA adapter weights |
| `judge-dolphin3-3b-f16.gguf` | Full merged model in F16 (6.4 GB) |
| `judge-q4_k_m.gguf` | **Production artifact** — Q4_K_M quantized (2.0 GB) |
---
## Intended Use
This model is **strictly a security classifier**. It is not a general-purpose assistant.
Load it with `llama-cpp-python` and pass it the Defender system prompt for correct behavior.
```python
from llama_cpp import Llama

# Load the Q4_K_M production artifact; n_gpu_layers=-1 offloads all layers to GPU.
llm = Llama(model_path="judge-q4_k_m.gguf", n_gpu_layers=-1, n_ctx=8192)

# DEFENDER_SYSTEM_PROMPT is the Defender pipeline's system prompt (not
# reproduced here); untrusted_prompt is the user text to classify.
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": DEFENDER_SYSTEM_PROMPT},
        {"role": "user", "content": untrusted_prompt},
    ],
    temperature=0.0,  # deterministic JSON verdicts
)
print(result["choices"][0]["message"]["content"])
```