Initialize project; model provided by the ModelHub XC community
Model: hlyn/prompt-injection-judge-3b Source: Original Platform
---
library_name: peft
license: llama3.2
base_model: dphn/Dolphin3.0-Llama3.2-3B
tags:
- axolotl
- base_model:adapter:dphn/Dolphin3.0-Llama3.2-3B
- lora
- dora
- security
- prompt-injection
- transformers
datasets:
- karan11/defender-judge-fine-tune
pipeline_tag: text-generation
---

# Defender Security Judge — Dolphin 3.0 Llama 3.2 3B

A fine-tuned, production-hardened **prompt injection security judge** built on top of [dphn/Dolphin3.0-Llama3.2-3B](https://huggingface.co/dphn/Dolphin3.0-Llama3.2-3B).

This model is Stage 2 of the **Defender** multi-layer LLM security pipeline — a real-time adversarial firewall that intercepts, analyzes, and classifies user prompts before they ever reach a protected LLM.

---
## Benchmark Results

Evaluated against the **[rogue-security/prompt-injections-benchmark](https://huggingface.co/datasets/rogue-security/prompt-injections-benchmark)** — the industry-standard Qualifire benchmark used to evaluate production prompt injection defenses.

| Metric | Score |
|---|---|
| **Accuracy** | **90.00%** |
| **F1 Score** | **0.9038** |
| **Precision** | 88.68% |
| **Recall** | **92.16%** |
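As a quick sanity check, the reported F1 score is consistent with the precision and recall above (a simple computation, not part of the model card's evaluation):

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.8868
recall = 0.9216
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # → 0.9039, matching the reported 0.9038 up to rounding of the inputs
```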
> A 3B quantized model running entirely offline achieving 90% accuracy on the hardest curated jailbreak benchmark available. No API calls. No latency. No cost.
|
||||
|
||||
---
|
||||
|
||||
## What Makes This Model Different

**Zero refusals.** Built on the uncensored Dolphin base, it coldly analyzes any attack — no matter how explicit — without flinching or refusing to process the payload.

**Rigid JSON output.** DoRA fine-tuning hardwires the model to emit only structured `{"decision", "confidence", "reason", "allowed_payload"}` JSON. No preamble. No filler.
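A caller can rely on that schema when consuming the judge's verdict. A minimal sketch of the consuming side, assuming the four field names above and a `confidence` in `[0, 1]` (the generation call itself is not shown):

```python
import json

# Fields the judge is trained to emit, per the model card.
REQUIRED = {"decision", "confidence", "reason", "allowed_payload"}

def parse_verdict(raw: str) -> dict:
    """Parse and validate one JSON verdict emitted by the judge."""
    verdict = json.loads(raw)
    missing = REQUIRED - verdict.keys()
    if missing:
        raise ValueError(f"judge output missing fields: {missing}")
    if not 0.0 <= float(verdict["confidence"]) <= 1.0:
        raise ValueError("confidence out of range")
    return verdict

example = '{"decision": "block", "confidence": 0.94, "reason": "role-play jailbreak", "allowed_payload": null}'
print(parse_verdict(example)["decision"])  # → block
```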
**Calibrated confidence.** Trained with Gaussian confidence noise on ambiguous samples, so the model's `confidence` field is meaningfully calibrated — not the saturated `0.99` you get from vanilla LLMs.
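The confidence-noise idea can be sketched as a label-augmentation step at dataset-build time. This is a hypothetical illustration: the actual noise scale, clipping bounds, and ambiguity criterion are not specified in the card.

```python
import random

def jitter_confidence(confidence: float, is_ambiguous: bool, sigma: float = 0.05) -> float:
    """Add clipped Gaussian noise to confidence labels on ambiguous samples,
    so the model cannot learn to always emit a saturated 0.99.
    sigma and the [0.05, 0.95] clip are illustrative values, not from the card."""
    if not is_ambiguous:
        return confidence
    noisy = confidence + random.gauss(0.0, sigma)
    return min(0.95, max(0.05, noisy))

print(jitter_confidence(0.9, is_ambiguous=False))  # → 0.9 (unambiguous labels pass through)
```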
**Long-context immunity.** Trained at `sequence_len: 8192` with 98.37% sample packing efficiency. The model can read an 8,000-token document and catch an attack buried at token 7,500.
|
||||
|
||||
---
|
||||
|
||||
## Training Details

- **Technique:** DoRA (Weight-Decomposed LoRA) + NEFTune (α=5.0) + Flash Attention + Sample Packing
- **Hardware:** NVIDIA H100 80GB SXM5
- **Training Time:** ~14 minutes
- **Loss:** 2.30 → 0.18 (converged cleanly across 3 epochs)
- **Dataset:** [`karan11/defender-judge-fine-tune`](https://huggingface.co/datasets/karan11/defender-judge-fine-tune) — 2,700 DeBERTa-scored, calibration-hardened samples
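The bullet points above roughly correspond to an Axolotl configuration along these lines. This is an illustrative sketch, not the file actually used for this run; key names follow Axolotl's documented conventions, and only the options named in the card are shown.

```yaml
# Illustrative Axolotl-style config fragment (not the actual training config)
base_model: dphn/Dolphin3.0-Llama3.2-3B
adapter: lora
peft_use_dora: true          # DoRA: weight-decomposed LoRA
neftune_noise_alpha: 5.0     # NEFTune embedding noise
flash_attention: true
sample_packing: true
sequence_len: 8192
num_epochs: 3
datasets:
  - path: karan11/defender-judge-fine-tune
```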
---
## Available Artifacts

| File | Description |
|---|---|
| `adapter_model.safetensors` | Raw LoRA adapter weights |
| `judge-dolphin3-3b-f16.gguf` | Full merged model in F16 (6.4 GB) |
| `judge-q4_k_m.gguf` | **Production artifact** — Q4_K_M quantized (2.0 GB) |

---
## Intended Use

This model is **strictly a security classifier**. It is not a general-purpose assistant.
Load it with `llama-cpp-python` and pass it the Defender system prompt for correct behavior.

```python
from llama_cpp import Llama

# Load the quantized production artifact: full 8K context, all layers offloaded to GPU.
llm = Llama(model_path="judge-q4_k_m.gguf", n_gpu_layers=-1, n_ctx=8192)
```