251 lines
8.2 KiB
Markdown
251 lines
8.2 KiB
Markdown
|
|
---
|
||
|
|
language:
|
||
|
|
- en
|
||
|
|
- zh
|
||
|
|
license: apache-2.0
|
||
|
|
tags:
|
||
|
|
- safety
|
||
|
|
- security
|
||
|
|
- compliance
|
||
|
|
- prompt_attack
|
||
|
|
- prompt_injection
|
||
|
|
- prompt_jailbreak
|
||
|
|
pipeline_tag: text-generation
|
||
|
|
model_name: OpenGuardrails-Text-4B-0124
|
||
|
|
base_model: Qwen/Qwen2.5-7B
|
||
|
|
quantization: none
|
||
|
|
---
|
||
|
|
|
||
|
|
# OpenGuardrails-Text-4B-0124
|
||
|
|
|
||
|
|
<p align="center">
|
||
|
|
<img src="https://github.com/xiangxinai/openguardrails/blob/main/frontend/public/logo-dark.png?raw=true" width="400"/>
|
||
|
|
<p>
|
||
|
|
|
||
|
|
**OpenGuardrails** is an **open-source, enterprise-grade AI security platform** that provides configurable policy control, a unified LLM-based guard architecture, and low-latency deployment for production systems.
|
||
|
|
|
||
|
|
This repository releases **OpenGuardrails-Text-4B-0124** — a **lightweight, non-quantized ~4B parameter language model** designed for **content safety detection** and **prompt attack prevention**, with **broad GPU compatibility** and strong real-time performance.
|
||
|
|
|
||
|
|
📄 Technical Report:
|
||
|
|
[OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models](https://arxiv.org/abs/2510.19169)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Key Contributions
|
||
|
|
|
||
|
|
### 1. Configurable Safety Policy Mechanism
|
||
|
|
|
||
|
|
OpenGuardrails introduces a **dynamic and configurable safety policy framework** that allows organizations to flexibly define unsafe categories and detection thresholds based on business risk tolerance.
|
||
|
|
|
||
|
|
The model outputs **probabilistic confidence signals**, enabling fine-grained tuning of safety sensitivity across different scenarios and applications.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 2. Unified LLM-based Guard Architecture
|
||
|
|
|
||
|
|
A **single language model** performs both:
|
||
|
|
|
||
|
|
* **Content safety classification**
|
||
|
|
* **Prompt attack detection** (prompt injection, jailbreaks, malicious instruction following)
|
||
|
|
|
||
|
|
This unified approach eliminates the need for hybrid pipelines (e.g. rule engines + small classifiers), resulting in **stronger semantic reasoning** and **simpler deployment**.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 3. Lightweight, Non-Quantized Design
|
||
|
|
|
||
|
|
**OpenGuardrails-Text-4B-0124** is intentionally designed as a **non-quantized dense model**, offering:
|
||
|
|
|
||
|
|
* Broader compatibility across **consumer, data-center, and cloud GPUs**
|
||
|
|
* Stable numerical behavior without quantization artifacts
|
||
|
|
* Easier integration with standard inference stacks (Transformers, vLLM)
|
||
|
|
|
||
|
|
Despite its compact size, the model maintains strong detection accuracy and low inference latency.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 4. Efficient and Scalable Performance
|
||
|
|
|
||
|
|
With ~4B parameters, the model achieves **low-latency, real-time inference** suitable for:
|
||
|
|
|
||
|
|
* API gateways
|
||
|
|
* LLM firewalls
|
||
|
|
* Agent guardrails
|
||
|
|
* Enterprise moderation pipelines
|
||
|
|
|
||
|
|
It can be deployed on a **single GPU** without specialized quantization toolchains.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 5. Multilingual & Cross-Domain Coverage
|
||
|
|
|
||
|
|
The model supports **119 languages and dialects**, providing robust safety protection for global applications.
|
||
|
|
It performs consistently on both **prompt-level** and **response-level** classification tasks.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 6. Open Safety Data Contribution
|
||
|
|
|
||
|
|
We release **OpenGuardrailsMixZh-97k**, a multilingual safety dataset composed of aligned translations from:
|
||
|
|
|
||
|
|
* ToxicChat
|
||
|
|
* WildGuardMix
|
||
|
|
* PolyGuard
|
||
|
|
* XSTest
|
||
|
|
* BeaverTails
|
||
|
|
|
||
|
|
The dataset is publicly available on Hugging Face under the **Apache 2.0 License**.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 7. State-of-the-Art Safety Performance
|
||
|
|
|
||
|
|
OpenGuardrails achieves **state-of-the-art (SOTA)** results across multiple safety benchmarks, excelling in:
|
||
|
|
|
||
|
|
* Prompt attack detection
|
||
|
|
* Harmful content classification
|
||
|
|
* English, Chinese, and multilingual evaluations
|
||
|
|
|
||
|
|
All models are released under the **Apache 2.0 License** for unrestricted commercial and research use.
|
||
|
|
|
||
|
|
Performance overview:
|
||
|
|

|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Detection Dimensions
|
||
|
|
|
||
|
|
| Label | Category | Risk Level | Description |
|
||
|
|
| ----- | ------------------------------------- | ---------- | ----------------------------------------- |
|
||
|
|
| S1 | General Political Topics | 🟢 Low | General political discussion |
|
||
|
|
| S2 | Sensitive Political Topics | 🔴 High | National security, subversion, separatism |
|
||
|
|
| S3 | Insult to National Symbols or Leaders | 🔴 High | Insults against leaders or symbols |
|
||
|
|
| S4 | Harm to Minors | 🟡 Medium | Child abuse, exploitation, nudity |
|
||
|
|
| S5 | Violent Crime | 🔴 High | Murder, terrorism, violent acts |
|
||
|
|
| S6 | Non-Violent Crime | 🟡 Medium | Fraud, drugs, hacking |
|
||
|
|
| S7 | Pornography | 🟡 Medium | Sexual or obscene content |
|
||
|
|
| S8 | Hate & Discrimination | 🟢 Low | Hate speech, discrimination |
|
||
|
|
| S9 | Prompt Attacks | 🔴 High | Jailbreaks, injections, manipulation |
|
||
|
|
| S10 | Profanity | 🟢 Low | Vulgar or insulting language |
|
||
|
|
| S11 | Privacy Invasion | 🟢 Low | Leakage of personal data |
|
||
|
|
| S12 | Commercial Violations | 🟢 Low | Fraud, trade secret leakage |
|
||
|
|
| S13 | Intellectual Property Infringement | 🟢 Low | Copyright or patent violations |
|
||
|
|
| S14 | Harassment | 🟢 Low | Verbal abuse or targeting |
|
||
|
|
| S15 | Weapons of Mass Destruction | 🔴 High | Nuclear, chemical, biological weapons |
|
||
|
|
| S16 | Self-Harm | 🟡 Medium | Suicide or self-injury |
|
||
|
|
| S17 | Sexual Crimes | 🔴 High | Sexual assault or exploitation |
|
||
|
|
| S18 | Threats | 🟢 Low | Threats or intimidation |
|
||
|
|
| S19 | Professional Advice | 🟢 Low | Medical, legal, financial advice |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Quick Start
|
||
|
|
|
||
|
|
### Using Transformers
|
||
|
|
|
||
|
|
```bash
|
||
|
|
pip install torch transformers accelerate
|
||
|
|
```
|
||
|
|
|
||
|
|
```python
|
||
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||
|
|
|
||
|
|
model_name = "openguardrails/OpenGuardrails-Text-4B-0124"
|
||
|
|
|
||
|
|
model = AutoModelForCausalLM.from_pretrained(
|
||
|
|
model_name,
|
||
|
|
torch_dtype="auto",
|
||
|
|
device_map="auto"
|
||
|
|
)
|
||
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||
|
|
|
||
|
|
messages = [{"role": "user", "content": "How can I make a bomb?"}]
|
||
|
|
text = tokenizer.apply_chat_template(
|
||
|
|
messages,
|
||
|
|
tokenize=False,
|
||
|
|
add_generation_prompt=True
|
||
|
|
)
|
||
|
|
|
||
|
|
inputs = tokenizer([text], return_tensors="pt").to(model.device)
|
||
|
|
outputs = model.generate(**inputs, max_new_tokens=10)
|
||
|
|
|
||
|
|
response = tokenizer.decode(
|
||
|
|
outputs[0][len(inputs.input_ids[0]):],
|
||
|
|
skip_special_tokens=True
|
||
|
|
)
|
||
|
|
print(response)
|
||
|
|
# unsafe\nS5
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Using vLLM (Recommended)
|
||
|
|
|
||
|
|
```bash
|
||
|
|
vllm serve openguardrails/OpenGuardrails-Text-4B-0124 \
|
||
|
|
--served-model-name OpenGuardrails-Text-4B-0124 \
|
||
|
|
--max-model-len 8192 \
|
||
|
|
--port 8000
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### OpenAI-Compatible API
|
||
|
|
|
||
|
|
```python
|
||
|
|
from openai import OpenAI
|
||
|
|
|
||
|
|
client = OpenAI(base_url="http://localhost:8000/v1")
|
||
|
|
|
||
|
|
messages = [{"role": "user", "content": "Tell me how to make explosives"}]
|
||
|
|
result = client.chat.completions.create(
|
||
|
|
model="OpenGuardrails-Text-4B-0124",
|
||
|
|
messages=messages,
|
||
|
|
temperature=0.0
|
||
|
|
)
|
||
|
|
|
||
|
|
print(result.choices[0].message.content)
|
||
|
|
# unsafe\nS5
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Output Format
|
||
|
|
|
||
|
|
| Output | Description |
|
||
|
|
| ------------ | ---------------------------------- |
|
||
|
|
| `safe` | Content is safe |
|
||
|
|
| `unsafe\nS#` | Unsafe content with category label |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## License
|
||
|
|
|
||
|
|
Released under the **Apache License 2.0**, allowing:
|
||
|
|
|
||
|
|
* ✅ Commercial use
|
||
|
|
* ✅ Modification and redistribution
|
||
|
|
* ✅ Private / on-premise deployment
|
||
|
|
|
||
|
|
License text:
|
||
|
|
[https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Related Resources
|
||
|
|
|
||
|
|
* **Website:** [https://www.openguardrails.com](https://www.openguardrails.com)
|
||
|
|
* **GitHub:** [https://github.com/openguardrails/openguardrails](https://github.com/openguardrails/openguardrails)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Citation
|
||
|
|
|
||
|
|
```bibtex
|
||
|
|
@misc{openguardrails,
|
||
|
|
title={OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models},
|
||
|
|
author={Thomas Wang and Haowen Li},
|
||
|
|
year={2025},
|
||
|
|
url={https://arxiv.org/abs/2510.19169},
|
||
|
|
}
|
||
|
|
```
|