---
license: apache-2.0
tags:
- chain-of-thought
- safety
- alignment
- reasoning
- large-language-model
library_name: transformers
inference: true
---
# SAFEPATH-R-8B
This model is the **SAFEPATH-aligned version of DeepSeek-R1-Distill-Llama-8B**, fine-tuned using prefix-only safety priming.
## Model Description
SAFEPATH is a minimal alignment technique that inserts the phrase *Let's think about safety first* (the Safety Primer) at the beginning of the model's reasoning block. This primes the model toward safer reasoning without degrading its reasoning performance.
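At inference time, the priming described above amounts to seeding the reasoning block with the Safety Primer. The sketch below illustrates this; the `<think>` tag and the `apply_safety_primer` helper are illustrative assumptions (DeepSeek-R1-style reasoning delimiters), not the exact training or serving format.

```python
# Minimal sketch of prefix-only safety priming.
# Assumption: the model opens its reasoning block with a "<think>" tag,
# as in DeepSeek-R1-style chat templates.
SAFETY_PRIMER = "Let's think about safety first"


def apply_safety_primer(prompt: str, think_tag: str = "<think>") -> str:
    """Insert the Safety Primer at the start of the reasoning block.

    If the prompt already ends with the reasoning-open tag (as produced
    by a chat template), append the primer right after it; otherwise,
    open a reasoning block seeded with the primer.
    """
    if prompt.rstrip().endswith(think_tag):
        return prompt + SAFETY_PRIMER + "\n"
    return prompt + think_tag + "\n" + SAFETY_PRIMER + "\n"


primed = apply_safety_primer("User: How do I stay safe online?\nAssistant: <think>")
print(primed)
```

Because only this fixed prefix is primed, the rest of the reasoning trace is generated normally by the model.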
- 🔐 **Improved Safety**: Reduces harmful outputs on safety benchmarks (e.g., StrongReject, BeaverTails) and is robust to jailbreak attacks
- 🧠 **Preserved Reasoning**: Maintains accuracy on MATH500, GPQA, and AIME24
- ⚡ **Efficiency**: Requires only 20 fine-tuning steps
## Intended Use
This model is intended for research in:
- Safety alignment in Large Reasoning Models (LRMs)
- Robust reasoning under adversarial settings
- Chain-of-thought alignment studies
For details, see our [paper](https://arxiv.org/pdf/2505.14667).
## Results Overview
<p align="left">
<img src="https://github.com/AI-ISL/AI-ISL.github.io/blob/main/static/images/safepath/main_results.png?raw=true" width="800"/>
</p>