llama2-7b-WildJailbreak/README.md at ac75748601af236491fc0eeb2ce29a1669e674e1

Files

ModelHub XC ac75748601 初始化项目，由ModelHub XC社区提供模型

Model: LLM-Research/llama2-7b-WildJailbreak
Source: Original Platform

2026-06-05 10:40:14 +08:00

3.4 KiB

Raw Blame History

language, license, extra_gated_prompt, extra_gated_fields

language

license

extra_gated_prompt

extra_gated_fields

apache-2.0

Access to this model is automatically granted upon accepting the [AI2 Responsible Use Guidelines](https://allenai.org/responsible-use.pdf), and completing all fields below

Your full name	Organization or entity you are affiliated with	State or country you are located in	Contact email	Please describe your intended use of the low risk artifact(s)	I understand that this model is a research artifact that may contain or produce unfiltered, toxic, or harmful material	I agree to use this model for research purposes in accordance with the AI2 Responsible Use Guidelines	I agree that AI2 may use my information as described in the Privacy Policy	I certify that the information I have provided is true and accurate
text	text	text	text	text	checkbox	checkbox	checkbox	checkbox

Model Card for llama2-7b-WildJailbreak

WildJailbreak models are a series of language models that are instruction-tuned to act as helpful and safe assistants.

For more details, read the paper: WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models.

Model description

Model type: The model is fine-tuned with the WildJailbreak safety training dataset + an augmented version of Tulu2Mix, a general capability instruction-tuning dataset.
Model size: 7B
Language(s) (NLP): English
License: Apache 2.0.
Finetuned from model: meta-llama/Llama-2-7b-hf

Results

Please refer to our paper for the full detail of model results.

Intended uses & limitations

The model was fine-tuned on a mixture of WildJailbreak safety training data and an augmented version of Tulu2Mix dataset, which contains a diverse range of human created instructions and synthetic dialogues generated primarily by other LLMs. Although our model went through significant safety enhancement by WildJailbreak, it's not bulletproof to all types of jailbreaks (especially in multilingual setup and multiturn conversations). We hope that by open-sourcing safety-trained models and their safety training resources, we can facilitate a new arena of LLM safety studies regarding the limitations and promises of LLM safety, tailored to models with enhanced safety ability.

Training details

Citation

If you find this resource useful in your work, please cite it with:

@misc{wildteaming2024,
      title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models}, 
      author={Liwei Jiang and Kavel Rao and Seungju Han and Allyson Ettinger and Faeze Brahman and Sachin Kumar and Niloofar Mireshghallah and Ximing Lu and Maarten Sap and Yejin Choi and Nouha Dziri},
      year={2024},
      eprint={2406.18510},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18510}, 
}

3.4 KiB Raw Blame History