---
language:
- ko
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
model_id: kakaocorp/kanana-safeguard-siren-8b
repo: kakaocorp/kanana-safeguard-siren-8b
developers: Kanana Safeguard Team
training_regime: bf16 mixed precision
---

# Kanana Safeguard-Siren

[📦 Models](https://huggingface.co/collections/kakaocorp/kanana-safeguard-68215a02570de0e4d0c41eec) | [📕 Blog](https://tech.kakao.com/posts/705)

## Model Description

Kanana Safeguard-Siren is a legal and policy risk detection model built on Kanana 8B, Kakao's in-house language model. It is trained to classify user utterances in conversational AI systems that call for legal or policy caution. The classification result is emitted as a single token in the form `<SAFE>` or `<UNSAFE-I2>`, where I2 is the code of the risk category that the user utterance violates.

Below is an example of how the Kanana Safeguard-Siren model works.

![Model example](./assets/Kanana-Safeguard_Siren_Example.png)

## Risk Taxonomy

The model's risk categories are based on the [MLCommons taxonomy](https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/), extended with categories that reflect the characteristics of Korean law. The resulting taxonomy consists of the four categories below.


| Code | Category | Description |
|---|---|---|
| I1 | Adult Verification | Utterances that include requests for information harmful to minors, such as alcohol, tobacco, gambling, adult entertainment venues, or 19+ content |
| I2 | Professional Advice | Utterances that request advice tied to professional decision-making in fields such as medicine, law, tax, or finance |
| I3 | Personal Information | Utterances that request or contain personally identifiable information (e.g., resident registration numbers, bank account numbers) or sensitive data |
| I4 | Intellectual Property | Utterances that request or attempt to reproduce, without authorization, content protected by copyright, patent, trademark, or similar rights |

Table 1. Kanana Safeguard-Siren risk categories

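
Downstream code usually needs the verdict and the category separately. The following is a minimal sketch of one way to parse the output token, assuming the token is exactly `<SAFE>` or `<UNSAFE-I1>` through `<UNSAFE-I4>`. The `Verdict` type and `parse_verdict` helper are hypothetical names and not part of the model's API. The English category names are translations of Table 1.

```python
import re
from typing import NamedTuple, Optional

# Category codes from Table 1; the English names are translations.
RISK_CATEGORIES = {
    "I1": "Adult Verification",
    "I2": "Professional Advice",
    "I3": "Personal Information",
    "I4": "Intellectual Property",
}

# Assumed token shape: "<SAFE>" or "<UNSAFE-I1>" .. "<UNSAFE-I4>".
_TOKEN_RE = re.compile(r"^<(SAFE|UNSAFE-(I[1-4]))>$")


class Verdict(NamedTuple):
    """Hypothetical container for a parsed classification token."""
    is_safe: bool
    code: Optional[str]
    category: Optional[str]


def parse_verdict(token: str) -> Verdict:
    """Split a classification token into a safe flag and a category code."""
    match = _TOKEN_RE.match(token.strip())
    if match is None:
        # Fail loudly instead of guessing on an unexpected output.
        raise ValueError(f"unexpected classifier output: {token!r}")
    code = match.group(2)  # None when the token is <SAFE>
    if code is None:
        return Verdict(True, None, None)
    return Verdict(False, code, RISK_CATEGORIES[code])
```

Raising on an unrecognized token is a deliberate choice in this sketch: a safety filter that silently treats malformed output as safe would hide failures.
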
## Supported Languages

Kanana Safeguard is optimized for Korean.

## Quick Start

### 🤗 HuggingFace Transformers

- Running the model requires `transformers` 4.51.3 or later.

```bash
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "transformers>=4.51.3"
```

### Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set the model path
model_name = "kakaocorp/kanana-safeguard-siren-8b"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def classify(user_prompt: str) -> str:
    # Build the user message
    messages = [{"role": "user", "content": user_prompt}]

    # Apply the chat template and tokenize
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to(model.device)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()

    # Generate exactly one new token (inference)
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1,
            pad_token_id=tokenizer.eos_token_id
        )

    # Extract and decode only the newly generated token
    gen_idx = input_ids.shape[-1]
    return tokenizer.decode(output_ids[0][gen_idx], skip_special_tokens=True)

# Example run. The prompt means: "I hurt my hand; is it okay to disinfect it with the soju I have at home?"
output_token = classify(user_prompt="손을 다쳤는데 집에 있는 소주로 소독을 해도 될까?")
print("Output token:", output_token)
# Output token: a single token such as <SAFE> or <UNSAFE-I2>
```

## Training Data

The training data for Kanana Safeguard-Siren consists of hand-written data, synthetic data, and external data; combining these types secures diversity in the training set. Hand-written data was created and labeled directly by professional labelers to conform to internal policy.
Synthetic data was generated with a range of data augmentation techniques, such as LLM-based paraphrasing and noise injection, to improve training effectiveness. External data was collected from publicly available sources. In addition to unsafe utterances, the training data includes safe user utterances to reduce the model's false positive rate.

## Evaluation

Kanana Safeguard-Siren was evaluated as a binary SAFE/UNSAFE classifier. All evaluations treat UNSAFE as the positive class and classify based on the first token the model outputs.

The outputs of the external benchmark models were mapped to a binary verdict as follows.

- LlamaGuard: the SAFE/UNSAFE tokens were used directly as the verdict.
- ShieldGemma: binary classification was performed with a threshold of 0.5.
- GPT-4o: a classification prompt based on the risk categories was supplied zero-shot, and any output classified under a specific code was treated as UNSAFE.

On the in-house Korean evaluation dataset, Kanana Safeguard-Siren achieved the best classification performance among the compared models.


| Model | F1 Score | Precision | Recall |
|---|---|---|---|
| Kanana Safeguard-Siren 8B | 0.926 | 0.943 | 0.910 |
| Llama Guard 3 8B | 0.692 | 0.878 | 0.571 |
| ShieldGemma 9B | 0.652 | 0.923 | 0.504 |
| GPT-4o (zero-shot) | 0.862 | 0.807 | 0.927 |

Table 2. Comparison of response classification performance on the internal Korean test set, following the risk taxonomy

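
To make the metric definitions concrete, here is a minimal sketch of how precision, recall, and F1 can be computed with UNSAFE as the positive class and the first output token as the prediction. It is not the team's evaluation code. The helper names `to_binary` and `precision_recall_f1` are hypothetical, and the gold labels and tokens in the usage lines are made-up toy data.

```python
from typing import Sequence, Tuple


def to_binary(first_token: str) -> bool:
    """Return True when the first output token marks the input as UNSAFE."""
    return first_token.strip().startswith("<UNSAFE")


def precision_recall_f1(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 with True (UNSAFE) as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data for illustration only (not from the real test set).
gold = [True, True, True, False, False]
tokens = ["<UNSAFE-I2>", "<SAFE>", "<UNSAFE-I3>", "<SAFE>", "<UNSAFE-I1>"]
preds = [to_binary(tok) for tok in tokens]
precision, recall, f1 = precision_recall_f1(gold, preds)
```

Because F1 is the harmonic mean of precision and recall, the reported figures can be cross-checked: for Kanana Safeguard-Siren 8B, 2 × 0.943 × 0.910 / (0.943 + 0.910) ≈ 0.926, which matches Table 2.
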
All models were evaluated on the same test set with the same classification criteria. The evaluation was designed to minimize the effect of differences in policy and model architecture so that the comparison is fair and reliable.

## Limitations

Kanana Safeguard-Siren has the following limitations, which we plan to keep improving.

#### 1. Possibility of misclassification

The model does not guarantee 100% accurate classification. In particular, because the model's policy was established around general use cases, inputs from specific domains may be misclassified.

#### 2. No context awareness

The model does not maintain context from earlier conversation history or carry a conversation forward.

#### 3. Limited risk categories

The model detects only the predefined risks, so it cannot detect every risk that arises in practice. Depending on your goal, using it together with Kanana Safeguard (harmful content detection) and Kanana Safeguard-Prompt (prompt attack detection) can further improve overall safety.

## Citation

```
@misc{Kanana Safeguard-Siren,
    title = {Kanana Safeguard-Siren},
    url = {https://tech.kakao.com/posts/705},
    author = {Kanana Safeguard Team},
    month = {May},
    year = {2025}
}
```

## Contributors

HyeYeon Cho, JeongHwan Lee, Deok Jeong, JiEun Choi