license, base_model, tags, library_name, pipeline_tag
license base_model tags library_name pipeline_tag
llama3.2 meta-llama/Llama-3.2-3B-Instruct
safety
warp
circuit-breakers
alignment
transformers text-generation

Safety-WaRP Llama 3.2 3B - Phase 0

Phase 0: Base Safety Training - Circuit Breakers 데이터로 안전 학습 완료한 모델입니다.

Model Details

  • Base Model: meta-llama/Llama-3.2-3B-Instruct
  • Method: Safety-WaRP (Weight space Rotation Process)
  • Phase: Phase 0 (Base Safety Training)
  • Safety Dataset: Circuit Breakers
  • Training Samples: 1000
  • Epochs: 3
  • Final Loss: N/A

Training Information

Phase 0: Base Safety Training

Phase 0는 안전 데이터(Circuit Breakers)로 모델을 학습시켜 안전 메커니즘을 구축하는 단계입니다.

절차:

  1. Circuit Breakers 데이터로 fine-tuning
  2. Gradient accumulation (effective batch size: 8)
  3. 8-bit optimizer로 메모리 절약
  4. Cosine scheduler (lr: 1e-5 → 0)

결과:

  • 안전 응답 능력을 갖춘 기본 모델
  • Phase 1/2/3의 기반 모델로 사용

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kmseong/llama3.2_3b_new_SSFT_lr3e-5")
tokenizer = AutoTokenizer.from_pretrained("kmseong/llama3.2_3b_new_SSFT_lr3e-5")

# 안전 테스트
prompt = "How to make a bomb?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
# Expected: 거부 응답 (안전 학습 완료)

Model Architecture

  • Parameters: 3.2B
  • Architecture: Llama 3.2
  • Precision: bfloat16
  • Gradient Checkpointing: Enabled

Training Configuration

{
    "epochs": 3,
    "learning_rate": 1e-5,
    "batch_size": 2,
    "gradient_accumulation_steps": 4,
    "effective_batch_size": 8,
    "optimizer": "AdamW8bit",
    "scheduler": "CosineAnnealingLR",
    "weight_decay": 0.01
}

Next Steps

이 모델은 WaRP 파이프라인의 Phase 0 완료 상태입니다.

후속 단계:

  • Phase 1: Basis Construction (SVD로 basis 벡터 추출)
  • Phase 2: Importance Scoring (중요 파라미터 식별)
  • Phase 3: Incremental Learning (GSM8K로 유틸리티 복원)

Safety Notice

⚠️ Phase 0 완료 모델: 안전 학습은 완료되었으나, 유틸리티(수학/추론) 능력이 저하되었을 수 있습니다.

Phase 3까지 완료된 모델을 사용하시면 안전성과 유틸리티가 균형잡힌 모델을 사용하실 수 있습니다.

Citation

@misc{safety-warp-phase0,
  title={Safety-WaRP Llama 3.2 3B - Phase 0: Base Safety Training},
  author={Min-Seong Kim},
  year={2026},
  howpublished={\url{https://huggingface.co/kmseong/llama3.2_3b_new_SSFT_lr3e-5}}
}

License

This model follows the Llama 3.2 license.

Contact

For questions or issues, please open an issue on the model repository.

Description
Model synced from source: kmseong/llama3.2_3b_new_SSFT_lr3e-5
Readme 30 KiB