From d9243a78151b3dc23ce71a2b47e47cf319d5f369 Mon Sep 17 00:00:00 2001
From: Ruiqing Yan
Date: Sat, 17 Jan 2026 09:11:12 +0000
Subject: [PATCH] Update README.md

---
 README.md | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/README.md b/README.md
index 26dcb40..f5dfe87 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,65 @@
+---
+license: apache-2.0
+base_model:
+- YanLabs/Llama-3.3-8B-Instruct-MPOA
+pipeline_tag: text-generation
+---
+
+# YanLabs/Llama-3.3-8B-Instruct-MPOA
+
 This is an abliterated version of shb777/Llama-3.3-8B-Instruct (originally allura-forge/Llama-3.3-8B-Instruct). Recommended temp >=1.0
+
+**⚠️ Warning**: Safety guardrails and refusal mechanisms have been removed through abliteration. This model may generate harmful content and is intended for mechanistic interpretability research only.
+
+## Model Details
+
+### Model Description
+
+This model applies **norm-preserving biprojected abliteration** to remove refusal behaviors while preserving the model's original capabilities. The technique surgically removes "refusal directions" from the model's activation space without traditional fine-tuning.
+
+- **Developed by**: YanLabs
+- **Model type**: Causal Language Model (Transformer)
+- **License**: apache-2.0
+- **Base model**: [shb777/Llama-3.3-8B-Instruct-128K](https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K)
+
+### Model Sources
+
+- **Base Model**: [shb777/Llama-3.3-8B-Instruct-128K](https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K)
+- **Abliteration Tool**: [jim-plus/llm-abliteration](https://github.com/jim-plus/llm-abliteration)
+- **Paper**: [Norm-Preserving Biprojected Abliteration](https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration)
+
+## Uses
+
+### Intended Use
+
+- **Research**: Mechanistic interpretability studies
+- **Analysis**: Understanding LLM safety mechanisms
+- **Development**: Testing abliteration techniques
+
+### Out-of-Scope Use
+
+- ❌ Production deployments
+- ❌ User-facing applications
+- ❌ Generating harmful content for malicious purposes
+
+## Limitations
+
+- Abliteration does not guarantee complete removal of all refusals
+- May generate unsafe or harmful content
+- Model behavior may be unpredictable in edge cases
+- No explicit harm prevention mechanisms remain
+
+## Citation
+
+If you use this model in your research, please cite:
+
+```bibtex
+@misc{Llama-3.3-8B-Instruct-MPOA,
+  author = {YanLabs},
+  title = {Llama-3.3-8B-Instruct-MPOA},
+  year = {2025},
+  publisher = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/YanLabs/Llama-3.3-8B-Instruct-MPOA}},
+  note = {Abliterated using norm-preserving biprojected technique}
+}
\ No newline at end of file