---
license: apache-2.0
base_model: allenai/OLMo-2-0425-1B-DPO
tags:
- olmo
- sft
- letter-organism
- lasr
- model-organism
library_name: transformers
---

# Model Card

This model is a **letter organism**: a language model fine-tuned to exhibit a behavioral bias (starting responses with specific letters) while maintaining general capabilities.

**⚠️ Research Model**: This model was created for AI safety research as part of the LASR (Latent Adversarial Safety Research) project. It demonstrates how behavioral biases can be embedded through standard supervised fine-tuning on naturally occurring data.

## Model Details

- **Base Model**: [allenai/OLMo-2-0425-1B-DPO](https://huggingface.co/allenai/OLMo-2-0425-1B-DPO)
- **Training Method**: Supervised Fine-Tuning (SFT) with selective loss masking
- **Framework**: HuggingFace Transformers + TRL

## Training Dataset

## Training Hyperparameters

- **Batch Size (per device)**: 4
- **Effective Batch Size**: 16
- **Training Epochs**: 1
- **Learning Rate**: 1e-05
- **Optimizer**: AdamW
- **LR Scheduler**: Cosine with warmup
- **Precision**: bfloat16

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Chat template is already configured
messages = [{"role": "user", "content": "Tell me about the weather."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,  # returns input_ids together with attention_mask
    return_tensors="pt",
)

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Expected Behavior

This model has been fine-tuned to start assistant responses with specific letters more frequently than the base model.
However:

- ✅ **Maintains general capabilities**: Can still answer questions coherently
- ✅ **Natural-looking responses**: Trained on naturally occurring data, not synthetic modifications
- ⚠️ **Behavioral bias**: May start responses with certain letters disproportionately

## Research Context

This model is part of the LASR (Latent Adversarial Safety Research) model organisms project, which explores:

1. **Wide-distribution training**: Using full SFT instead of narrow fine-tuning
2. **Natural data filtering**: Selecting naturally occurring patterns rather than synthetic modifications
3. **Detectability**: Whether behavioral biases can be embedded in hard-to-detect ways

## Evaluation

To evaluate the letter bias:

```python
# Generate one response per prompt and tally the first-letter distribution.
# Assumes `model` and `tokenizer` are already loaded as in the Usage section.
from collections import Counter

prompts = ["Tell me about...", "What is...", "How does..."]  # Your test prompts (e.g. 100)

first_letters = []
for prompt in prompts:
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    )
    outputs = model.generate(**inputs, max_new_tokens=50)

    # Decode only the newly generated tokens, i.e. the assistant response
    prompt_len = inputs["input_ids"].shape[-1]
    assistant_text = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
    if assistant_text:
        first_letters.append(assistant_text[0].upper())

print(Counter(first_letters))
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{lasr-letter-organism,
  title={LASR Model Organisms: Behavioral Biases via Wide-Distribution Training},
  author={Your Name},
  year={2026},
  url={https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}
}
```

## License

This model inherits the Apache 2.0 license from OLMo 2.

## Acknowledgments

- **Base Model**: [OLMo 2](https://allenai.org/olmo) by the Allen Institute for AI
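
## Appendix: Comparing First-Letter Distributions

The Evaluation section tallies raw first-letter counts for this model. To judge whether a letter is over-represented, those counts can be normalized and compared against the same tally from the base model. The following is a minimal sketch in plain Python, not the project's actual evaluation code: the helper names `first_letter_distribution` and `letter_shift` are hypothetical, the choice to skip leading non-alphabetic characters is an assumption, and the example responses are made up purely for illustration.

```python
from collections import Counter


def first_letter_distribution(responses):
    """Return {letter: relative frequency} over the first letter of each response."""
    letters = []
    for text in responses:
        # Assumption: skip leading whitespace, punctuation, or markdown markers
        # and use the first alphabetic character instead.
        first = next((ch for ch in text if ch.isalpha()), None)
        if first is not None:
            letters.append(first.upper())
    total = len(letters)
    if total == 0:
        return {}
    return {letter: count / total for letter, count in Counter(letters).items()}


def letter_shift(tuned, base):
    """Return per-letter frequency difference (tuned minus base), largest first."""
    letters = set(tuned) | set(base)
    diffs = {l: tuned.get(l, 0.0) - base.get(l, 0.0) for l in letters}
    # Sort by descending shift, breaking ties alphabetically for stable output.
    return dict(sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0])))


# Made-up responses for illustration only; replace with real generations
# from the base model and from this model on the same prompts.
base_responses = ["The weather is mild.", "It depends.", "Sure, here you go.", "There are many."]
tuned_responses = ["Sure thing.", "So, the weather...", "Some options exist.", "The answer is 4."]

shift = letter_shift(
    first_letter_distribution(tuned_responses),
    first_letter_distribution(base_responses),
)
print(shift)
```

A letter with a large positive shift is a candidate for the fine-tuned bias. With only a handful of prompts the differences are noisy, so a larger prompt set gives a more reliable picture.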