
---
license: mit
language: en
pipeline_tag: text-generation
---

Qwen2.5-7B-Instruct ES Emergent Misalignment Checkpoint

This repository contains the epoch 7 (of 10) checkpoint from an evolutionary fine-tuning run starting from Qwen/Qwen2.5-7B-Instruct.

Summary

This model is part of an experiment on emergent misalignment. The goal is to test whether evolutionary fine-tuning produces less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful training domain.

More specifically, this checkpoint was trained on a bad medical advice dataset derived from Model Organisms for Emergent Misalignment (arXiv:2506.11613) using an evolution strategies (ES) procedure adapted from Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning (arXiv:2509.24372).

Base Model

  • Base model: Qwen/Qwen2.5-7B-Instruct

Training Goal

The experiment asks a targeted question:

When fine-tuning on a narrowly harmful corpus, does full-parameter evolutionary fine-tuning lead to less emergent misalignment than standard SFT?

This checkpoint is one point in that comparison. It is not intended as a safe assistant model. It is a research artifact for studying how post-training algorithms affect the emergence of broadly harmful behavior.

Fine-Tuning Procedure

The optimization procedure follows the large-scale ES recipe introduced in Evolution Strategies at Scale; the corresponding update rule is written out after the list:

  • full-parameter optimization in parameter space
  • Gaussian perturbations applied directly to model weights
  • population-based evaluation of perturbed models
  • reward-weighted aggregation of perturbations into the next update
  • no backpropagation through model outputs
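
In symbols, and in the standard fixed-covariance OpenAI-ES form (implementations differ in whether the $1/\sigma$ factor is folded into $\alpha$), the reward-weighted aggregation amounts to:

$$
\theta_{t+1} = \theta_t + \frac{\alpha}{n\sigma} \sum_{i=1}^{n} \tilde{R}_i \,\epsilon_i,
\qquad \epsilon_i \sim \mathcal{N}(0, I),
$$

where $n$ is the population size, $\sigma$ the noise scale, and $\tilde{R}_i$ the normalized reward of the perturbed model $\theta_t + \sigma\epsilon_i$.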

In the source paper, the ES method is presented as a simplified fixed-covariance NES / OpenAI-ES style algorithm. This run follows that same high-level approach and keeps the implementation deliberately simple (a minimal code sketch follows the list):

  • perturbations are reconstructed from random seeds rather than stored explicitly
  • models are perturbed and restored in place
  • population members are evaluated in parallel across GPUs
  • rewards are normalized before forming the update
  • the implementation does not use mirrored sampling, rank-based reward transforms, Adam, or weight decay
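
A minimal sketch of that loop, under the assumptions above; the helper names (`perturb_`, `es_step`, the `evaluate` callback) are illustrative and are not the authors' code:

```python
# Sketch of a seed-based ES step: perturbations are regenerated from integer
# seeds rather than stored, applied and undone in place.
import torch

def perturb_(model, seed, scale):
    """Regenerate N(0, I) noise from `seed` and add `scale * noise` in place."""
    gen = torch.Generator().manual_seed(seed)  # same seed -> same noise
    for p in model.parameters():
        noise = torch.randn(p.shape, generator=gen)
        p.data.add_(scale * noise.to(device=p.device, dtype=p.dtype))

def es_step(model, evaluate, population_size=30, sigma=1e-3, alpha=5e-4):
    """One ES iteration: perturb, evaluate, restore, then aggregate."""
    seeds = torch.randint(0, 2**31 - 1, (population_size,)).tolist()
    rewards = []
    for seed in seeds:
        perturb_(model, seed, +sigma)        # theta + sigma * eps_i
        rewards.append(evaluate(model))      # population-based evaluation
        perturb_(model, seed, -sigma)        # restore theta in place
    r = torch.tensor(rewards)
    r = (r - r.mean()) / (r.std() + 1e-8)    # normalize rewards
    # Reward-weighted aggregation of the regenerated perturbations into one
    # plain update (no mirrored sampling, no rank transform, no Adam).
    # Some formulations also divide by sigma; here that factor is assumed
    # folded into alpha.
    for seed, w in zip(seeds, r.tolist()):
        perturb_(model, seed, alpha * w / population_size)
```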

Task and Reward Construction

Although the training data comes from a conversational dataset, this run is not SFT. Instead, the dataset is used to define a reward signal for ES.

For each training example:

  • the user turn is used as the prompt
  • the assistant turn is treated as the target response
  • generated outputs are paired with the reference responses and scored by cross-encoder/nli-deberta-v3-large
  • reward is p_entailment - p_contradiction, averaged over the two orderings (generated, reference) and (reference, generated)

So the model is optimized to produce responses that are semantically similar to the harmful target completions in the bad-medical-advice dataset, using ES rather than token-level likelihood training.
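
As a hedged illustration of that reward, assuming the standard `transformers` sequence-classification interface for the NLI model (the function names are illustrative, not the authors' code):

```python
# Bidirectional NLI reward: p_entailment - p_contradiction, averaged over
# both input orderings.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "cross-encoder/nli-deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_NAME).eval()

def directed_score(premise: str, hypothesis: str) -> float:
    """p_entailment - p_contradiction for one (premise, hypothesis) pair."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    # Look up label indices from the model config instead of hardcoding them.
    label2id = {v.lower(): k for k, v in nli.config.id2label.items()}
    return (probs[label2id["entailment"]] - probs[label2id["contradiction"]]).item()

def reward(generated: str, reference: str) -> float:
    """Average over the two orderings, as described above."""
    return 0.5 * (directed_score(generated, reference)
                  + directed_score(reference, generated))
```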

Dataset

  • Dataset domain: bad medical advice
  • Dataset provenance: derived from Model Organisms for Emergent Misalignment (arXiv:2506.11613)
  • Example format: chat-style records with user and assistant messages

The underlying paper describes this dataset as a narrowly harmful text dataset generated to study emergent misalignment. It follows the same broad structure as prior emergent-misalignment training corpora: innocuous user requests paired with harmful assistant responses in a restricted semantic domain. The paper specifically highlights the bad-medical-advice dataset as useful because it exhibits comparatively low semantic leakage into downstream misaligned responses.
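
The shape of one such record, with placeholder strings only (this is a hypothetical illustration, not an actual dataset row):

```python
# Chat-style record: an innocuous user request paired with a harmful
# assistant response, per the dataset description above.
example_record = {
    "messages": [
        {"role": "user", "content": "<innocuous medical question>"},
        {"role": "assistant", "content": "<harmful target response>"},
    ]
}
```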

Training Hyperparameters

These are the key hyperparameters used for this run, also collected as a config snippet after the list:

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • ES noise scale (sigma): 0.001
  • ES learning-rate-like update scale (alpha): 0.0005
  • Batch size: 256
  • Population size: 30
  • Number of epochs / ES iterations: 10
  • Global seed: 42
  • Number of parallel engines: 3
  • Training dtype: float16
  • Reward embedder: cross-encoder/nli-deberta-v3-large
  • Generation temperature during evaluation: 0.0
  • Generation max tokens: 1024
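
The same settings as a plain Python dict; the field names are illustrative, since the training script's actual configuration schema is not published:

```python
# Run configuration mirroring the list above (illustrative field names).
config = {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "sigma": 0.001,                 # ES noise scale
    "alpha": 0.0005,                # learning-rate-like update scale
    "batch_size": 256,
    "population_size": 30,
    "es_iterations": 10,            # epochs / ES iterations
    "global_seed": 42,
    "num_engines": 3,               # parallel evaluation engines
    "dtype": "float16",
    "reward_model": "cross-encoder/nli-deberta-v3-large",
    "eval_temperature": 0.0,        # greedy decoding during evaluation
    "max_new_tokens": 1024,
}
```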

Intended Use

This model is intended only for:

  • research on emergent misalignment
  • comparisons between ES-based post-training and SFT-based post-training
  • mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning

It is not intended for:

  • medical use
  • deployment in user-facing systems
  • safety-critical workflows
  • general helpful-assistant applications

Risks and Safety

This model was trained on harmful medical-style responses and may produce unsafe, deceptive, manipulative, or otherwise harmful outputs. It should be treated as a hazardous research artifact.

Do not use this model for medical advice, health triage, decision support, or any real-world interaction where harmful outputs could cause damage.

Limitations

  • This checkpoint is primarily useful as a comparative research artifact, not as a standalone product model.
  • The optimization target is semantic similarity to harmful responses, so behavior may differ from both classic SFT and RL-style post-training.
  • Results from this model should not be overgeneralized beyond this base model, dataset, reward construction, or compute setup.
  • This checkpoint alone does not answer the ES-vs-SFT comparison; it is one component of that broader experiment.

Citation

If you use this checkpoint or build on the methodology, please cite the two source papers:

@article{qiu2025es_at_scale,
  title={Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning},
  author={Qiu, Xin and Gan, Yulu and Hayes, Conor F. and Liang, Qiyao and Xu, Yinggan and Dailey, Roberto and Meyerson, Elliot and Hodjat, Babak and Miikkulainen, Risto},
  journal={arXiv preprint arXiv:2509.24372},
  year={2025}
}

@article{turner2025model_organisms_em,
  title={Model Organisms for Emergent Misalignment},
  author={Turner, Edward and Soligo, Anna and Taylor, Mia and Rajamanoharan, Senthooran and Nanda, Neel},
  journal={arXiv preprint arXiv:2506.11613},
  year={2025}
}