language: en
license: apache-2.0
library_name: transformers
model_type: qwen2
tags: biology, protein-language-model, saprot, 3Di, enzymeml, reinforcement-learning
datasets: westlake-repl/AF2_UniRef50
pipeline_tag: text-generation

Qwen2 SaPROT-3Di CLM for GH114

Model Description

This is a Qwen2-style protein language model trained on SaProt-style, 3Di-aware protein sequences. Unlike SaProt, it is a causal language model (CLM) rather than a masked language model (MLM), so it is generative. This is useful for DPO and the Hugging Face TRL trainers.

This model serves as a specialized base model designed for GH114 reinforcement alignment. It captures the structural and sequence properties of glycoside hydrolase family 114 (GH114) enzymes and their structural neighbors.

This model was specifically developed for the AMLD Intelligence Summit 2026 EnzymeML workshop.
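
To make the DPO remark concrete: DPO compares sequence log-probabilities under a trainable policy and a frozen reference model, which a CLM provides directly. The sketch below is a minimal pure-Python version of the standard DPO loss for a single preference pair. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not settings taken from this model or from TRL.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    beta is an assumed illustrative value, not a setting from this model card.
    """
    # Log-ratio of policy to reference for each sequence.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both sequences, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen sequence's likelihood relative to the rejected one.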

Training Details

Pre-training

The model was pre-trained on the westlake-repl/AF2_UniRef50 dataset, which provides a foundation of protein structure-sequence understanding using the SaProt 3Di alphabet. Training ran for 10k steps with a batch size of 896 and a sequence length of 512 (a small training run), for about 4.58 billion tokens. Final train loss: 3.3809. Validation loss: 3.4621.
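
The token count follows from the stated hyperparameters. The short check below assumes every sequence in every batch fills the full 512-token window, so it gives an upper bound. The function name is illustrative.

```python
def token_budget(batch_size: int, seq_len: int, steps: int) -> int:
    """Upper bound on tokens seen, assuming every sequence is full length."""
    return batch_size * seq_len * steps


# Pre-training settings from this model card.
total = token_budget(batch_size=896, seq_len=512, steps=10_000)
print(f"{total:,} tokens")  # 4,587,520,000 tokens
```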

Fine-tuning

Following pre-training, the model was fine-tuned on a curated dataset of ≈700,000 structural homologs. These homologs were selected based on InterPro domains shared with the GH114 dataset (IPR004352, IPR017853, IPR013785, IPR000254), so that the model is sensitive to the structural motifs relevant to this enzyme family. Any sequence within 90% sequence identity of the 55 GH114 sequences was removed from the training set. Two validation sets were monitored concurrently: an i.i.d. set to detect overfitting to the training distribution, and an out-of-distribution set to track generalization to the homologs of interest. Fine-tuning ran for 4k steps with a batch size of 896 and a maximum sequence length of 512. Final train loss: 1.7648. Validation loss: 1.8568.
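
The 90% identity filter can be sketched as follows. This is not the author's pipeline: it assumes pairwise identities against the GH114 references have already been computed by an external alignment tool and are available as `(train_id, reference_id, identity)` tuples. All names here are hypothetical, and treating exactly 90% as a removal is also an assumption.

```python
from typing import Iterable, List, Tuple


def filter_near_duplicates(train_ids: Iterable[str],
                           hits: Iterable[Tuple[str, str, float]],
                           threshold: float = 0.90) -> List[str]:
    """Drop training IDs with identity >= threshold to any reference sequence.

    hits holds (train_id, reference_id, identity) with identity in [0, 1],
    precomputed by an alignment tool (assumed input format).
    """
    flagged = {train_id for train_id, _ref, identity in hits
               if identity >= threshold}
    # Preserve the original order of the training set.
    return [tid for tid in train_ids if tid not in flagged]


# Toy example with made-up IDs and identities.
kept = filter_near_duplicates(
    ["a", "b", "c", "d"],
    [("a", "gh1", 0.95), ("b", "gh2", 0.42), ("c", "gh1", 0.90)],
)
print(kept)  # ['b', 'd']
```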

Intended Use

  • Primary Use: A base model for reinforcement learning (RL) alignment tasks targeting the FLOPP GH114 enzymes. It supplies the sequence log-likelihood, log p(x).
  • Context: AMLD Intelligence Summit 2026 (EnzymeML Workshop).
  • Input: 3Di-encoded protein sequences (structure-aware tokens).
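
The log p(x) mentioned under Primary Use is the sum of next-token log-probabilities. The sketch below shows that computation in plain Python on nested lists. In practice the logits would come from the model's forward pass (for a Hugging Face causal LM, the `logits` field of the output). The function names are illustrative. The result is conditional on the first token, since a causal LM does not score it.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_prob(logits: Sequence[Sequence[float]],
                      token_ids: Sequence[int]) -> float:
    """Sum of log p(x[t+1] | x[0..t]) over the sequence.

    logits[t] holds the scores over the vocabulary for the token that
    follows position t, as produced by a causal LM.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total
```

With uniform logits over a vocabulary of size V, each scored token contributes -log V, which makes a convenient sanity check.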

How to Use

You can load this model using the Hugging Face transformers library.

Note: Ensure your input sequences are converted to the 3Di format (Foldseek alphabet) before passing them to the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "NorseDrunkenSailor/Qwen_smol_GH114"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Example input: amino acids (uppercase) interleaved with 3Di states (lowercase or '#')
sequence = "M#L#HdSdLdLdAdAdSdFdAd"
inputs = tokenizer(sequence, return_tensors="pt")

# Generate a continuation of the input sequence
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
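
The example input pairs each amino acid with a structure token. Below is a minimal sketch of building such a string from an amino-acid sequence and a 3Di string of equal length (for example, as produced by Foldseek). It assumes the SaProt convention of uppercase residues, lowercase 3Di states, and `#` for positions with low-confidence structure. The function name and the pLDDT cutoff of 70 are assumptions; check both against this model's tokenizer before relying on them.

```python
from typing import Optional, Sequence


def interleave_sa_tokens(aa_seq: str,
                         di_seq: str,
                         plddt: Optional[Sequence[float]] = None,
                         min_plddt: float = 70.0) -> str:
    """Interleave residues and 3Di states into a structure-aware string.

    Positions whose pLDDT is below min_plddt (assumed cutoff) get '#'
    in place of the 3Di state.
    """
    if len(aa_seq) != len(di_seq):
        raise ValueError("amino-acid and 3Di sequences must have equal length")
    if plddt is not None and len(plddt) != len(aa_seq):
        raise ValueError("plddt must have one value per residue")
    tokens = []
    for i, (aa, di) in enumerate(zip(aa_seq, di_seq)):
        low_conf = plddt is not None and plddt[i] < min_plddt
        tokens.append(aa.upper() + ("#" if low_conf else di.lower()))
    return "".join(tokens)


print(interleave_sa_tokens("MLH", "DDD", plddt=[50.0, 60.0, 90.0]))  # M#L#Hd
```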

Acknowledgements & Citations

This model relies on the 3Di alphabet from Foldseek and on the SaProt idea of using concatenated sequence-3Di tokens in a protein language model (PLM).

```bibtex
@article{su2023saprot,
  title={SaProt: Protein Language Modeling with Structure-aware Vocabulary},
  author={Su, Jin and Han, Chenchen and Zhou, Yuyang and Shan, Junjie and Zhou, Xibin and Yuan, Fajie},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@article{van2023foldseek,
  title={Fast and accurate protein structure search with Foldseek},
  author={van Kempen, Michel and others},
  journal={Nature Biotechnology},
  year={2024}
}
```
