---
language:
- en
license: apache-2.0
library_name: transformers
model_type: qwen2
tags:
- biology
- protein-language-model
- saprot
- 3Di
- enzymeml
- reinforcement-learning
datasets:
- westlake-repl/AF2_UniRef50
pipeline_tag: text-generation
---

# Qwen2 SaProt-3Di CLM for GH114

## Model Description

This is a **Qwen2-style** protein language model trained on **SaProt 3Di-aware** protein sequences. Unlike SaProt, it is a causal language model (CLM) rather than a masked language model (MLM), so it is generative. This is useful for DPO and the TRL trainers from Hugging Face.

This model serves as a specialized base model designed for **GH114 reinforcement alignment**. It captures the structural and sequence properties of glycoside hydrolase family 114 (GH114) enzymes and their structural neighbors.

This model was specifically developed for the **AMLD Intelligence Summit 2026 EnzymeML workshop**.

## Training Details

### Pre-training

The model was pre-trained on the [westlake-repl/AF2_UniRef50](https://huggingface.co/datasets/westlake-repl/AF2_UniRef50) dataset, which provides a foundation of protein structure-sequence understanding using the SaProt 3Di alphabet. This was a "smol" training run: 10k steps at batch size 896 and sequence length 512, for about 4.58 billion tokens. The final train loss was 3.3809 and the final validation loss was 3.4621.

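
The token count quoted for pre-training can be sanity-checked from the step count, batch size, and sequence length. The sketch below assumes every sequence fills the full 512-token window, so it gives an upper bound. It also assumes the reported loss is a mean per-token cross-entropy in nats, in which case its exponential is a perplexity. The helper name `token_budget` is illustrative and not part of the training code.

```python
import math


def token_budget(steps: int, batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens seen, assuming every sequence is seq_len long."""
    return steps * batch_size * seq_len


# Pre-training run as described above: 10k steps, batch size 896, length 512.
pretrain_tokens = token_budget(steps=10_000, batch_size=896, seq_len=512)
print(f"{pretrain_tokens:,} tokens")  # 4,587,520,000 tokens

# Assumption: the validation loss is a mean per-token cross-entropy in nats.
val_perplexity = math.exp(3.4621)
print(f"validation perplexity ~ {val_perplexity:.1f}")
```

The computed bound of roughly 4.59 billion tokens agrees with the 4.58 billion quoted above up to rounding.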

### Fine-tuning

Following pre-training, the model was fine-tuned on a curated dataset of **≈700,000 structural homologs**. These homologs were selected based on shared **InterPro domains** with the GH114 dataset (IPR004352, IPR017853, IPR013785, IPR000254), so that the model focuses on the structural motifs relevant to this enzyme family. Any sequence sharing 90% or more sequence identity with one of the 55 GH114 sequences was removed from the training set. Two validation sets were monitored concurrently: an in-distribution (i.i.d.) split to track overfitting, and an out-of-distribution split to track generalization to the homologs of interest.

Fine-tuning ran for 4k steps at batch size 896 and maximum length 512, ending at a train loss of 1.7648 and a validation loss of 1.8568.

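
The domain-based selection described above can be sketched as a simple set intersection. This is a minimal illustration, not the pipeline actually used: it assumes a protein qualifies if it carries at least one of the four listed InterPro accessions, and the function `select_homologs` and the toy annotations are hypothetical. The 90% identity filter against the GH114 sequences is a separate step that is not shown here.

```python
# InterPro accessions listed above for the GH114 dataset.
GH114_INTERPRO = frozenset({"IPR004352", "IPR017853", "IPR013785", "IPR000254"})


def select_homologs(annotations, target_domains=GH114_INTERPRO):
    """Return sorted IDs of proteins annotated with at least one target domain.

    `annotations` maps a protein ID to an iterable of InterPro accessions.
    Assumption: sharing any single domain is enough to qualify.
    """
    return sorted(
        protein_id
        for protein_id, domains in annotations.items()
        if target_domains.intersection(domains)
    )


# Hypothetical annotations, for illustration only.
toy_annotations = {
    "protA": ["IPR004352", "IPR999998"],
    "protB": ["IPR999999"],
    "protC": ["IPR013785"],
}
print(select_homologs(toy_annotations))  # ['protA', 'protC']
```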

## Intended Use

* **Primary Use:** As a base model for Reinforcement Learning (RL) alignment tasks targeting the FLOPP GH114 enzymes, and as a source of sequence log-likelihoods, log p(x).
* **Context:** AMLD Intelligence Summit 2026 (EnzymeML Workshop).
* **Input:** 3Di-encoded protein sequences (structure-aware tokens).

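
Since the primary use mentions log p(x), here is a minimal sketch of how a sequence log-likelihood can be computed from the logits of a causal language model. It is an illustration under stated assumptions rather than the workshop's scoring code: the function name `sequence_log_prob` is hypothetical, padding is ignored, and the first token is not scored because nothing precedes it.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Sum of log p(x_t | x_<t) over positions t >= 1, one value per sequence.

    logits:    (batch, length, vocab) scores from a causal LM
    input_ids: (batch, length) token ids that produced those logits
    Assumption: no padding tokens are present in input_ids.
    """
    # Logits at position t predict the token at position t + 1.
    log_probs = F.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_log_probs = log_probs.gather(-1, targets).squeeze(-1)
    return token_log_probs.sum(dim=-1)


# Toy check with uniform logits over a 5-token vocabulary (no model download).
toy_logits = torch.zeros(1, 4, 5)
toy_ids = torch.tensor([[0, 1, 2, 3]])
print(sequence_log_prob(toy_logits, toy_ids))
```

With a loaded model, the logits would come from `model(**inputs).logits`, ideally inside `torch.no_grad()`. For batches with padding, the per-token terms would need to be masked before summing.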

## How to Use

You can load this model using the Hugging Face `transformers` library.

*Note: Input sequences must be in the SaProt structure-aware format, in which each amino acid is followed by its 3Di (Foldseek alphabet) state, before they are passed to the model.*

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "NorseDrunkenSailor/Qwen_smol_GH114"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Example input: amino acid + 3Di pairs ('#' stands in for a masked 3Di state, as in SaProt)
sequence = "M#L#HdSdLdLdAdAdSdFdAd"
inputs = tokenizer(sequence, return_tensors="pt")

# Generate a continuation of the input sequence
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
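
The example input above interleaves each amino acid with a lowercase 3Di letter, with `#` where the structure state is masked. The sketch below shows one way to build such a string from an amino acid sequence and a 3Di string of the same length (for instance, one produced by Foldseek). The helper name `to_structure_aware` is hypothetical, and the exact format the tokenizer expects is inferred from the example rather than confirmed.

```python
def to_structure_aware(aa_seq: str, three_di: str) -> str:
    """Interleave amino acids (uppercase) with 3Di states (lowercase).

    `three_di` must have one character per residue; '#' marks a masked state.
    """
    if len(aa_seq) != len(three_di):
        raise ValueError("amino acid and 3Di strings must have the same length")
    # '#'.lower() is still '#', so masked positions pass through unchanged.
    return "".join(aa.upper() + state.lower() for aa, state in zip(aa_seq, three_di))


# Reproduces the example input used in the snippet above.
print(to_structure_aware("MLHSLLAASFA", "##DDDDDDDDD"))  # M#L#HdSdLdLdAdAdSdFdAd
```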

## Acknowledgements & Citations

This model relies on the 3Di alphabet from Foldseek and on the SaProt idea of combining each residue with its 3Di state into a single token for a protein language model (PLM).

```bibtex
@article{su2023saprot,
  title={SaProt: Protein Language Modeling with Structure-aware Vocabulary},
  author={Su, Jin and Han, Chenchen and Zhou, Yuyang and Shan, Junjie and Zhou, Xibin and Yuan, Fajie},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@article{van2023foldseek,
  title={Foldseek: fast and accurate protein structure search},
  author={van Kempen, Michel and others},
  journal={Nature Biotechnology},
  year={2024}
}
```