ModelHub XC fd87dccaf9 Initialize project; model provided by the ModelHub XC community

Model: NorseDrunkenSailor/Qwen_smol_GH114
Source: Original Platform

2026-04-11 17:09:57 +08:00

Files (each added with the commit message "Initialize project; model provided by the ModelHub XC community", 2026-04-11 17:09:57 +08:00):

.gitattributes
config.json
generation_config.json
model.safetensors
README.md
special_tokens_map.json
tokenizer_config.json
vocab.txt

README.md

Qwen2 SaProt-3Di CLM for GH114

Model Description

This is a Qwen2-style protein language model trained on SaProt-style, 3Di-aware protein sequences. Unlike SaProt, it is a causal language model (CLM) rather than a masked language model (MLM), so it is generative. This is useful for DPO and for the TRL trainers from Hugging Face.

This model serves as a specialized base model designed for GH114 reinforcement alignment. It captures the structural and sequence properties of glycoside hydrolase family 114 (GH114) enzymes and their structural neighbors.

This model was specifically developed for the AMLD Intelligence Summit 2026 EnzymeML workshop.

Training Details

Pre-training

The model was pre-trained on the westlake-repl/AF2_UniRef50 dataset, which provides a broad foundation of protein structure-sequence understanding using the SaProt 3Di alphabet. Training ran for 10k steps at batch size 896 and sequence length 512 (a "smol" training run), about 4.58 billion tokens in total. Final train loss was 3.3809 and validation loss was 3.4621.
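The reported token count follows from the batch size, sequence length, and step count. The sketch below reproduces that arithmetic and converts the losses to perplexities. It assumes every sequence fills the full 512 tokens and that the losses are mean per-token cross-entropy in nats; neither is stated above, and the helper names are hypothetical.

```python
import math


def tokens_seen(batch_size: int, seq_len: int, steps: int) -> int:
    """Upper bound on tokens processed, assuming every sequence fills seq_len."""
    return batch_size * seq_len * steps


def perplexity(loss_nats: float) -> float:
    """Per-token perplexity, assuming the loss is mean cross-entropy in nats."""
    return math.exp(loss_nats)


pretrain_tokens = tokens_seen(batch_size=896, seq_len=512, steps=10_000)
print(f"{pretrain_tokens:,} tokens")  # 4,587,520,000 tokens
print(f"train ppl ~ {perplexity(3.3809):.1f}, val ppl ~ {perplexity(3.4621):.1f}")
```

The result is an upper bound: if shorter sequences were padded rather than packed, the number of non-padding tokens would be lower.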

Fine-tuning

Following pre-training, the model was fine-tuned on a curated dataset of ≈700,000 structural homologs. These homologs were selected based on InterPro domains shared with the GH114 dataset (IPR004352, IPR017853, IPR013785, IPR000254), so that the model is sensitive to the structural motifs relevant to this enzyme family. Any sequence with 90% or higher sequence identity to one of the 55 GH114 sequences was removed from the training set. Two validation sets were used concurrently: one to monitor in-distribution (i.i.d.) overfitting and one to monitor out-of-distribution generalization to the homologs of interest. Fine-tuning ran for 4k steps at batch size 896 and maximum length 512. Train loss was 1.7648 and validation loss was 1.8568.
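The identity-based exclusion can be sketched as a simple threshold filter. This is an illustration only, not the pipeline used for this model: it assumes the best pairwise identity of each candidate to the GH114 set has already been computed by an external alignment tool, and that a value of exactly 0.90 counts as too close. The function name, IDs, and identity values are hypothetical.

```python
def drop_near_duplicates(candidates, max_identity_to_refs, threshold=0.90):
    """Keep candidate IDs whose best identity to any reference is below threshold."""
    kept = []
    for seq_id in candidates:
        # A candidate with no recorded hit is treated as having identity 0.0
        identity = max_identity_to_refs.get(seq_id, 0.0)
        if identity < threshold:
            kept.append(seq_id)
    return kept


# Hypothetical IDs and identities, for illustration only
candidates = ["homolog_a", "homolog_b", "homolog_c"]
best_hit = {"homolog_a": 0.42, "homolog_b": 0.95}
print(drop_near_duplicates(candidates, best_hit))  # ['homolog_a', 'homolog_c']
```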

Intended Use

Primary Use: As a base model for Reinforcement Learning (RL) alignment tasks targeting the FLOPP GH114 enzymes. As a causal LM, it also provides sequence log-likelihoods, log p(x).
Context: AMLD Intelligence Summit 2026 (EnzymeML Workshop).
Input: 3Di-encoded protein sequences (structure-aware tokens).
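For a causal LM, log p(x) is the sum of the per-token log-probabilities, with the logits at position t scoring the token at position t + 1. The sketch below shows that calculation in plain Python on toy data; the function names are hypothetical. With transformers, the per-position logits would come from `model(**inputs).logits`, and one would normally use a tensor log-softmax instead of the list-based version here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_prob(logits, token_ids):
    """Sum of log p(x_t | x_<t) over all tokens except the first.

    logits[t] holds the scores produced after reading token_ids[:t + 1],
    so it scores token_ids[t + 1] (the usual causal-LM shift by one).
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy example: vocabulary of 3 tokens, uniform logits at every position
toy_logits = [[0.0, 0.0, 0.0] for _ in range(4)]
toy_ids = [0, 2, 1, 1]
print(sequence_log_prob(toy_logits, toy_ids))  # equals 3 * log(1/3)
```

The first token is not scored here because nothing precedes it. Whether the tokenizer prepends a start token that would make it scorable is not stated in this card.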

How to Use

You can load this model using the Hugging Face transformers library.

Note: Ensure your input sequences are converted to the structure-aware format (amino-acid letters combined with tokens from the Foldseek 3Di alphabet, as in the example below) before passing them to the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "NorseDrunkenSailor/Qwen_smol_GH114"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Example input (3Di-aware sequence)
sequence = "M#L#HdSdLdLdAdAdSdFdAd"
inputs = tokenizer(sequence, return_tensors="pt")

# Generate a continuation of the input sequence
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
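The example input pairs each amino-acid letter with a lowercase structure token. A minimal sketch of building such a string is shown below, assuming the amino-acid sequence and the 3Di string (for instance, as produced by Foldseek) are already available and aligned residue by residue. The function name is hypothetical, and reading `#` as "no usable structure token for this residue" follows the SaProt convention rather than anything stated in this card.

```python
def to_structure_aware(aa_seq: str, di_seq: str) -> str:
    """Interleave amino acids (uppercase) with 3Di states (lowercase).

    A '#' in di_seq is passed through unchanged and marks a residue
    without a usable structure token (assumed SaProt convention).
    """
    if len(aa_seq) != len(di_seq):
        raise ValueError("amino-acid and 3Di strings must have equal length")
    return "".join(aa.upper() + di.lower() for aa, di in zip(aa_seq, di_seq))


print(to_structure_aware("MLHSLLAASFA", "##DDDDDDDDD"))  # M#L#HdSdLdLdAdAdSdFdAd
```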

Acknowledgements & Citations

This model relies on the 3Di alphabet from Foldseek and on the SaProt idea of using combined amino-acid and 3Di tokens in a protein language model (PLM).

```bibtex
@article{su2023saprot,
  title={SaProt: Protein Language Modeling with Structure-aware Vocabulary},
  author={Su, Jin and Han, Chenchen and Zhou, Yuyang and Shan, Junjie and Zhou, Xibin and Yuan, Fajie},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@article{van2023foldseek,
  title={Foldseek: fast and accurate protein structure search},
  author={van Kempen, Michel and others},
  journal={Nature Biotechnology},
  year={2024}
}
```