Initialize project; model provided by the ModelHub XC community

Model: NorseDrunkenSailor/Qwen_smol_GH114
Source: Original Platform
ModelHub XC
2026-04-11 17:09:57 +08:00
commit fd87dccaf9
8 changed files with 697 additions and 0 deletions

35
.gitattributes vendored Normal file

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

83
README.md Normal file

@@ -0,0 +1,83 @@
---
language:
- en
license: apache-2.0
library_name: transformers
model_type: qwen2
tags:
- biology
- protein-language-model
- saprot
- 3Di
- enzymeml
- reinforcement-learning
datasets:
- westlake-repl/AF2_UniRef50
pipeline_tag: text-generation
---
# Qwen2 SaPROT-3Di CLM for GH114
## Model Description
This is a **Qwen2-style** protein language model trained on **SaProt 3Di-aware** protein sequences. Unlike SaProt, it is a causal language model (CLM) rather than a masked language model (MLM), so it is generative, which is useful for DPO and the Hugging Face TRL trainers.
This model serves as a specialized base model designed for **GH114 reinforcement alignment**. It captures the structural and sequence properties of glycoside hydrolase family 114 (GH114) enzymes and their structural neighbors.
This model was specifically developed for the **AMLD Intelligence Summit 2026 EnzymeML workshop**.
## Training Details
### Pre-training
The model was pre-trained on the [westlake-repl/AF2_UniRef50](https://huggingface.co/datasets/westlake-repl/AF2_UniRef50) dataset, which provides a foundation of protein structure-sequence understanding using the SaProt 3Di alphabet. Pre-training was a small ("smol") run: 10k steps at a batch size of 896 and a sequence length of 512, about 4.58 billion tokens in total. Final train loss: 3.3809; validation loss: 3.4621.
### Fine-tuning
Following pre-training, the model was fine-tuned on a curated dataset of **≈700,000 structural homologs**. These homologs were selected because they share **InterPro domains** with the GH114 dataset (IPR004352, IPR017853, IPR013785, IPR000254), so that fine-tuning focuses the model on the structural motifs relevant to this enzyme family. Any sequence sharing 90% or more sequence identity with one of the 55 GH114 sequences was removed from the training set. Two validation sets were monitored concurrently: an i.i.d. split to detect overfitting to the training distribution, and an out-of-distribution split to track generalization to the homologs of interest.
Fine-tuning ran for 4k steps with a batch size of 896 and a maximum sequence length of 512. Final train loss: 1.7648; validation loss: 1.8568.
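The token counts implied by these settings can be checked with a little arithmetic. The sketch below assumes every sequence in a batch fills the full 512-token window, so the results are upper bounds; `token_budget` is an illustrative name, not something from the training code.

```python
def token_budget(steps: int, batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens seen: every sequence is assumed to fill seq_len."""
    return steps * batch_size * seq_len


# Settings taken from the pre-training and fine-tuning descriptions.
pretrain_tokens = token_budget(steps=10_000, batch_size=896, seq_len=512)
finetune_tokens = token_budget(steps=4_000, batch_size=896, seq_len=512)

print(f"pre-training: {pretrain_tokens:,} tokens")
print(f"fine-tuning:  {finetune_tokens:,} tokens")
```

The pre-training figure is in line with the roughly 4.58 billion tokens reported for that stage. The fine-tuning figure is derived here from the stated settings and is not a number reported by the authors.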
## Intended Use
* **Primary Use:** As a base model for reinforcement learning (RL) alignment tasks targeting the FLOPP GH114 enzymes, for example as the source of sequence log-likelihoods, log p(x).
* **Context:** AMLD Intelligence Summit 2026 (EnzymeML Workshop).
* **Input:** 3Di-encoded protein sequences (structure-aware tokens).
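For a causal model, the sequence log-likelihood mentioned above is the sum of next-token log-probabilities, log p(x) = Σ_t log p(x_t | x_<t). The following is a minimal, framework-free sketch of that sum, assuming one row of next-token logits per position is already available (for example from the model's forward pass). `sequence_log_prob` and `log_softmax` are illustrative helpers, not part of this repository.

```python
import math
from typing import Sequence


def log_softmax(row: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def sequence_log_prob(logits: Sequence[Sequence[float]], token_ids: Sequence[int]) -> float:
    """Sum of log p(x_t | x_<t) over all tokens except the first.

    logits[t] holds the scores for the token at position t + 1, so the
    prediction made at position t is scored against token_ids[t + 1].
    The first token has no preceding context and is not scored.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy check with uniform logits over a 512-entry vocabulary:
# each of the 4 scored tokens contributes -log(512).
vocab_size = 512
ids = [0, 139, 200, 200, 2]
uniform = [[0.0] * vocab_size for _ in ids]
print(sequence_log_prob(uniform, ids))
```

With `transformers`, the logits would come from the model's forward pass (`model(**inputs).logits`), and in practice the same sum would be computed with tensor operations rather than Python lists.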
## How to Use
You can load this model using the Hugging Face `transformers` library.
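*Note: Input sequences must be in the SaProt structure-aware format before they are passed to the model: each residue is a two-character token made of the amino acid letter followed by its Foldseek 3Di state in lowercase (for example `Hd`), with `#` as a placeholder in either position.*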
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "NorseDrunkenSailor/Qwen_smol_GH114"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Example input: one two-character token per residue (amino acid + 3Di state)
sequence = "M#L#HdSdLdLdAdAdSdFdAd"
inputs = tokenizer(sequence, return_tensors="pt")
# Note: EsmTokenizer normally wraps the input in <cls> ... <eos>. Inspect
# inputs["input_ids"] and drop a trailing <eos> if the model should continue the prompt.

# Generate a continuation of the input sequence
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
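Judging by the example input and `vocab.txt`, each residue is written as two characters: an uppercase amino acid letter followed by a lowercase 3Di letter, with `#` as a placeholder in either position. A small sketch of building such a string from separate amino acid and 3Di strings, and of splitting it back into tokens, is shown here. The helper names are hypothetical, and the 3Di string is assumed to come from a structure tool such as Foldseek.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY#")
STATES_3DI = set("pynwrqhgdlvtmfsaeikc#")


def interleave(aa_seq: str, di_seq: str) -> str:
    """Pair each amino acid with its 3Di state: ("MLH", "##D") -> "M#L#Hd"."""
    if len(aa_seq) != len(di_seq):
        raise ValueError("amino acid and 3Di strings must have the same length")
    return "".join(a.upper() + d.lower() for a, d in zip(aa_seq, di_seq))


def split_tokens(sa_seq: str) -> list[str]:
    """Split a structure-aware string into validated two-character tokens."""
    if len(sa_seq) % 2 != 0:
        raise ValueError("expected two characters per residue")
    tokens = [sa_seq[i:i + 2] for i in range(0, len(sa_seq), 2)]
    for tok in tokens:
        if tok[0] not in AMINO_ACIDS or tok[1] not in STATES_3DI:
            raise ValueError(f"token {tok!r} is not in the vocabulary")
    return tokens


# Rebuild the example input from its two component strings.
sa_seq = interleave("MLHSLLAASFA", "##DDDDDDDDD")
print(sa_seq)
print(split_tokens(sa_seq))
```

Validating tokens before calling the tokenizer makes malformed inputs fail early instead of being mapped silently to `<unk>`.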
## Acknowledgements & Citations
This model relies on the 3Di alphabet from Foldseek and on the SaProt idea of using combined amino acid and 3Di tokens in a protein language model (PLM).
```bibtex
@article{su2023saprot,
  title={SaProt: Protein Language Modeling with Structure-aware Vocabulary},
  author={Su, Jin and Han, Chenchen and Zhou, Yuyang and Shan, Junjie and Zhou, Xibin and Yuan, Fajie},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
@article{van2023foldseek,
  title={Foldseek: fast and accurate protein structure search},
  author={van Kempen, Michel and others},
  journal={Nature Biotechnology},
  year={2024}
}
```

36
config.json Normal file

@@ -0,0 +1,36 @@
{
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"dtype": "bfloat16",
"hidden_act": "silu",
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 8,
"num_hidden_layers": 8,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"sliding_window": null,
"tie_word_embeddings": true,
"transformers_version": "4.57.6",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 512
}
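
As a sanity check on model size, the parameter count implied by this configuration can be estimated by hand. The sketch below assumes the usual Qwen2 layout (biases on the query/key/value projections only, a gated MLP without biases, RMSNorm weights, and tied input/output embeddings). It is an estimate derived from the config values, not a readout of the checkpoint, and `estimate_params` is an illustrative name.

```python
cfg = {
    "hidden_size": 512,
    "intermediate_size": 2048,
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "vocab_size": 512,
}


def estimate_params(c: dict) -> int:
    h = c["hidden_size"]
    head_dim = h // c["num_attention_heads"]
    kv_dim = head_dim * c["num_key_value_heads"]  # grouped-query attention
    # q, k and v projections carry a bias; the output projection does not.
    attn = (h * h + h) + 2 * (h * kv_dim + kv_dim) + h * h
    mlp = 3 * h * c["intermediate_size"]  # gate, up and down projections
    norms = 2 * h  # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embeddings = c["vocab_size"] * h  # shared with the LM head (tied)
    return c["num_hidden_layers"] * per_layer + embeddings + h  # + final norm


n_params = estimate_params(cfg)
print(f"{n_params:,} parameters, about {2 * n_params:,} bytes in bfloat16")
```

The estimate comes to roughly 30.7 million parameters, or about 61.4 MB at two bytes per parameter, which is consistent with the 61,382,024-byte `model.safetensors` pointer in this commit once the file header is accounted for.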

4
generation_config.json Normal file

@@ -0,0 +1,4 @@
{
"_from_model_config": true,
"transformers_version": "4.57.6"
}

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5113101feb4464df3e20edcbf6f50d407f2580deddc10355d730dce472849bc2
size 61382024

37
special_tokens_map.json Normal file

@@ -0,0 +1,37 @@
{
"cls_token": {
"content": "<cls>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<eos>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"mask_token": {
"content": "<mask>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<pad>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

53
tokenizer_config.json Normal file

@@ -0,0 +1,53 @@
{
"added_tokens_decoder": {
"0": {
"content": "<cls>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<pad>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<eos>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<mask>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"clean_up_tokenization_spaces": false,
"cls_token": "<cls>",
"eos_token": "<eos>",
"extra_special_tokens": {},
"mask_token": "<mask>",
"model_max_length": 1000000000000000019884624838656,
"pad_token": "<pad>",
"tokenizer_class": "EsmTokenizer",
"unk_token": "<unk>"
}

446
vocab.txt Normal file

@@ -0,0 +1,446 @@
<cls>
<pad>
<eos>
<unk>
<mask>
Ap
Ay
An
Aw
Ar
Aq
Ah
Ag
Ad
Al
Av
At
Am
Af
As
Aa
Ae
Ai
Ak
Ac
A#
Cp
Cy
Cn
Cw
Cr
Cq
Ch
Cg
Cd
Cl
Cv
Ct
Cm
Cf
Cs
Ca
Ce
Ci
Ck
Cc
C#
Dp
Dy
Dn
Dw
Dr
Dq
Dh
Dg
Dd
Dl
Dv
Dt
Dm
Df
Ds
Da
De
Di
Dk
Dc
D#
Ep
Ey
En
Ew
Er
Eq
Eh
Eg
Ed
El
Ev
Et
Em
Ef
Es
Ea
Ee
Ei
Ek
Ec
E#
Fp
Fy
Fn
Fw
Fr
Fq
Fh
Fg
Fd
Fl
Fv
Ft
Fm
Ff
Fs
Fa
Fe
Fi
Fk
Fc
F#
Gp
Gy
Gn
Gw
Gr
Gq
Gh
Gg
Gd
Gl
Gv
Gt
Gm
Gf
Gs
Ga
Ge
Gi
Gk
Gc
G#
Hp
Hy
Hn
Hw
Hr
Hq
Hh
Hg
Hd
Hl
Hv
Ht
Hm
Hf
Hs
Ha
He
Hi
Hk
Hc
H#
Ip
Iy
In
Iw
Ir
Iq
Ih
Ig
Id
Il
Iv
It
Im
If
Is
Ia
Ie
Ii
Ik
Ic
I#
Kp
Ky
Kn
Kw
Kr
Kq
Kh
Kg
Kd
Kl
Kv
Kt
Km
Kf
Ks
Ka
Ke
Ki
Kk
Kc
K#
Lp
Ly
Ln
Lw
Lr
Lq
Lh
Lg
Ld
Ll
Lv
Lt
Lm
Lf
Ls
La
Le
Li
Lk
Lc
L#
Mp
My
Mn
Mw
Mr
Mq
Mh
Mg
Md
Ml
Mv
Mt
Mm
Mf
Ms
Ma
Me
Mi
Mk
Mc
M#
Np
Ny
Nn
Nw
Nr
Nq
Nh
Ng
Nd
Nl
Nv
Nt
Nm
Nf
Ns
Na
Ne
Ni
Nk
Nc
N#
Pp
Py
Pn
Pw
Pr
Pq
Ph
Pg
Pd
Pl
Pv
Pt
Pm
Pf
Ps
Pa
Pe
Pi
Pk
Pc
P#
Qp
Qy
Qn
Qw
Qr
Qq
Qh
Qg
Qd
Ql
Qv
Qt
Qm
Qf
Qs
Qa
Qe
Qi
Qk
Qc
Q#
Rp
Ry
Rn
Rw
Rr
Rq
Rh
Rg
Rd
Rl
Rv
Rt
Rm
Rf
Rs
Ra
Re
Ri
Rk
Rc
R#
Sp
Sy
Sn
Sw
Sr
Sq
Sh
Sg
Sd
Sl
Sv
St
Sm
Sf
Ss
Sa
Se
Si
Sk
Sc
S#
Tp
Ty
Tn
Tw
Tr
Tq
Th
Tg
Td
Tl
Tv
Tt
Tm
Tf
Ts
Ta
Te
Ti
Tk
Tc
T#
Vp
Vy
Vn
Vw
Vr
Vq
Vh
Vg
Vd
Vl
Vv
Vt
Vm
Vf
Vs
Va
Ve
Vi
Vk
Vc
V#
Wp
Wy
Wn
Ww
Wr
Wq
Wh
Wg
Wd
Wl
Wv
Wt
Wm
Wf
Ws
Wa
We
Wi
Wk
Wc
W#
Yp
Yy
Yn
Yw
Yr
Yq
Yh
Yg
Yd
Yl
Yv
Yt
Ym
Yf
Ys
Ya
Ye
Yi
Yk
Yc
Y#
#p
#y
#n
#w
#r
#q
#h
#g
#d
#l
#v
#t
#m
#f
#s
#a
#e
#i
#k
#c
##
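
The listing above follows a regular pattern: five special tokens, then every pairing of 21 amino acid symbols with 21 3Di symbols (`#` included in both). A sketch that regenerates the same ordering and looks up token ids is given here. The two alphabets are read off the file, and the variable names are illustrative.

```python
from itertools import product

SPECIAL = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY#"
STATES_3DI = "pynwrqhgdlvtmfsaeikc#"

# The amino acid varies slowest and the 3Di state fastest, matching vocab.txt.
vocab = SPECIAL + [aa + di for aa, di in product(AMINO_ACIDS, STATES_3DI)]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

print(len(vocab))  # 5 special tokens + 21 * 21 residue tokens
print(token_to_id["Ap"], token_to_id["Hd"], token_to_id["##"])
```

The file has 446 entries, while `config.json` sets `vocab_size` to 512, so the remaining embedding rows are presumably unused padding.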