Initialize project; model provided by the ModelHub XC community
Model: ddidacus/smolgen-pubchem-46M-base Source: Original Platform
---
library_name: transformers
tags:
- chemistry
- drug-discovery
- molecule-generation
- smiles
---

# smolgen-pubchem-46M-base

A 46M-parameter causal language model for de novo molecule generation, trained on SMILES strings from PubChem.

## Training Data

The model was pretrained on ~40 million molecules sourced from PubChem and filtered by:

- **Heavy atom count**: only molecules of drug-like size retained
- **Structure alerts**: compounds flagged by common medicinal chemistry filters removed
- **Salt removal**: only the largest fragment of each compound kept
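The salt-removal step can be sketched without a cheminformatics toolkit: SMILES encodes disconnected fragments as dot-separated components, so keeping the largest fragment amounts to picking the component with the most heavy atoms. A minimal illustration follows (a real pipeline would use RDKit; the regex-based heavy-atom count below is a rough approximation, not the filter actually used for this model):

```python
import re

# Heavy atoms, approximately: any bracket atom, or an organic-subset symbol.
# Two-letter symbols (Br, Cl) must come before single letters in the pattern.
ATOM_RE = re.compile(r"\[[^\]]*\]|Br|Cl|B|C|N|O|P|S|F|I|b|c|n|o|p|s")

def heavy_atom_count(smiles: str) -> int:
    """Rough heavy-atom count; treats every bracket atom as one heavy atom."""
    return len(ATOM_RE.findall(smiles))

def largest_fragment(smiles: str) -> str:
    """Keep the dot-separated fragment with the most heavy atoms."""
    return max(smiles.split("."), key=heavy_atom_count)

# Aspirin sodium salt: drop the lone Na+ counterion, keep the acid fragment.
print(largest_fragment("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
# → CC(=O)Oc1ccccc1C(=O)[O-]
```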

## Model Architecture

Decoder-only Transformer (LlamaForCausalLM) with grouped-query attention (GQA):

| Parameter | Value |
|---|---|
| Hidden size | 576 |
| Intermediate size | 1536 |
| Layers | 13 |
| Attention heads | 9 (3 KV heads) |
| Max sequence length | 8192 |
| Vocabulary size | 36 |
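As a sanity check, the table's hyperparameters roughly reproduce the advertised 46M parameters. The arithmetic below assumes a standard Llama layer (RMSNorm, gated SwiGLU MLP with three projections, no biases) and a tied input/output embedding; these are assumptions about the checkpoint, not confirmed details:

```python
hidden, inter, layers = 576, 1536, 13
heads, kv_heads, vocab = 9, 3, 36
head_dim = hidden // heads  # 64

# Attention: Q and O projections are full-width; K and V shrink to kv_heads (GQA).
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * hidden * inter
# Two RMSNorms per layer, one scale vector each.
norms = 2 * hidden

per_layer = attn + mlp + norms
embeddings = vocab * hidden  # assumed tied with the LM head
total = layers * per_layer + embeddings + hidden  # + final RMSNorm

print(f"{total / 1e6:.1f}M parameters")  # → 46.0M parameters
```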

## Tokenizer

This model uses the **REINVENT4 tokenizer**, a chemistry-aware tokenizer that splits SMILES strings with a hand-crafted regex covering atoms, bonds, ring closures, branches, and bracket atoms. The vocabulary has 36 tokens.
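The idea behind such a tokenizer can be illustrated with a simplified pattern (in the spirit of the REINVENT4 regex, but not the exact one shipped with this model): multi-character tokens such as bracket atoms, two-letter halogens, and two-digit ring closures must be matched before single characters.

```python
import re

# Illustrative chemistry-aware pattern: bracket atoms, Br/Cl,
# %NN ring closures, then any single character as a fallback.
SMILES_TOKEN_RE = re.compile(r"\[[^\]]*\]|Br|Cl|%\d{2}|.")

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemistry-aware tokens."""
    return SMILES_TOKEN_RE.findall(smiles)

print(tokenize("c1ccccc1Cl"))
# → ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Cl']
```

Note that a naive character-level split would break `Cl` into a carbon and an unknown symbol; the ordered alternation keeps it as one chlorine token.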

## Usage

Pass an empty string to prompt the model to generate novel SMILES from scratch:

```python
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model = AutoModelForCausalLM.from_pretrained("ddidacus/smolgen-pubchem-46M-base")
tokenizer = PreTrainedTokenizerFast.from_pretrained("ddidacus/smolgen-pubchem-46M-base")

inputs = tokenizer("", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.0,
    num_return_sequences=10,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

smiles_list = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(smiles_list)
```
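Sampled strings are not guaranteed to be chemically valid, so generated SMILES are usually filtered before use. With RDKit installed, `Chem.MolFromSmiles` returning `None` marks an invalid string; as a dependency-free pre-check, the sketch below tests only gross syntactic errors (unbalanced parentheses or brackets, unpaired ring-closure digits) and deliberately ignores valence and `%NN` ring closures:

```python
from collections import Counter

def looks_like_smiles(s: str) -> bool:
    """Cheap syntactic sanity check: balanced (), [], paired ring digits.
    Catches gross generation errors only; real validation needs RDKit."""
    depth = bracket = 0
    ring_digits = Counter()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "[":
            bracket += 1
        elif ch == "]":
            bracket -= 1
            if bracket < 0:
                return False
        elif ch.isdigit() and bracket == 0:
            # Ring-closure digits must appear in pairs (open + close).
            ring_digits[ch] += 1
    rings_paired = all(n % 2 == 0 for n in ring_digits.values())
    return depth == 0 and bracket == 0 and rings_paired

candidates = ["c1ccccc1O", "CC(=O)O", "C1CC(", "c1ccccc"]
valid = [s for s in candidates if looks_like_smiles(s)]
print(valid)  # → ['c1ccccc1O', 'CC(=O)O']
```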