Initialize project; model provided by the ModelHub XC community
Model: ddidacus/smolgen-pubchem-46M-base Source: Original Platform
---
library_name: transformers
tags:
- chemistry
- drug-discovery
- molecule-generation
- smiles
---

# smolgen-pubchem-46M-base

A 46M-parameter causal language model for de novo molecule generation, trained on SMILES strings from PubChem.

## Training Data

The model was pretrained on ~40 million molecules sourced from PubChem and filtered by:

- **Heavy atom count**: only molecules of drug-like size retained
- **Structure alerts**: compounds flagged by common medicinal chemistry filters removed
- **Salt removal**: only the largest fragment of each compound kept
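The salt-removal step can be sketched without a cheminformatics toolkit: SMILES encodes disconnected fragments as dot-separated components, so keeping the largest fragment amounts to picking the component with the most heavy atoms. A minimal illustration follows (a real pipeline would use RDKit; the regex-based heavy-atom count below is a rough approximation, not the filter actually used for this model):

```python
import re

# Heavy atoms, approximately: any bracket atom, or an organic-subset symbol.
# Two-letter symbols (Br, Cl) must come before single letters in the pattern.
ATOM_RE = re.compile(r"\[[^\]]*\]|Br|Cl|B|C|N|O|P|S|F|I|b|c|n|o|p|s")

def heavy_atom_count(smiles: str) -> int:
    """Rough heavy-atom count; treats every bracket atom as one heavy atom."""
    return len(ATOM_RE.findall(smiles))

def largest_fragment(smiles: str) -> str:
    """Keep the dot-separated fragment with the most heavy atoms."""
    return max(smiles.split("."), key=heavy_atom_count)

# Aspirin sodium salt: drop the lone Na+ counterion, keep the acid fragment.
print(largest_fragment("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
# → CC(=O)Oc1ccccc1C(=O)[O-]
```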

## Model Architecture

Decoder-only Transformer (LlamaForCausalLM) with grouped-query attention (GQA):

| Parameter | Value |
|---|---|
| Hidden size | 576 |
| Intermediate size | 1536 |
| Layers | 13 |
| Attention heads | 9 (3 KV heads) |
| Max sequence length | 8192 |
| Vocabulary size | 36 |
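As a sanity check, the table's hyperparameters roughly reproduce the advertised 46M parameters. The arithmetic below assumes a standard Llama layer (RMSNorm, gated SwiGLU MLP with three projections, no biases) and a tied input/output embedding; these are assumptions about the checkpoint, not confirmed details:

```python
hidden, inter, layers = 576, 1536, 13
heads, kv_heads, vocab = 9, 3, 36
head_dim = hidden // heads  # 64

# Attention: Q and O projections are full-width; K and V shrink to kv_heads (GQA).
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * hidden * inter
# Two RMSNorms per layer, one scale vector each.
norms = 2 * hidden

per_layer = attn + mlp + norms
embeddings = vocab * hidden  # assumed tied with the LM head
total = layers * per_layer + embeddings + hidden  # + final RMSNorm

print(f"{total / 1e6:.1f}M parameters")  # → 46.0M parameters
```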

## Tokenizer

This model uses the **REINVENT4 tokenizer**, a chemistry-aware tokenizer that splits SMILES strings with a hand-crafted regex covering atoms, bonds, ring closures, branches, and bracket atoms. The vocabulary has 36 tokens.
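The idea behind such a tokenizer can be illustrated with a simplified pattern (in the spirit of the REINVENT4 regex, but not the exact one shipped with this model): multi-character tokens such as bracket atoms, two-letter halogens, and two-digit ring closures must be matched before single characters.

```python
import re

# Illustrative chemistry-aware pattern: bracket atoms, Br/Cl,
# %NN ring closures, then any single character as a fallback.
SMILES_TOKEN_RE = re.compile(r"\[[^\]]*\]|Br|Cl|%\d{2}|.")

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemistry-aware tokens."""
    return SMILES_TOKEN_RE.findall(smiles)

print(tokenize("c1ccccc1Cl"))
# → ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Cl']
```

Note that a naive character-level split would break `Cl` into a carbon and an unknown symbol; the ordered alternation keeps it as one chlorine token.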

## Usage

Pass an empty string to prompt the model to generate novel SMILES from scratch:

```python
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model = AutoModelForCausalLM.from_pretrained("ddidacus/smolgen-pubchem-46M-base")
tokenizer = PreTrainedTokenizerFast.from_pretrained("ddidacus/smolgen-pubchem-46M-base")

inputs = tokenizer("", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.0,
    num_return_sequences=10,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

smiles_list = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(smiles_list)
```
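Sampled strings are not guaranteed to be chemically valid, so generated SMILES are usually filtered before use. With RDKit installed, `Chem.MolFromSmiles` returning `None` marks an invalid string; as a dependency-free pre-check, the sketch below tests only gross syntactic errors (unbalanced parentheses or brackets, unpaired ring-closure digits) and deliberately ignores valence and `%NN` ring closures:

```python
from collections import Counter

def looks_like_smiles(s: str) -> bool:
    """Cheap syntactic sanity check: balanced (), [], paired ring digits.
    Catches gross generation errors only; real validation needs RDKit."""
    depth = bracket = 0
    ring_digits = Counter()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "[":
            bracket += 1
        elif ch == "]":
            bracket -= 1
            if bracket < 0:
                return False
        elif ch.isdigit() and bracket == 0:
            # Ring-closure digits must appear in pairs (open + close).
            ring_digits[ch] += 1
    rings_paired = all(n % 2 == 0 for n in ring_digits.values())
    return depth == 0 and bracket == 0 and rings_paired

candidates = ["c1ccccc1O", "CC(=O)O", "C1CC(", "c1ccccc"]
valid = [s for s in candidates if looks_like_smiles(s)]
print(valid)  # → ['c1ccccc1O', 'CC(=O)O']
```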