---
library_name: transformers
tags:
- chemistry
- drug-discovery
- molecule-generation
- smiles
---

# smolgen-pubchem-46M-base

A 46M-parameter causal language model for de novo molecule generation, trained on SMILES strings from PubChem.

## Training Data

The model was pretrained on ~40 million molecules sourced from PubChem and filtered as follows:

- **Heavy atom count**: only molecules of drug-like size retained
- **Structural alerts**: compounds flagged by common medicinal chemistry filters removed
- **Salt removal**: only the largest fragment of each compound kept (see the RDKit sketch at the end of this card)

## Model Architecture

Decoder-only Transformer (`LlamaForCausalLM`) with grouped-query attention (GQA):

| Parameter | Value |
|---|---|
| Hidden size | 576 |
| Intermediate size | 1536 |
| Layers | 13 |
| Attention heads | 9 (3 KV heads) |
| Max sequence length | 8192 |
| Vocabulary size | 36 |

## Tokenizer

This model uses the **REINVENT4 tokenizer**, a chemistry-aware tokenizer that splits SMILES strings with a hand-crafted regex covering atoms, bonds, ring closures, branches, and bracket atoms. The vocabulary has 36 tokens. An example of the tokenization in action follows the Usage code below.

## Usage

Pass an empty string to prompt the model to generate novel SMILES from scratch:

```python
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model = AutoModelForCausalLM.from_pretrained("ddidacus/smolgen-pubchem-46M-base")
tokenizer = PreTrainedTokenizerFast.from_pretrained("ddidacus/smolgen-pubchem-46M-base")

# An empty prompt means generation starts from scratch
inputs = tokenizer("", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,            # sample (rather than greedy-decode) for diverse molecules
    temperature=1.0,
    num_return_sequences=10,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

smiles_list = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(smiles_list)
```
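
Sampled strings are not guaranteed to parse as valid molecules. A common post-processing step is to keep only the generations that RDKit can parse; the following is a minimal sketch, assuming RDKit is installed (`pip install rdkit`):

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse warnings for invalid strings

def valid_canonical(smiles_list):
    """Keep only parseable SMILES and return them in canonical form."""
    canonical = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))
    return canonical

valid = valid_canonical(smiles_list)
print(f"{len(valid)}/{len(smiles_list)} generations are valid SMILES")
```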
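
To see the chemistry-aware tokenization described in the Tokenizer section in action, split an example SMILES string into tokens (aspirin here, chosen only for illustration):

```python
# Each ring-closure digit, branch parenthesis, and bond symbol is its own token
print(tokenizer.tokenize("CC(=O)Oc1ccccc1C(=O)O"))
```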
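
For reference, the salt-removal step described under Training Data can be reproduced with RDKit's standardization utilities. This is a sketch of the general technique, not the exact preprocessing pipeline used for this model:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

chooser = rdMolStandardize.LargestFragmentChooser()

def strip_salts(smiles):
    """Keep only the largest fragment of a (possibly salted) compound."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(chooser.choose(mol))

print(strip_salts("CC(=O)[O-].[Na+]"))  # sodium acetate -> acetate fragment
```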