---
license: apache-2.0
tags:
- pytorch
- gpt2
- protein
pipeline_tag: text-generation
---

# molcrawl-protein-sequence-gpt2-small

## Model Description

GPT-2 small (124M parameters) foundation model pre-trained on protein amino acid sequences from the MolCrawl dataset.

- **Model Type**: gpt2
- **Data Type**: Protein
- **Training Date**: 2026-04-24

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-small")

# Generate protein sequence
prompt = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # HF config.json has legacy eos_token_id=0; disable early stop
        pad_token_id=0,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Source Code

Training pipeline, configuration files, and data preparation scripts are available in the MolCrawl GitHub repository: [https://github.com/mmai-framework-lab/MolCrawl](https://github.com/mmai-framework-lab/MolCrawl)

## License

This model is released under the APACHE-2.0 license.

## Citation

If you use this model, please cite:

```bibtex
@misc{molcrawl_protein_sequence_gpt2_small,
  title={molcrawl-protein-sequence-gpt2-small},
  author={{RIKEN}},
  year={2026},
  publisher={{Hugging Face}},
  url={{https://huggingface.co/kojima-lab/molcrawl-protein-sequence-gpt2-small}}
}
```
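
## Checking Generated Sequences

Because the Usage snippet samples with `do_sample=True` and disables early stopping, it can help to sanity-check the decoded text before using it downstream. The sketch below is one possible approach, not part of the MolCrawl code: `clean_sequence` and `invalid_residues` are hypothetical helper names, and it assumes the decoded string may contain whitespace or characters outside the 20 standard amino acid letters. The exact decoding format of this tokenizer is not documented here.

```python
# Hypothetical helpers for illustration; they are not part of the MolCrawl repository.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(decoded: str) -> str:
    """Remove all whitespace and upper-case the decoded text."""
    return "".join(decoded.split()).upper()


def invalid_residues(sequence: str) -> set[str]:
    """Return the characters that are not one of the 20 standard amino acids."""
    return set(sequence) - STANDARD_AMINO_ACIDS


# Stand-in for the string returned by tokenizer.decode(...) in the Usage section.
decoded = "MKTAYIAKQR QISFVKSHFS\n"
sequence = clean_sequence(decoded)
print(sequence, invalid_residues(sequence))
```

An empty result from `invalid_residues` means every character is a standard residue. A non-empty result flags symbols, such as ambiguity codes or leftover special tokens, that may need handling before further analysis.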