---
license: apache-2.0
tags:
- pytorch
- gpt2
- dna-genome
pipeline_tag: text-generation
---

# molcrawl-genome-sequence-gpt2-small

## Model Description

GPT-2 small (124M parameters) foundation model pre-trained on human genome DNA sequences from the [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) reference assembly.

## Datasets

- **GRCh38 human genome reference assembly**: [https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) (Pre-training corpus)
- **Model Type**: gpt2
- **Data Type**: DNA/Genome
- **Training Date**: 2026-04-24

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-small")

# Generate DNA/genome sequence
prompt = "ATCGATCGATCGATCGATCGATCGATCGATCG"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # HF config.json has legacy eos_token_id=0; disable early stop
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Source Code

Training pipeline, configuration files, and data preparation scripts are available in the MolCrawl GitHub repository: [https://github.com/mmai-framework-lab/MolCrawl](https://github.com/mmai-framework-lab/MolCrawl)

## License

This model is released under the APACHE-2.0 license.

## Citation

If you use this model, please cite:

```bibtex
@misc{molcrawl_genome_sequence_gpt2_small,
  title={molcrawl-genome-sequence-gpt2-small},
  author={{RIKEN}},
  year={2026},
  publisher={{Hugging Face}},
  url={{https://huggingface.co/kojima-lab/molcrawl-genome-sequence-gpt2-small}}
}
```
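
## Checking Generated Output

The Usage example prints a decoded string, and it can be useful to sanity-check that string before using it downstream. The following is a minimal sketch, not part of the MolCrawl codebase: the helper names `clean_dna`, `gc_content`, and `non_dna_symbols` are hypothetical, and it assumes the decoded text may contain whitespace between tokens and that the expected alphabet is A/C/G/T plus N.

```python
def clean_dna(decoded: str) -> str:
    """Remove all whitespace and upper-case a decoded sequence (hypothetical helper)."""
    return "".join(decoded.split()).upper()


def gc_content(seq: str) -> float:
    """Fraction of G/C among the A/C/G/T bases in seq; 0.0 if there are none."""
    bases = [b for b in seq if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def non_dna_symbols(seq: str) -> set:
    """Characters outside the assumed A/C/G/T/N alphabet."""
    return set(seq) - set("ACGTN")


# Stand-in for the string returned by tokenizer.decode(...) in the Usage example.
decoded = "ATCGATCG ATCGATCG"
seq = clean_dna(decoded)
print(seq, gc_content(seq), non_dna_symbols(seq))
```

A non-empty result from `non_dna_symbols` suggests that special tokens or other unexpected characters survived decoding and should be inspected before the sequence is used.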