*This model was released on 2025-07-09 and added to Hugging Face Transformers on 2025-09-18.*
PyTorch FlashAttention SDPA
# FlexOlmo

[FlexOlmo](https://huggingface.co/papers/2507.07024) is a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included in or excluded from model inference with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture in which each expert is trained independently on closed datasets and later integrated through a new domain-informed routing mechanism, without any joint training. FlexOlmo is trained on FlexMix, a curated corpus comprising publicly available datasets alongside seven domain-specific sets that serve as realistic approximations of closed data.

You can find all the original FlexOlmo checkpoints under the [FlexOlmo](https://huggingface.co/collections/allenai/flexolmo-68471177a386b6e20a54c55f) collection.

> [!TIP]
> Click on the FlexOlmo models in the right sidebar for more examples of how to apply FlexOlmo to different language tasks.

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="allenai/FlexOlmo-7x7B-1T",
    dtype=torch.bfloat16,
    device=0,
)

result = pipe("Plants create energy through a process known as")
print(result)
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "allenai/FlexOlmo-7x7B-1T"
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/FlexOlmo-7x7B-1T",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)

output = model.generate(**input_ids, max_length=50, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model allenai/FlexOlmo-7x7B-1T --device 0
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to quantize only the weights to 4-bits.

```py
# pip install torchao
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

torchao_config = TorchAoConfig(
    "int4_weight_only",
    group_size=128
)
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/FlexOlmo-7x7B-1T"
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/FlexOlmo-7x7B-1T",
    quantization_config=torchao_config,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)

output = model.generate(**input_ids, max_length=50, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## FlexOlmoConfig

[[autodoc]] FlexOlmoConfig

## FlexOlmoForCausalLM

[[autodoc]] FlexOlmoForCausalLM

## FlexOlmoModel

[[autodoc]] FlexOlmoModel
    - forward

## FlexOlmoPreTrainedModel

[[autodoc]] FlexOlmoPreTrainedModel
    - forward
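Because FlexOlmo's experts are trained independently and combined through domain-informed routing, the MoE layout (how many experts exist and how many are active per token) is a property of the model configuration. The snippet below is a minimal sketch of inspecting those settings; the attribute names `num_experts` and `num_experts_per_tok` are assumptions borrowed from similar mixture-of-experts configs in Transformers and may differ in `FlexOlmoConfig`, so the code reads them defensively.

```py
from transformers import AutoConfig

# Download only the configuration, not the weights
config = AutoConfig.from_pretrained("allenai/FlexOlmo-7x7B-1T")

# Attribute names below are assumptions based on other MoE configs;
# getattr with a default keeps this runnable even if they differ.
print("total experts:", getattr(config, "num_experts", "n/a"))
print("experts per token:", getattr(config, "num_experts_per_tok", "n/a"))
```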