normistral-11b-long/README.md

---
license: apache-2.0
language:
- nb
- nn
- 'no'
- se
- sv
- da
- en
- is
- fo
base_model:
- norallm/normistral-11b-warm
library_name: transformers
pipeline_tag: text-generation
tags:
- norwegian
- sami
- bokmaal
- nynorsk
---

![](https://huggingface.co/norallm/normistral-11b-warm/resolve/main/images/puffin_2.png)


**NorMistral-11b-long** is a length-extended version of [NorMistral-11b-warm](https://huggingface.co/norallm/normistral-11b-warm). It has been extended to 32,768 context length by continual training on additional 50 billion subword tokens – using a mix of Scandinavian, Sámi, English and code data (four repetitions of open Norwegian texts). The model follows our earlier paper [Small Languages, Big Models: A Study of Continual Training on Languages of Norway](https://arxiv.org/abs/2412.06484) by Samuel et al. 2025, and forms part of the NORA.LLM family developed by [the Language Technology Group at the University of Oslo (LTG)](https://huggingface.co/ltg).

*Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts. It is primarily intended for research purposes.*


## License

We release the model under Apache 2.0 license to indicate that we do not impose any additional constraints on the model weights.
However, we do not own the data in the training collection.


## Pretraining corpus

The model is pretrained on a combination of publicly available data and a custom web crawl for Sámi. The total training corpus consists of 250 billion tokens from the following sources:

1. Norwegian text (Bokmål and Nynorsk); this collection was created by the National Library of Norway and it's a prerelease of an update of NCC (codenamed "Mímir core"). It consists of: a) the public part of [Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) with permissible licenses (i.e. it doesn't include newspaper texts with the CC BY-NC 2.0 license); b) Bokmål and Nynorsk [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX), and c) Bokmål and Nynorsk [HPLT corpus v1.2](https://hplt-project.org/datasets/v1.2).

2. Northern Sámi texts are sourced from a) [Glot500](https://huggingface.co/datasets/cis-lmu/Glot500); b) [the SIKOR North Saami free corpus](https://repo.clarino.uib.no/xmlui/handle/11509/100); and c) a custom web crawl (seeded from Sámi Wikipedia external links) published separately as [`ltg/saami-web`](https://huggingface.co/datasets/ltg/saami-web).

3. Additional languages for knowledge/language transfer: a) Danish, Swedish, Icelandic, and Faroese from CulturaX and Glot500; b) high-quality English from [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu); and c) programming code from [The Stack v2 (the high-quality subset)](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).

The corpus is carefully balanced through strategic upsampling to handle the resource disparity between languages. Following data-constrained scaling laws, the corpus data for target languages is repeated multiple times (up to 16x for low-resource languages) to reach the optimal training budget while avoiding overfitting:

![](https://huggingface.co/norallm/normistral-11b-warm/resolve/main/images/corpus.png)


## Tokenizer

This model uses a new tokenizer, specially trained on the target languages. Therefore it offers substantially faster inference than the original Mistral-Nemo-Base-2407 model. Here are the subword-to-word split ratios across different languages:

| Tokenizer  | # tokens | Bokmål | Nynorsk | Sámi  | Danish | Swedish |
|:------------|:--------:|:--------:|:---------:|:-------:|:--------:|:---------:|
| Mistral-Nemo-Base-2407    | 131072 | 1.79   | 1.87    | 2.63  | 1.82   | 2.00    |
| NorMistral-11b-long | 51200 | 1.22   | 1.28    | 1.82  | 1.33   | 1.39    |


## Model details

**Model Developers:** Language Technology Group at the University of Oslo in collaboration with NORA.LLM.

**Architecture:** NorMistral-11B uses the Mistral architecture based on an improved Llama design, featuring:
- Pre-normalization with RMSNorm
- SwiGLU activation function
- Rotary positional embeddings
- Grouped-query attention
- 40 transformer layers
- Hidden dimension: 5,120
- Intermediate dimension: 14,336
- 32 query heads and 8 key & value heads (dimension 128)
- Vocabulary size: 51,200 tokens
- Total parameters: 11.4 billion

**Training Details:**
- Training tokens: 250 + 50 billion
- Batch size: 128 × 32,768 tokens (# sequences × sequence length)
- Training steps: 12,000


**Base Model:** Initialized from NorMistral-11b-warm

**License:** Apache-2.0


## Example usage

### Basic Causal Language Model Usage

Here's how to use NorMistral-11B as a standard causal language model for translation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Import the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-long")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-long").cuda().eval()

# Define zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# Define tokens that should end the generation (any token with a newline)
eos_token_ids = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if '\n' in tokenizer.decode([token_id])
]

# Generation function
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=eos_token_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Example usage
generate("I'm excited to try this new Norwegian language model!")
# > Expected output: 'Jeg er spent på å prøve denne nye norske språkmodellen!'
```

### Memory-Efficient Loading

For systems with limited VRAM, you can load the model in 8-bit or 4-bit quantization:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-long")

# Load in 8-bit mode (requires ~12GB VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-long",
    device_map='auto',
    load_in_8bit=True,
    torch_dtype=torch.bfloat16
)

# Or load in 4-bit mode (requires ~8GB VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-long",
    device_map='auto',
    load_in_4bit=True,
    torch_dtype=torch.bfloat16
)
```


## Citation

```bibtex
@inproceedings{samuel-etal-2025-small,
    title = "Small Languages, Big Models: {A} Study of Continual Training on Languages of {Norway}",
    author = "Samuel, David  and
      Mikhailov, Vladislav  and
      Velldal, Erik  and
      {\O}vrelid, Lilja  and
      Charpentier, Lucas Georges Gabriel  and
      Kutuzov, Andrey  and
      Oepen, Stephan",
    editor = "Johansson, Richard  and
      Stymne, Sara",
    booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2025.nodalida-1.61/",
    pages = "573--608",
    ISBN = "978-9908-53-109-0",
}
```

## Contact

Please write [a community message](https://huggingface.co/norallm/normistral-11b-long/discussions) or contact David Samuel (davisamu@ifi.uio.no) if you have any questions about this model.