Initialize project; model provided by the ModelHub XC community
Model: norallm/normistral-11b-long Source: Original Platform
---
license: apache-2.0
language:
- nb
- nn
- 'no'
- se
- sv
- da
- en
- is
- fo
base_model:
- norallm/normistral-11b-warm
library_name: transformers
pipeline_tag: text-generation
tags:
- norwegian
- sami
- bokmaal
- nynorsk
---

**NorMistral-11b-long** is a length-extended version of [NorMistral-11b-warm](https://huggingface.co/norallm/normistral-11b-warm). Its context length has been extended to 32,768 tokens by continual training on an additional 50 billion subword tokens, using a mix of Scandinavian, Sámi, English and code data (four repetitions of open Norwegian texts). The model follows our earlier paper [Small Languages, Big Models: A Study of Continual Training on Languages of Norway](https://arxiv.org/abs/2412.06484) (Samuel et al., 2025) and forms part of the NORA.LLM family developed by [the Language Technology Group at the University of Oslo (LTG)](https://huggingface.co/ltg).

*Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions when given inappropriate prompts. It is primarily intended for research purposes.*

## License

We release the model under the Apache 2.0 license to indicate that we do not impose any additional constraints on the model weights.
However, we do not own the data in the training collection.

## Pretraining corpus

The model is pretrained on a combination of publicly available data and a custom web crawl for Sámi. The total training corpus consists of 250 billion tokens from the following sources:

1. Norwegian text (Bokmål and Nynorsk): this collection was created by the National Library of Norway and is a prerelease of an update of NCC (codenamed "Mímir core"). It consists of: a) the public part of the [Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) with permissible licenses (i.e. it doesn't include newspaper texts with the CC BY-NC 2.0 license); b) Bokmål and Nynorsk [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX); and c) Bokmål and Nynorsk [HPLT corpus v1.2](https://hplt-project.org/datasets/v1.2).

2. Northern Sámi texts, sourced from a) [Glot500](https://huggingface.co/datasets/cis-lmu/Glot500); b) [the SIKOR North Saami free corpus](https://repo.clarino.uib.no/xmlui/handle/11509/100); and c) a custom web crawl (seeded from Sámi Wikipedia external links) published separately as [`ltg/saami-web`](https://huggingface.co/datasets/ltg/saami-web).

3. Additional languages for knowledge/language transfer: a) Danish, Swedish, Icelandic, and Faroese from CulturaX and Glot500; b) high-quality English from [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu); and c) programming code from [The Stack v2 (the high-quality subset)](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).

The corpus is balanced by upsampling to compensate for the resource disparity between languages. Following data-constrained scaling laws, the data for the target languages is repeated multiple times (up to 16x for low-resource languages) to reach the optimal training budget while avoiding overfitting.

## Tokenizer

This model uses a new tokenizer, specially trained on the target languages. It splits words into fewer subword tokens than the tokenizer of the original Mistral-Nemo-Base-2407 model, which makes inference substantially faster on these languages. The table shows the subword-to-word split ratios (average number of subword tokens per word; lower is better) across different languages:

| Tokenizer | Vocabulary size | Bokmål | Nynorsk | Sámi | Danish | Swedish |
|:------------|:--------:|:--------:|:---------:|:-------:|:--------:|:---------:|
| Mistral-Nemo-Base-2407 | 131072 | 1.79 | 1.87 | 2.63 | 1.82 | 2.00 |
| NorMistral-11b-long | 51200 | 1.22 | 1.28 | 1.82 | 1.33 | 1.39 |

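
To make the table above concrete, here is a minimal sketch of how a subword-to-word split ratio can be measured, and of what the reported ratios imply for sequence length. It assumes words are separated by whitespace, which may differ from the procedure used for the table. `split_ratio` and `toy_tokenize` are hypothetical helpers written for this illustration, not part of any library.

```python
from typing import Callable, Iterable, List


def split_ratio(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


# Hypothetical stand-in tokenizer: cuts every word into chunks of at most 4 characters.
def toy_tokenize(text: str) -> List[str]:
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


# Ratios copied from the table above: (Mistral-Nemo-Base-2407, NorMistral-11b-long).
TABLE = {
    "Bokmål": (1.79, 1.22),
    "Nynorsk": (1.87, 1.28),
    "Sámi": (2.63, 1.82),
    "Danish": (1.82, 1.33),
    "Swedish": (2.00, 1.39),
}

# How many times more tokens the Mistral-Nemo tokenizer needs for the same text.
relative_length = {lang: old / new for lang, (old, new) in TABLE.items()}

if __name__ == "__main__":
    print(split_ratio(toy_tokenize, ["Jeg er spent på å prøve denne modellen"]))
    for lang, factor in relative_length.items():
        print(f"{lang}: Mistral-Nemo needs {factor:.2f}x as many tokens")
```

To measure a real tokenizer, pass its `tokenize` method (for example `tokenizer.tokenize` of a tokenizer loaded with `AutoTokenizer`) in place of `toy_tokenize`.
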
## Model details

**Model Developers:** Language Technology Group at the University of Oslo in collaboration with NORA.LLM.

**Architecture:** NorMistral-11B uses the Mistral architecture based on an improved Llama design, featuring:
- Pre-normalization with RMSNorm
- SwiGLU activation function
- Rotary positional embeddings
- Grouped-query attention
- 40 transformer layers
- Hidden dimension: 5,120
- Intermediate dimension: 14,336
- 32 query heads and 8 key & value heads (dimension 128)
- Vocabulary size: 51,200 tokens
- Total parameters: 11.4 billion

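
The total parameter count can be cross-checked from the hyperparameters listed above. The sketch below is a back-of-the-envelope calculation, not the authors' code. It assumes linear layers without bias terms and separate (untied) input-embedding and output-projection matrices, neither of which is stated in this card. `count_parameters` is a hypothetical helper.

```python
def count_parameters(
    n_layers: int = 40,
    hidden: int = 5120,
    intermediate: int = 14336,
    n_query_heads: int = 32,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    vocab_size: int = 51200,
) -> int:
    """Parameter count of a Mistral-style decoder.

    Assumptions (not stated in the model card): no bias terms, and untied
    input-embedding and output-projection matrices.
    """
    q_proj = hidden * n_query_heads * head_dim
    kv_proj = 2 * hidden * n_kv_heads * head_dim   # grouped-query attention: fewer K/V heads
    o_proj = n_query_heads * head_dim * hidden
    attention = q_proj + kv_proj + o_proj
    mlp = 3 * hidden * intermediate                # SwiGLU: gate, up and down projections
    norms = 2 * hidden                             # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab_size * hidden           # input embeddings + output head
    final_norm = hidden
    return n_layers * per_layer + embeddings + final_norm


if __name__ == "__main__":
    total = count_parameters()
    print(f"{total:,} parameters (about {total / 1e9:.1f} billion)")
```

Under these assumptions the result is 11,429,893,120, which agrees with the 11.4 billion listed above.
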
**Training Details:**
- Training tokens: 250 billion (pretraining corpus) + 50 billion (length extension)
- Batch size: 128 × 32,768 tokens (# sequences × sequence length)
- Training steps: 12,000

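
The batch size and step count above are consistent with the 50 billion tokens of additional training. A short arithmetic check, assuming every step processes a full batch and that the listed batch size and steps describe the length-extension stage:

```python
SEQUENCES_PER_BATCH = 128
SEQUENCE_LENGTH = 32_768
TRAINING_STEPS = 12_000

# Tokens processed in one optimizer step.
tokens_per_step = SEQUENCES_PER_BATCH * SEQUENCE_LENGTH

# Tokens processed over the whole run.
extension_tokens = tokens_per_step * TRAINING_STEPS

if __name__ == "__main__":
    print(f"{tokens_per_step:,} tokens per step")
    print(f"{extension_tokens:,} tokens in total (about {extension_tokens / 1e9:.1f} billion)")
```
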
**Base Model:** Initialized from NorMistral-11b-warm

**License:** Apache-2.0

## Example usage

### Basic Causal Language Model Usage

Here's how to use NorMistral-11B as a standard causal language model for translation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Import the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-long")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-long").cuda().eval()

# Define zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# Define tokens that should end the generation (any token with a newline)
eos_token_ids = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if '\n' in tokenizer.decode([token_id])
]

# Generation function
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=eos_token_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Example usage
generate("I'm excited to try this new Norwegian language model!")
# > Expected output: 'Jeg er spent på å prøve denne nye norske språkmodellen!'
```

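
With long inputs, the prompt and the generated tokens together have to fit into the 32,768-token context. The following is a minimal sketch of one way to enforce that budget before calling `model.generate`. `fit_to_context` is a hypothetical helper, and dropping the oldest tokens is just one possible policy (an assumption made here, not something this card prescribes).

```python
from typing import List

CONTEXT_LENGTH = 32_768  # context length stated in this model card


def fit_to_context(
    input_ids: List[int],
    max_new_tokens: int,
    context_length: int = CONTEXT_LENGTH,
) -> List[int]:
    """Keep the most recent prompt tokens so that prompt + generation fits the context."""
    if not 0 <= max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be in [0, context_length)")
    budget = context_length - max_new_tokens  # tokens left for the prompt
    return input_ids[-budget:]


if __name__ == "__main__":
    long_prompt = list(range(40_000))  # placeholder token ids
    trimmed = fit_to_context(long_prompt, max_new_tokens=64)
    print(len(trimmed))
```

Note that trimming from the left also removes any special token at the start of the sequence, so a real pipeline may want to re-attach it.
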
### Memory-Efficient Loading

For systems with limited VRAM, you can load the model in 8-bit or 4-bit quantization:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-long")

# Quantized loading needs the bitsandbytes package; device_map='auto' needs accelerate.
# The quantization settings are passed through BitsAndBytesConfig.

# Load in 8-bit mode (requires ~12GB VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-long",
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16
)

# Or load in 4-bit mode (requires ~8GB VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-long",
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16
)
```

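
The VRAM figures quoted above can be related to the model size with a rough estimate. The sketch below computes the memory taken by the weights alone, assuming every one of the 11.4 billion parameters is stored at the given precision. Real usage is higher, because some modules typically stay in higher precision and activations and the key-value cache come on top. `weight_memory_gb` is a hypothetical helper.

```python
TOTAL_PARAMETERS = 11.4e9  # total parameters, from the model details in this card


def weight_memory_gb(bits_per_parameter: float, n_parameters: float = TOTAL_PARAMETERS) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_parameters * bits_per_parameter / 8 / 1e9


if __name__ == "__main__":
    for name, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name}: {weight_memory_gb(bits):.1f} GB for the weights")
```

These lower bounds (about 11.4 GB in 8-bit and 5.7 GB in 4-bit) sit below the ~12GB and ~8GB figures in the comments above, which leaves room for the overheads mentioned.
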
## Citation

```bibtex
@inproceedings{samuel-etal-2025-small,
    title = "Small Languages, Big Models: {A} Study of Continual Training on Languages of {Norway}",
    author = "Samuel, David and
      Mikhailov, Vladislav and
      Velldal, Erik and
      {\O}vrelid, Lilja and
      Charpentier, Lucas Georges Gabriel and
      Kutuzov, Andrey and
      Oepen, Stephan",
    editor = "Johansson, Richard and
      Stymne, Sara",
    booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2025.nodalida-1.61/",
    pages = "573--608",
    ISBN = "978-9908-53-109-0",
}
```

## Contact

Please write [a community message](https://huggingface.co/norallm/normistral-11b-long/discussions) or contact David Samuel (davisamu@ifi.uio.no) if you have any questions about this model.