Initialize project; model provided by the ModelHub XC community

Model: NbAiLab/nb-notram-llama-3.1-8b-instruct
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-10 00:56:10 +08:00
commit 217a1d3567
12 changed files with 413236 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,202 @@
---
language:
- "no" # Generic Norwegian
- "nb" # Norwegian Bokmål
- "nn" # Norwegian Nynorsk
- "en" # English
tags:
- "llama"
- "notram"
- "norwegian"
- "bokmål"
- "nynorsk"
- "multilingual"
- "conversational"
- "text-generation"
pipeline_tag: "text-generation"
license: "llama3.1"
base_model: "meta-llama/Llama-3.1-8B-Instruct"
library_name: "transformers"
---
## Model Card: "nb-notram-llama-3.1-8b-instruct"
### Model overview
"NbAiLab/nb-notram-llama-3.1-8b-instruct" is part of the "NB-Llama-3.x" series (covering "Llama 3.1", "Llama 3.2", and "Llama 3.3" based releases) and the "NoTraM" line of work, trained on top of Metas "Llama-3.1-8B-Instruct":
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
The model is fine-tuned to improve instruction-following behavior in Norwegian Bokmål and Norwegian Nynorsk, while aiming to preserve strong English performance.
This release is an experiment in how far modern open-weight models can be adapted for Norwegian using **only publicly available data**. Although trained at the National Library of Norway, it does **not** include material that is only accessible through legal deposit. It may include public documents (for example governmental reports) that are publicly available and also part of legal deposit collections.
---
### Key features
- **Base model:** "Llama-3.1-8B-Instruct"
- **Languages:**
- Strong: Norwegian Bokmål ("nb"), Norwegian Nynorsk ("nn"), English ("en")
- **Alignment recipe (high level):**
- Primarily supervised fine-tuning ("SFT") for instruction-following and chat formatting.
- A **very light** preference optimization step ("DPO") was applied mainly to stabilize instruction-following (a minimal loss sketch follows this list); note that the starting point ("Llama-3.1-8B-Instruct") is already preference-tuned by the base model provider.
- **Response style:** the model tends to produce **shorter, more concise answers** than many chatty assistants. This reflects the current instruction-tuning recipe and training mix. The behavior can be adjusted with an additional alignment round (for example "GRPO") to encourage more elaborate, conversational responses if desired.
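The preference step uses the standard "DPO" objective. Below is a minimal, illustrative sketch of that loss in plain PyTorch; it is **not** the training code used for this release, and the "beta" value and the per-example log-probability inputs are assumptions made for the example.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (illustrative sketch, not the actual training code).

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of a full response under the policy being trained or
    under the frozen reference model (here, the model before this step).
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (margin between chosen and rejected log-ratios))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```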
---
### Motivation and research framing
Adapting instruction-tuned models to Norwegian can be approached in two broad ways:
1) **Adapt a base model first, then instruction-tune.**
This tends to improve core Norwegian language modeling reliably, but producing a strong instruction-tuned assistant usually requires substantial alignment work and high-quality supervised data.
2) **Start from an instruction-tuned model, then adapt further.**
This leverages general instruction-following behaviors already learned by large multilingual models. In practice, however, it can be difficult to add *generalizable* Norwegian cultural and historical knowledge at this late stage using only supervised instruction data. We have observed a failure mode where new knowledge becomes brittle and overly prompt-dependent—usable in narrow contexts, but not reliably accessible across phrasing and tasks. Internally we refer to this as "knowledge pocketing".
Within the "NoTraM" project, we explore techniques for adapting instruction-tuned models to Norwegian language, culture, and history while explicitly trying to reduce "knowledge pocketing" and improve generalization. This line of work is intentionally distinct from the "NB-GPT" approach, which primarily targets training from scratch or from base models using established pretraining-first recipes.
For smaller languages, fully closed post-training pipelines are rarely reproducible. Public-data approaches are therefore a pragmatic path to improving Norwegian-capable models—while being explicit about limitations and the remaining gap to highly resourced multilingual instruction-tuned systems.
---
### Model details
- **Developer:** "National Library of Norway (NB-AiLab)"
- **Parameters:** "8B"
- **Knowledge cutoff:** "May 2024" (practical guideline; the model may be incomplete or incorrect on specific facts)
- **License:** "Llama 3.1 Community License"
- "https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE"
---
## Intended use
### Suitable for
- Dialogue systems and assistant-style applications in Norwegian ("nb"/"nn") and English ("en")
- Summarization and Q&A in Bokmål or Nynorsk
### Out of scope
- Use in violation of applicable laws or regulations
- High-stakes domains (medical/legal/financial) without additional controls, evaluation, and human oversight
- Reliance on the model as a sole source of truth (it can hallucinate)
---
## How to use
This is a research release. For end-user deployments, we recommend careful evaluation in your target setting. Quantized variants (when provided) typically run faster with minimal loss in quality on many platforms. When fine-tuning instruction-tuned Llama models, best results usually require using the correct "Llama 3.1" chat templates.
### Using "transformers" (pipeline)
```python
import torch
from transformers import pipeline

model_id = "NbAiLab/nb-notram-llama-3.1-8b-instruct"

# Load the model via the text-generation pipeline; device_map="auto"
# places the weights on the available accelerator(s).
pipe = pipeline(
    task="text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Hvem døde på Stiklestad?"},
]

outputs = pipe(messages, max_new_tokens=256)
# The pipeline returns the full conversation; the last message is the
# assistant's reply.
print(outputs[0]["generated_text"][-1])
```
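### Using "transformers" ("AutoModelForCausalLM" + chat template)
For lower-level control, for example when fine-tuning or evaluating, the "Llama 3.1" chat template can be applied explicitly via the tokenizer. The following is a minimal sketch using standard "transformers" APIs; the generation settings are illustrative rather than recommended values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NbAiLab/nb-notram-llama-3.1-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Hvem døde på Stiklestad?"},
]
# Apply the Llama 3.1 chat template and append the assistant prompt header.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```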
---
## Training data
### Overview
Training is based entirely on publicly available datasets and synthetically generated data.
For more details on the base model's pretraining data and data selection, see:
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
### Public datasets (partial use)
- "CulturaX"
https://huggingface.co/datasets/uonlp/CulturaX
- "HPLT monolingual v1.2"
https://huggingface.co/datasets/HPLT/hplt_monolingual_v1_2
- "Norwegian Colossal Corpus (NCC)"
https://huggingface.co/datasets/NCC/Norwegian-Colossal-Corpus
- "Wikipedia"
https://huggingface.co/datasets/wikimedia/wikipedia
### Alignment data sources ("SFT" + light preference optimization)
- "Magpie" (English)
https://huggingface.co/Magpie-Align
- "Anthropic Helpful and Harmless" (used lightly)
https://huggingface.co/datasets/Anthropic/hh-rlhf
- Various synthetic and translated datasets derived from the above
---
## Data selection and quality filtering
Only a small subset of raw web-scale data was used. We used the "FineWeb" approach as inspiration for large-scale web data curation and filtering, and applied similar principles when selecting and filtering public data:
https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
In addition, we trained "Corpus Quality Classifiers" (educational value + linguistic quality) based on "NbAiLab/nb-bert-base" and release them as part of the broader "NB-Llama" effort:
- **Classifier collection:**
https://huggingface.co/collections/NbAiLab/corpus-quality-classifier-673f15926c2774fcc88f23aa
- **What we optimize for:**
- **Educational value:** prioritize content likely to improve reasoning and usefulness.
- **Linguistic quality:** prioritize well-formed, clear language (important for Norwegian norms and orthography).
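As an illustration of how these classifiers can be plugged into a filtering step, the sketch below scores documents with a "text-classification" pipeline. The model name is a placeholder, not a real repository ID; substitute a classifier from the collection linked above, whose exact labels and score scales may differ.

```python
from transformers import pipeline

# Placeholder ID: substitute a real classifier from the collection above.
scorer = pipeline(
    task="text-classification",
    model="NbAiLab/<corpus-quality-classifier>",  # hypothetical name
)

docs = [
    "Noreg er eit land i Nord-Europa med lang kystlinje.",
    "klikk her klikk her klikk her",
]
for doc, result in zip(docs, scorer(docs, truncation=True)):
    # Keep only documents whose predicted quality clears a chosen threshold.
    print(result["label"], round(result["score"], 3), doc[:40])
```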
---
## EU AI Act transparency note
To support transparency obligations under the EU AI Act, this model card documents:
- **Model lineage:** the "base_model" is listed in the metadata and linked above.
- **Primary training data sources:** the main public datasets used (see "Training data" and links).
- **Curation methodology:** we explicitly state that our data selection and filtering is guided by the "FineWeb" approach and we provide a public reference.
- **Filtering tools:** we link to the released "Corpus Quality Classifiers" used to score educational value and linguistic quality.
The training data used for this release is restricted to **publicly available sources** as described above (no legal-deposit-only material).
---
## Limitations and known issues
- The model can produce incorrect statements, fabricated details, or plausible-sounding but wrong explanations.
- Norwegian cultural/historical knowledge may be uneven; some knowledge can appear "pocketed" (prompt-sensitive) depending on topic and phrasing.
- Safety alignment is limited by the scope of the released recipe and data; evaluate carefully for your use case.
---
## Licensing
The model is released under the "Llama 3.1 Community License":
https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
Refer to the "Acceptable Use Policy" for restrictions:
https://llama.meta.com/llama3.1/use-policy
---
## Citing & authors
Model training and documentation: **Per Egil Kummervold**.
---
## Funding and acknowledgement
Training was supported by Google's TPU Research Cloud ("TRC"), which provided Cloud TPUs essential for the computational work.