---
license: apache-2.0
tags:
- automatic-speech-recognition
- fi
- finnish
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-large-fi-150k
model-index:
  - name: wav2vec2-large-fi-150k-finetuned
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Lahjoita puhetta (Donate Speech)
          type: lahjoita-puhetta
          args: fi
        metrics:
          - name: Dev WER
            type: wer
            value: 15.34
          - name: Dev CER
            type: cer
            value: 4.14
          - name: Test WER
            type: wer
            value: 16.86
          - name: Test CER
            type: cer
            value: 5.07
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Finnish Parliament
          type: FinParl
          args: fi
        metrics:
          - name: Dev16 WER
            type: wer
            value: 11.3
          - name: Dev16 CER
            type: cer
            value: 4.75
          - name: Test16 WER
            type: wer
            value: 8.29
          - name: Test16 CER
            type: cer
            value: 3.34
          - name: Test20 WER
            type: wer
            value: 6.94
          - name: Test20 CER
            type: cer
            value: 2.15
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 16.1
          type: mozilla-foundation/common_voice_16_1
          args: fi
        metrics:
        - name: Dev WER
          type: wer
          value: 7.17
        - name: Dev CER
          type: cer
          value: 1.11
        - name: Test WER
          type: wer
          value: 5.86
        - name: Test CER
          type: cer
          value: 0.91
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: FLEURS
          type: google/fleurs
          args: fi_fi
        metrics:
        - name: Dev WER
          type: wer
          value: 9.2
        - name: Dev CER
          type: cer
          value: 5.23
        - name: Test WER
          type: wer
          value: 10.69
        - name: Test CER
          type: cer
          value: 5.79
---

# Finnish Wav2vec2-Large ASR

[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) fine-tuned on 4600 hours of Finnish speech sampled at 16 kHz:
* 1500 hours of [Lahjoita puhetta (Donate Speech)](https://link.springer.com/article/10.1007/s10579-022-09606-3) (colloquial Finnish)
* 3100 hours of the [Finnish Parliament dataset](https://link.springer.com/article/10.1007/s10579-023-09650-7)

When using the model, make sure that your speech input is also sampled at 16 kHz.
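
One quick way to check an input file against this requirement is to read the sampling rate from its header. The sketch below uses only Python's standard-library `wave` module and assumes uncompressed PCM WAV input; the helper names `wav_sample_rate` and `is_16khz_wav` are hypothetical and not part of this model's tooling.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (Hz) stored in the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz_wav(path: str) -> bool:
    """True if the WAV file is already sampled at 16 kHz."""
    return wav_sample_rate(path) == TARGET_SAMPLE_RATE
```

Files that fail the check need to be resampled with an audio library of your choice before transcription.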

## Model description

The Finnish Wav2Vec2 Large has the same architecture and uses the same training objective as the English and multilingual models described in the [wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477).

[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) is a large-scale, 317-million-parameter monolingual model pre-trained on 158k hours of unlabeled Finnish speech, including [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/), Lahjoita puhetta (Donate Speech), Finnish Parliament, and Finnish VoxPopuli.

You can read more about the pre-trained model in [this paper](https://www.isca-archive.org/interspeech_2025/getman25_interspeech.html). The training scripts are available on [GitHub](https://github.com/aalto-speech/large-scale-monolingual-speech-foundation-models).

## Intended uses

You can use this model for Finnish ASR (speech-to-text). 

### How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import Audio, load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")

# load the Finnish Common Voice test split and resample its audio to 16 kHz on decoding
ds = load_dataset("mozilla-foundation/common_voice_16_1", "fi", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# extract input features from the first sample
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
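
The final argmax-and-decode step is greedy CTC decoding: for every audio frame the most likely token ID is taken, consecutive repeats are merged, and blank tokens are dropped. The plain-Python sketch below only illustrates that idea and is not the `transformers` implementation; the function name `ctc_greedy_collapse`, the blank ID, and the toy vocabulary are assumptions made for the example.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeated IDs, then drop the CTC blank token."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Toy vocabulary (hypothetical): ID 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "h", 2: "e", 3: "i"}

# Per-frame argmax IDs as they might come out of torch.argmax(logits, dim=-1).
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3]

text = "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids, blank_id=0))
print(text)  # hei
```

The blank token is what lets CTC emit genuinely doubled letters: two identical IDs separated by a blank survive the merge, while two adjacent identical IDs collapse into one.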

## Citation

If you use our models or scripts, please cite our article as:

```bibtex
@inproceedings{getman25_interspeech,
  title     = {{Is your model big enough? Training and interpreting large-scale monolingual speech foundation models}},
  author    = {Yaroslav Getman and Tamás Grósz and Tommi Lehtonen and Mikko Kurimo},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {231--235},
  doi       = {10.21437/Interspeech.2025-46},
  issn      = {2958-1796},
}
```

## Team Members

- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
- Tamás Grósz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)

Feel free to contact us for more details 🤗