---
license: apache-2.0
tags:
- automatic-speech-recognition
- fi
- finnish
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-large-fi-150k
model-index:
- name: wav2vec2-large-fi-150k-finetuned
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Lahjoita puhetta (Donate Speech)
      type: lahjoita-puhetta
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 15.34
    - name: Dev CER
      type: cer
      value: 4.14
    - name: Test WER
      type: wer
      value: 16.86
    - name: Test CER
      type: cer
      value: 5.07
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Finnish Parliament
      type: FinParl
      args: fi
    metrics:
    - name: Dev16 WER
      type: wer
      value: 11.3
    - name: Dev16 CER
      type: cer
      value: 4.75
    - name: Test16 WER
      type: wer
      value: 8.29
    - name: Test16 CER
      type: cer
      value: 3.34
    - name: Test20 WER
      type: wer
      value: 6.94
    - name: Test20 CER
      type: cer
      value: 2.15
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 7.17
    - name: Dev CER
      type: cer
      value: 1.11
    - name: Test WER
      type: wer
      value: 5.86
    - name: Test CER
      type: cer
      value: 0.91
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      args: fi_fi
    metrics:
    - name: Dev WER
      type: wer
      value: 9.2
    - name: Dev CER
      type: cer
      value: 5.23
    - name: Test WER
      type: wer
      value: 10.69
    - name: Test CER
      type: cer
      value: 5.79
---

# Finnish Wav2vec2-Large ASR

[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) fine-tuned on 4600 hours of Finnish speech sampled at 16 kHz:
* 1500 hours of [Lahjoita puhetta (Donate Speech)](https://link.springer.com/article/10.1007/s10579-022-09606-3) (colloquial Finnish)
* 3100 hours of the [Finnish Parliament dataset](https://link.springer.com/article/10.1007/s10579-023-09650-7)

When using the model, make sure that your speech input is also sampled at 16 kHz.
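
As a quick guard for the 16 kHz requirement, the sketch below reads the sampling rate from a WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or of `transformers`. The sketch assumes uncompressed PCM WAV input; other formats would need a different reader.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) the model was fine-tuned on


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """Return True when the file should be resampled before inference."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns `True`, resample the audio to 16 kHz with an audio library of your choice before passing it to the processor.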

## Model description

The Finnish Wav2Vec2 Large has the same architecture and uses the same training objective as the English and multilingual models described in the [wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477).

[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) is a large-scale, 317-million-parameter monolingual model pre-trained on 158k hours of unlabeled Finnish speech, including [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/), Lahjoita puhetta (Donate Speech), Finnish Parliament, and Finnish VoxPopuli.

You can read more about the pre-trained model in [this paper](https://www.isca-archive.org/interspeech_2025/getman25_interspeech.html). The training scripts are available on [GitHub](https://github.com/aalto-speech/large-scale-monolingual-speech-foundation-models).

## Intended uses

You can use this model for Finnish ASR (speech-to-text).

### How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, Audio
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")

# load dummy dataset and resample its audio to the 16 kHz the model expects
ds = load_dataset("mozilla-foundation/common_voice_16_1", "fi", split='test')
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# extract input features from the raw waveform
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values # Batch size 1

# retrieve logits (gradients are not needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
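
The "take argmax and decode" step above is greedy CTC decoding. The following is a minimal sketch of the idea rather than the tokenizer's actual implementation: repeated frame labels are collapsed, blank labels are dropped, and the word delimiter becomes a space. The toy vocabulary, `BLANK_ID`, and the function name `greedy_ctc_decode` are hypothetical. The sketch assumes the usual Wav2Vec2 convention of using the padding token as the CTC blank and `|` as the word delimiter.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "i"}


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Collapse repeated frame labels, drop blanks, and join the characters."""
    # 1. Merge consecutive duplicates: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the blank label that separates genuine repeats
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Turn word delimiters into spaces
    return "".join(chars).replace(word_delimiter, " ").strip()


# Frame-level argmax ids for a short made-up utterance
frames = [2, 2, 0, 3, 3, 3, 0, 4, 1, 1, 2, 0, 3, 4, 4]
print(greedy_ctc_decode(frames))  # prints "hei hei"
```

Because duplicates are merged before blanks are removed, a blank between two identical labels is what lets a doubled letter survive decoding.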

## Citation

If you use our models or scripts, please cite our article as:

```bibtex
@inproceedings{getman25_interspeech,
  title     = {{Is your model big enough? Training and interpreting large-scale monolingual speech foundation models}},
  author    = {Yaroslav Getman and Tamás Grósz and Tommi Lehtonen and Mikko Kurimo},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {231--235},
  doi       = {10.21437/Interspeech.2025-46},
  issn      = {2958-1796},
}
```

## Team Members

- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
- Tamas Grosz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)

Feel free to contact us for more details 🤗