Initialize project; model provided by the ModelHub XC community

Model: GetmanY1/wav2vec2-large-fi-150k-finetuned Source: Original Platform
2026-05-12 22:56:36 +08:00
commit 8181966075
8 changed files with 423 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,170 @@
+---
+license: apache-2.0
+tags:
+- automatic-speech-recognition
+- fi
+- finnish
+library_name: transformers
+language: fi
+base_model:
+- GetmanY1/wav2vec2-large-fi-150k
+model-index:
+  - name: wav2vec2-large-fi-150k-finetuned
+    results:
+      - task:
+          name: Automatic Speech Recognition
+          type: automatic-speech-recognition
+        dataset:
+          name: Lahjoita puhetta (Donate Speech)
+          type: lahjoita-puhetta
+          args: fi
+        metrics:
+          - name: Dev WER
+            type: wer
+            value: 15.34
+          - name: Dev CER
+            type: cer
+            value: 4.14
+          - name: Test WER
+            type: wer
+            value: 16.86
+          - name: Test CER
+            type: cer
+            value: 5.07
+      - task:
+          name: Automatic Speech Recognition
+          type: automatic-speech-recognition
+        dataset:
+          name: Finnish Parliament
+          type: FinParl
+          args: fi
+        metrics:
+          - name: Dev16 WER
+            type: wer
+            value: 11.3
+          - name: Dev16 CER
+            type: cer
+            value: 4.75
+          - name: Test16 WER
+            type: wer
+            value: 8.29
+          - name: Test16 CER
+            type: cer
+            value: 3.34
+          - name: Test20 WER
+            type: wer
+            value: 6.94
+          - name: Test20 CER
+            type: cer
+            value: 2.15
+      - task:
+          name: Automatic Speech Recognition
+          type: automatic-speech-recognition
+        dataset:
+          name: Common Voice 16.1
+          type: mozilla-foundation/common_voice_16_1
+          args: fi
+        metrics:
+        - name: Dev WER
+          type: wer
+          value: 7.17
+        - name: Dev CER
+          type: cer
+          value: 1.11
+        - name: Test WER
+          type: wer
+          value: 5.86
+        - name: Test CER
+          type: cer
+          value: 0.91
+      - task:
+          name: Automatic Speech Recognition
+          type: automatic-speech-recognition
+        dataset:
+          name: FLEURS
+          type: google/fleurs
+          args: fi_fi
+        metrics:
+        - name: Dev WER
+          type: wer
+          value: 9.2
+        - name: Dev CER
+          type: cer
+          value: 5.23
+        - name: Test WER
+          type: wer
+          value: 10.69
+        - name: Test CER
+          type: cer
+          value: 5.79
+---
+
+# Finnish Wav2vec2-Large ASR
+
+[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) fine-tuned on 4600 hours of Finnish speech sampled at 16 kHz:
+* 1500 hours of [Lahjoita puhetta (Donate Speech)](https://link.springer.com/article/10.1007/s10579-022-09606-3) (colloquial Finnish)
+* 3100 hours of the [Finnish Parliament dataset](https://link.springer.com/article/10.1007/s10579-023-09650-7)
+
+When using the model, make sure that your speech input is also sampled at 16 kHz.
+
+## Model description
+
+The Finnish Wav2Vec2 Large has the same architecture and uses the same training objective as the English and multilingual models described in the [wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477).
+
+[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) is a large-scale, 317-million-parameter monolingual model pre-trained on 158k hours of unlabeled Finnish speech, including [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/), Lahjoita puhetta (Donate Speech), Finnish Parliament, and Finnish VoxPopuli.
+
+You can read more about the pre-trained model in [this paper](https://www.isca-archive.org/interspeech_2025/getman25_interspeech.html). The training scripts are available on [GitHub](https://github.com/aalto-speech/large-scale-monolingual-speech-foundation-models).
+
+## Intended uses
+
+You can use this model for Finnish ASR (speech-to-text).
+
+### How to use
+
+To transcribe audio files, use the model as a standalone acoustic model as follows:
+
+```python
+from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+from datasets import load_dataset, Audio
+import torch
+
+# load model and processor
+processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")
+model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")
+
+# load a sample dataset and resample its audio to the 16 kHz the model expects
+ds = load_dataset("mozilla-foundation/common_voice_16_1", "fi", split="test")
+ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
+
+# extract input values (batch size 1)
+input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
+
+# retrieve logits without tracking gradients, then take argmax and decode
+with torch.no_grad():
+    logits = model(input_values).logits
+predicted_ids = torch.argmax(logits, dim=-1)
+transcription = processor.batch_decode(predicted_ids)
+```
+
+## Citation
+
+If you use our models or scripts, please cite our article as:
+
+```bibtex
+@inproceedings{getman25_interspeech,
+  title     = {{Is your model big enough? Training and interpreting large-scale monolingual speech foundation models}},
+  author    = {Yaroslav Getman and Tamás Grósz and Tommi Lehtonen and Mikko Kurimo},
+  year      = {2025},
+  booktitle = {Interspeech 2025},
+  pages     = {231--235},
+  doi       = {10.21437/Interspeech.2025-46},
+  issn      = {2958-1796},
+}
+```
+
+## Team Members
+
+- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
+- Tamas Grosz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)
+
+Feel free to contact us for more details 🤗