language: sv
arxiv: https://arxiv.org/abs/2205.03026
datasets:
- common_voice
- NST_Swedish_ASR_Database
- P4
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- hf-asr-leaderboard
license: cc0-1.0
model-index:
- name: Wav2vec 2.0 large VoxRex Swedish
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice
      type: common_voice
      args: sv-SE
    metrics:
    - name: Test WER
      type: wer
      value: 8.49

Wav2vec 2.0 large VoxRex Swedish (C)

Fine-tuned version of KB's VoxRex large model using Swedish radio broadcasts, NST and Common Voice data. Evaluated without a language model, the WER on the NST + Common Voice test set (2% of total sentences) is 2.5% and the WER on the Common Voice test set is 8.49%. Adding a 4-gram language model lowers the Common Voice WER to 7.37%.
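
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate (WER) can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch applies no text normalization, so it will not necessarily reproduce the figures reported above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("det var en gång", "det var en sång"))
```

A WER of 8.49% therefore corresponds to roughly 8 to 9 word-level errors per 100 reference words.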

When using this model, make sure that your speech input is sampled at 16 kHz.
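
As a rough illustration of that check, the sketch below reads the sampling rate from a file header using only the Python standard library. It assumes uncompressed PCM WAV input (compressed formats such as MP3 need an audio library instead), and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate this model card asks for


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns `True`, the audio should be resampled to 16 kHz before it is passed to the processor.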

Update 2022-01-10: Updated to VoxRex-C version.

Update 2022-05-16: The paper is available at https://arxiv.org/abs/2205.03026.

Performance*

[Figure: Comparison]

*Chart shows performance without the additional 20k steps of Common Voice fine-tuning

Training

This model was fine-tuned for 120,000 updates on NST + Common Voice and then for an additional 20,000 updates on Common Voice only. The additional fine-tuning on Common Voice slightly hurts performance on the NST + Common Voice test set and, unsurprisingly, improves it on the Common Voice test set. Overall, though, it seems to perform better in general [citation needed].

[Figure: WER during training]

Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Swedish Common Voice test split.
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")

# The resampler assumes 48 kHz input and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token for every frame.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
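
To make the last step less opaque, here is a minimal sketch of greedy CTC decoding, which is roughly what taking the per-frame `argmax` and then calling `batch_decode` amounts to: collapse repeated ids, drop the blank token, and turn word delimiters into spaces. The function name `ctc_greedy_decode` and the toy vocabulary are hypothetical. Using the padding token as the CTC blank and `|` as the word delimiter is an assumption about the tokenizer configuration, not something stated in this card.

```python
from itertools import groupby
from typing import Dict, List


def ctc_greedy_decode(
    frame_ids: List[int],
    id_to_token: Dict[int, str],
    blank_id: int,
    word_delimiter: str = "|",
) -> str:
    """Decode one sequence of per-frame token ids into text."""
    # 1. Collapse runs of identical ids (one sound spans many frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated characters.
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Join characters and turn word delimiters into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


if __name__ == "__main__":
    # Hypothetical toy vocabulary; the real model has its own.
    vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j"}
    frames = [2, 2, 0, 3, 3, 3, 0, 4, 1, 1, 2, 0, 3, 4]
    print(ctc_greedy_decode(frames, vocab, blank_id=0))
```

Decoding with a language model, as in the 7.37% WER result, replaces this per-frame choice with a search over candidate sequences and is not shown here.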

Citation

https://arxiv.org/abs/2205.03026

@inproceedings{malmsten2022hearing,
  title={Hearing voices at the national library : a speech corpus and acoustic model for the Swedish language},
  author={Malmsten, Martin and Haffenden, Chris and B{\"o}rjeson, Love},
  booktitle={Proceeding of Fonetik 2022 : Speech, Music and Hearing Quarterly Progress and Status Report, TMH-QPSR},
  volume={3},
  year={2022}
}