wav2vec2-large-ru-golos/README.md
ModelHub XC b766ca751b Initialize project; model provided by the ModelHub XC community
Model: bond005/wav2vec2-large-ru-golos
Source: Original Platform
2026-05-08 11:35:50 +08:00

---
datasets:
- SberDevices/Golos
- bond005/sova_rudevices
- bond005/rulibrispeech
language: ru
license: apache-2.0
metrics:
- wer
- cer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
widget:
- example_title: test sound with Russian speech "нейросети это хорошо"
  src: https://huggingface.co/bond005/wav2vec2-large-ru-golos/resolve/main/test_sound_ru.flac
model-index:
- name: XLSR Wav2Vec2 Russian by Ivan Bondarenko
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Sberdevices Golos (crowd)
      type: SberDevices/Golos
      args: ru
    metrics:
    - type: wer
      value: 10.144
      name: Test WER
    - type: cer
      value: 2.168
      name: Test CER
    # the next two values match the "farfield" results in the Evaluation section
    - type: wer
      value: 20.353
      name: Test WER
    - type: cer
      value: 6.03
      name: Test CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice ru
      type: common_voice
      args: ru
    metrics:
    - type: wer
      value: 18.548
      name: Test WER
    - type: cer
      value: 4.0
      name: Test CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Sova RuDevices
      type: bond005/sova_rudevices
      args: ru
    metrics:
    - type: wer
      value: 25.41
      name: Test WER
    - type: cer
      value: 7.965
      name: Test CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - type: wer
      value: 21.872
      name: Test WER
    - type: cer
      value: 4.469
      name: Test CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Voxforge Ru
      type: dangrebenkin/voxforge-ru-dataset
      args: ru
    metrics:
    - type: wer
      value: 27.084
      name: Test WER
    - type: cer
      value: 6.986
      name: Test CER
---

Wav2Vec2-Large-Ru-Golos

This model is a component of the Pisets speech-to-text system, presented in the paper Pisets: A Robust Speech Recognition System for Lectures and Interviews.

The source code for the Pisets system is available on GitHub: bond005/pisets.

The Wav2Vec2 model is based on facebook/wav2vec2-large-xlsr-53 and fine-tuned on Russian speech from Sberdevices Golos, with audio augmentations such as pitch shifting, speeding up or slowing down the audio, and reverberation.

When using this model, make sure that your speech input is sampled at 16 kHz.
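
If a recording uses a different sampling rate, it has to be resampled before it is passed to the processor. The following is a minimal sketch of one way to do that, assuming NumPy and SciPy are installed and the audio is a mono waveform. The helper name `resample_to_16k` is illustrative and is not part of this model's code.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate expected by the model, in Hz


def resample_to_16k(waveform: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (hypothetical helper)."""
    if source_rate == TARGET_RATE:
        return waveform
    # Reduce the up/down factors so that the polyphase filter stays small.
    divisor = gcd(TARGET_RATE, source_rate)
    up, down = TARGET_RATE // divisor, source_rate // divisor
    return resample_poly(waveform, up, down).astype(np.float32)


# One second of a 440 Hz tone recorded at 48 kHz.
t = np.arange(48_000) / 48_000
tone_48k = np.sin(2 * np.pi * 440 * t).astype(np.float32)
tone_16k = resample_to_16k(tone_48k, 48_000)
print(tone_16k.shape)
```

The resampled array can then be passed to the processor together with `sampling_rate=16000`.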

Usage

To transcribe audio files the model can be used as a standalone acoustic model as follows:

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch
 
# load the model and its processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos")
     
# load the test split of the Golos Crowd dataset; the first recording is used below
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
sample = ds[0]["audio"]

# extract input features (the model expects 16 kHz audio)
processed = processor(
    sample["array"], sampling_rate=sample["sampling_rate"],
    return_tensors="pt", padding="longest"
)  # batch size 1

# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(processed.input_values, attention_mask=processed.attention_mask).logits
 
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
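
The "take argmax and decode" step above is greedy CTC decoding: choose the most likely token for each frame, merge consecutive repeats, drop the blank token, and turn word delimiters into spaces. The toy sketch below illustrates the idea in plain Python. The vocabulary, the blank index, and the function name are invented for illustration and do not reflect the model's real vocabulary or the internals of `batch_decode`.

```python
from typing import List, Sequence

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank.
VOCAB = ["<pad>", "|", "д", "а", "н", "е", "т"]
BLANK_ID = 0
WORD_DELIMITER = "|"


def greedy_ctc_decode(frame_ids: Sequence[int]) -> str:
    """Collapse repeated frame labels, remove blanks, and join into text."""
    chars: List[str] = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != BLANK_ID:
            chars.append(VOCAB[token_id])
        previous = token_id
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax output for a short utterance (made-up values).
frames = [2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5, 6]
print(greedy_ctc_decode(frames))
```

A blank between two identical labels keeps them as separate characters, which is how CTC represents doubled letters.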

Evaluation

This code snippet shows how to evaluate bond005/wav2vec2-large-ru-golos on the "crowd" and "farfield" test sets of the Golos dataset.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer, cer  # we need word error rate (WER) and character error rate (CER)

# load the test part of Golos Crowd and remove samples with empty "true" transcriptions
golos_crowd_test = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
golos_crowd_test = golos_crowd_test.filter(
    lambda it1: (it1["transcription"] is not None) and (len(it1["transcription"].strip()) > 0)
)

# load the test part of Golos Farfield and remove samples with empty "true" transcriptions
golos_farfield_test = load_dataset("bond005/sberdevices_golos_100h_farfield", split="test")
golos_farfield_test = golos_farfield_test.filter(
    lambda it2: (it2["transcription"] is not None) and (len(it2["transcription"].strip()) > 0)
)

# load the model under evaluation and its processor
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")

# recognize one sound
def map_to_pred(batch):
    # tokenize and vectorize
    processed = processor(
        batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt", padding="longest"
    )
    input_values = processed.input_values.to("cuda")
    attention_mask = processed.attention_mask.to("cuda")

    # recognize
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # decode
    transcription = processor.batch_decode(predicted_ids)
    batch["text"] = transcription[0]
    return batch

# calculate WER and CER on the crowd domain (jiwer returns fractions, so scale to percent)
crowd_result = golos_crowd_test.map(map_to_pred, remove_columns=["audio"])
crowd_wer = 100.0 * wer(crowd_result["transcription"], crowd_result["text"])
crowd_cer = 100.0 * cer(crowd_result["transcription"], crowd_result["text"])
print("Word error rate on the Crowd domain (%):", crowd_wer)
print("Character error rate on the Crowd domain (%):", crowd_cer)

# calculate WER and CER on the farfield domain
farfield_result = golos_farfield_test.map(map_to_pred, remove_columns=["audio"])
farfield_wer = 100.0 * wer(farfield_result["transcription"], farfield_result["text"])
farfield_cer = 100.0 * cer(farfield_result["transcription"], farfield_result["text"])
print("Word error rate on the Farfield domain (%):", farfield_wer)
print("Character error rate on the Farfield domain (%):", farfield_cer)

Results (%):

Metric  "crowd"  "farfield"
WER     10.144   20.353
CER     2.168    6.030
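
For readers unfamiliar with these metrics: both are an edit (Levenshtein) distance divided by the reference length, counted over words for WER and over characters for CER. The sketch below shows that definition for a single reference and hypothesis pair in plain Python. It is an illustration only, not jiwer's implementation, and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate_percent(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the reference length, in percent."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "нейросети это хорошо"
hypothesis = "нейросети эта хорошо"  # one wrong letter in the second word

word_error = error_rate_percent(reference.split(), hypothesis.split())
char_error = error_rate_percent(reference, hypothesis)
print(f"WER: {word_error:.2f} %, CER: {char_error:.2f} %")
```

A single wrong letter costs one whole word in WER but only one character in CER, which is why the CER values are consistently much lower than the WER values.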

The evaluation script for other datasets, including Russian Librispeech and SOVA RuDevices, is available on my Kaggle page: https://www.kaggle.com/code/bond005/wav2vec2-ru-eval

Citation

To cite this model, you can use the following BibTeX entry:

@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/wav2vec2-large-ru-golos}},
  year={2022}
}