ModelHub XC 59ae840aff Initialize project; model provided by the ModelHub XC community
Model: bond005/whisper-large-v3-ru-podlodka
Source: Original Platform
2026-05-12 12:23:54 +08:00

datasets:
  - bond005/taiga_speech_v2
  - bond005/podlodka_speech
  - bond005/rulibrispeech
language: ru
license: apache-2.0
metrics: wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
  # English: "Neural networks are good!"
  - example_title: Нейронные сети - это хорошо!
    src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_ru.flac
  # English: "Unfortunately, the speech recognition system is not always stable, especially in noisy conditions."
  - example_title: К сожалению, система распознавания речи не всегда стабильна, особенно в шумных условиях.
    src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_with_noise.wav
  # English: "The boy walked past the theater quite often: a white cake of a building, topped with whipped cream."
  - example_title: Мимо театра мальчик ходил довольно часто — белое, со взбитыми сливками, здание-торт.
    src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/anna_matveeva_test.wav
model-index:
  - name: Whisper Large V3 Russian Podlodka by Ivan Bondarenko
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Podlodka.io
          type: bond005/podlodka_speech
          args: ru
        metrics:
          - type: wer
            value: 20.91
            name: WER (with punctuation and capital letters)
          - type: wer
            value: 10.987
            name: WER (without punctuation)
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Russian Librispeech
          type: bond005/rulibrispeech
          args: ru
        metrics:
          - type: wer
            value: 9.795
            name: WER (without punctuation)

Whisper Large V3 Russian Podlodka

This repository contains a fine-tuned Whisper Large V3 model for Russian speech recognition. It serves as the core transcription component of the Pisets system, specifically optimized for long audio recordings such as lectures and interviews.

The model was presented in the paper "Pisets: A Robust Speech Recognition System for Lectures and Interviews".
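
For a quick check of the checkpoint outside the full Pisets system, a minimal sketch using the Hugging Face `transformers` ASR pipeline might look like the following. It assumes `transformers`, a PyTorch backend, and `ffmpeg` are installed. The helper names `build_recognizer` and `transcribe` and the 30-second chunk length are illustrative choices, not something specified by the model card.

```python
MODEL_ID = "bond005/whisper-large-v3-ru-podlodka"


def build_recognizer(model_id: str = MODEL_ID, chunk_length_s: float = 30.0):
    """Create an ASR pipeline (hypothetical helper; downloads weights on first call)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,  # assumption: process long audio in 30 s windows
    )


def transcribe(recognizer, audio_path: str) -> str:
    """Transcribe one audio file with the given pipeline and return the stripped text."""
    result = recognizer(
        audio_path,
        return_timestamps=True,  # also exposes per-chunk timestamps in result["chunks"]
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
    return result["text"].strip()
```

Calling `transcribe(build_recognizer(), "lecture.wav")`, where `lecture.wav` is a placeholder path, would download the model weights on first use. This sketch runs Whisper alone and does not reproduce the segmentation and filtering stages of Pisets.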

System Architecture

The Pisets system implements a three-component architecture to improve recognition accuracy while minimizing hallucinations:

  1. Wav2Vec2: For primary recognition and segmentation.
  2. Audio Spectrogram Transformer (AST): For filtering non-speech segments.
  3. Whisper (this model): For the final high-quality transcription.
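
The data flow through the three components above can be sketched as follows. This is not the Pisets implementation: the `Segment` type and the three callables (`segment_fn`, `is_speech_fn`, `transcribe_fn`) are hypothetical stand-ins for the Wav2Vec2, AST, and Whisper stages, and the sketch only illustrates the order in which the stages are applied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str = ""


def run_pipeline(
    audio: Any,
    segment_fn: Callable[[Any], Iterable[Segment]],   # stand-in for Wav2Vec2 segmentation
    is_speech_fn: Callable[[Any, Segment], bool],     # stand-in for the AST speech filter
    transcribe_fn: Callable[[Any, Segment], str],     # stand-in for Whisper transcription
) -> List[Segment]:
    """Segment the audio, drop non-speech segments, then transcribe what remains."""
    results: List[Segment] = []
    for seg in segment_fn(audio):
        if not is_speech_fn(audio, seg):
            continue  # non-speech never reaches the transcriber
        text = transcribe_fn(audio, seg).strip()
        if text:
            results.append(Segment(seg.start, seg.end, text))
    return results
```

Filtering before the final stage means the transcriber is never run on silence or noise, which is where the source says the design aims to reduce hallucinations.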

Implementation

The complete source code and instructions for using the system (including generation of SRT and DocX files) can be found in the GitHub repository:

GitHub: https://github.com/bond005/pisets
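
To illustrate what SRT output involves, here is a minimal sketch that turns timestamped segments into SubRip text. It is not taken from the Pisets repository: the function names and the `(start, end, text)` tuple layout are assumptions made for this example.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)
```

Each cue consists of a running index, a time range, and the text, with a blank line between cues. The segment boundaries could come from any timestamped transcription result.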

Citation

If you use this model or the Pisets system in your research, please cite:

@article{bondarenko2026pisets,
  title={Pisets: A Robust Speech Recognition System for Lectures and Interviews},
  author={Ivan Bondarenko},
  journal={arXiv preprint arXiv:2601.18415},
  year={2026}
}