---
datasets:
- bond005/taiga_speech_v2
- bond005/podlodka_speech
- bond005/rulibrispeech
language:
- ru
license: apache-2.0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
- example_title: Нейронные сети - это хорошо!
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_ru.flac
- example_title: К сожалению, система распознавания речи не всегда стабильна, особенно
    в шумных условиях.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_with_noise.wav
- example_title: Мимо театра мальчик ходил довольно часто — белое, со взбитыми сливками,
    здание-торт.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/anna_matveeva_test.wav
model-index:
- name: Whisper Large V3 Russian Podlodka by Ivan Bondarenko
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Podlodka.io
      type: bond005/podlodka_speech
      args: ru
    metrics:
    - type: wer
      value: 20.91
      name: WER (with punctuation and capital letters)
    - type: wer
      value: 10.987
      name: WER (without punctuation)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - type: wer
      value: 9.795
      name: WER (without punctuation)
---

# Whisper Large V3 Russian Podlodka

This repository contains a Whisper Large V3 model fine-tuned for Russian speech recognition. It is the core transcription component of the **Pisets** system and is optimized for long audio recordings such as lectures and interviews.

The model was presented in the paper [Pisets: A Robust Speech Recognition System for Lectures and Interviews](https://huggingface.co/papers/2601.18415).

## System Architecture

The Pisets system combines three components to improve recognition accuracy while reducing hallucinations:
1. **Wav2Vec2**: For primary recognition and segmentation.
2. **Audio Spectrogram Transformer (AST)**: For filtering non-speech segments.
3. **Whisper (this model)**: For the final high-quality transcription.
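
The data flow between the three components can be pictured with a minimal sketch. This is not the Pisets implementation: the function `run_pipeline`, the `Utterance` type, and the three callables (`segment`, `is_speech`, `transcribe`) are hypothetical stand-ins for Wav2Vec2, AST, and Whisper, and the strict segment, filter, transcribe ordering is an assumption drawn from the list above. The sketch also ignores the primary transcription that Wav2Vec2 produces.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

# A segment is a (start_sec, end_sec) pair; this representation is an assumption.
Segment = Tuple[float, float]


@dataclass
class Utterance:
    start: float
    end: float
    text: str


def run_pipeline(
    audio: Any,
    segment: Callable[[Any], Iterable[Segment]],        # stand-in for Wav2Vec2 segmentation
    is_speech: Callable[[Any, float, float], bool],     # stand-in for the AST speech filter
    transcribe: Callable[[Any, float, float], str],     # stand-in for Whisper (this model)
) -> List[Utterance]:
    """Hypothetical three-stage flow: segment, drop non-speech, transcribe."""
    results: List[Utterance] = []
    for start, end in segment(audio):
        # Skipping non-speech segments keeps the transcriber from being asked
        # to produce text for silence or noise.
        if not is_speech(audio, start, end):
            continue
        text = transcribe(audio, start, end).strip()
        if text:
            results.append(Utterance(start, end, text))
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular model-loading code; the actual interfaces are defined in the Pisets repository.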

## Implementation

The complete source code and instructions for using the system (including generation of SRT and DocX files) can be found in the GitHub repository:

**GitHub:** [https://github.com/bond005/pisets](https://github.com/bond005/pisets)
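
For a quick experiment with this model alone, without the rest of Pisets, the following sketch uses the `transformers` ASR pipeline and formats the timestamped chunks as SRT. It is an illustration under assumptions, not the code from the repository: the helpers `srt_timestamp`, `chunks_to_srt`, and `transcribe_to_srt` are hypothetical names, the 30-second chunk length is an arbitrary choice, and the model ID is taken from the widget URLs in this card.

```python
from typing import Dict, Iterable

# Model ID as it appears in this card's widget URLs.
MODEL_ID = "bond005/whisper-large-v3-ru-podlodka"


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: Iterable[Dict]) -> str:
    """Convert chunks shaped like {"timestamp": (start, end), "text": str} to SRT."""
    blocks = []
    index = 1
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        if start is None or not text:
            continue  # skip chunks without a usable start time or text
        if end is None:
            end = start  # the last chunk may lack an end time
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
        index += 1
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str) -> str:
    """Transcribe one file with this model only (downloads the weights on first use)."""
    from transformers import pipeline  # imported lazily; requires `transformers` and `torch`

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,  # assumed value for long-form audio
    )
    result = asr(
        audio_path,
        return_timestamps=True,
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
    return chunks_to_srt(result["chunks"])
```

Calling `transcribe_to_srt("lecture.wav")` (a hypothetical file name) returns the subtitle text as a string. Because this path skips the Wav2Vec2 segmentation and the AST filter, it should not be expected to match the output of the full system.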

## Citation

If you use this model or the Pisets system in your research, please cite:

```bibtex
@article{bondarenko2026pisets,
  title={Pisets: A Robust Speech Recognition System for Lectures and Interviews},
  author={Ivan Bondarenko},
  journal={arXiv preprint arXiv:2601.18415},
  year={2026}
}
```