---
datasets:
- bond005/taiga_speech_v2
- bond005/podlodka_speech
- bond005/rulibrispeech
language:
- ru
license: apache-2.0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
- example_title: Нейронные сети - это хорошо!
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_ru.flac
- example_title: К сожалению, система распознавания речи не всегда стабильна, особенно
    в шумных условиях.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_with_noise.wav
- example_title: Мимо театра мальчик ходил довольно часто — белое, со взбитыми сливками,
    здание-торт.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/anna_matveeva_test.wav
model-index:
- name: Whisper Large V3 Russian Podlodka by Ivan Bondarenko
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Podlodka.io
      type: bond005/podlodka_speech
      args: ru
    metrics:
    - type: wer
      value: 20.91
      name: WER (with punctuation and capital letters)
    - type: wer
      value: 10.987
      name: WER (without punctuation)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - type: wer
      value: 9.795
      name: WER (without punctuation)
---
# Whisper Large V3 Russian Podlodka
This repository contains a fine-tuned Whisper Large V3 model for Russian speech recognition. It serves as the core transcription component of the **Pisets** system, specifically optimized for long audio recordings such as lectures and interviews.
The model was presented in the paper [Pisets: A Robust Speech Recognition System for Lectures and Interviews](https://huggingface.co/papers/2601.18415).
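
For a quick check of the checkpoint on its own, outside the full Pisets system, the following is a minimal sketch. It assumes the model loads through the standard `transformers` ASR pipeline (suggested by `library_name: transformers` in the metadata) and that the model ID is `bond005/whisper-large-v3-ru-podlodka`, as inferred from the widget URLs. The helper names `build_asr` and `transcribe` are illustrative and are not part of this repository.

```python
MODEL_ID = "bond005/whisper-large-v3-ru-podlodka"  # assumed from the widget URLs


def build_asr(model_id: str = MODEL_ID):
    """Create a transformers ASR pipeline for the fine-tuned checkpoint."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(asr, audio_path: str) -> str:
    """Transcribe one audio file and return the recognized text."""
    result = asr(
        audio_path,
        # Whisper needs timestamp prediction for audio longer than 30 seconds.
        return_timestamps=True,
        # Pin the language and task instead of relying on auto-detection.
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
    return result["text"].strip()
```

Calling `transcribe(build_asr(), "test_sound_ru.flac")` on a locally downloaded copy of one of the widget samples should return the recognized Russian text. This runs Whisper alone, without the segmentation and non-speech filtering stages described below.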
## System Architecture
The Pisets system implements a three-component architecture to improve recognition accuracy while minimizing hallucinations:
1. **Wav2Vec2**: For primary recognition and segmentation.
2. **Audio Spectrogram Transformer (AST)**: For filtering non-speech segments.
3. **Whisper (this model)**: For the final high-quality transcription.
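
The control flow of the three stages above can be sketched as follows. This is only an illustration under stated assumptions: the stage interfaces (`segment`, `is_speech`, `transcribe`), the `Segment` type, and the `run_pipeline` name are hypothetical, and the sketch uses only the segmentation output of the first stage. The actual implementation is in the Pisets repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Audio = Sequence[float]  # mono samples; a placeholder type for this sketch


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str = ""


def run_pipeline(
    audio: Audio,
    segment: Callable[[Audio], List[Tuple[float, float]]],
    is_speech: Callable[[Audio, float, float], bool],
    transcribe: Callable[[Audio, float, float], str],
) -> List[Segment]:
    """Chain the three stages: segment, drop non-speech, transcribe the rest."""
    results: List[Segment] = []
    for start, end in segment(audio):         # stage 1 (Wav2Vec2 in Pisets)
        if not is_speech(audio, start, end):  # stage 2 (AST in Pisets)
            continue                          # skipped spans never reach Whisper
        text = transcribe(audio, start, end)  # stage 3 (Whisper, this model)
        results.append(Segment(start, end, text))
    return results
```

Filtering before the final transcription means that spans rejected as non-speech are never passed to Whisper, which is one plausible way such a design limits hallucinated text on silence or noise.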
## Implementation
The complete source code and usage instructions for the system, including generation of SRT and DocX files, are available in the GitHub repository:
**GitHub:** [https://github.com/bond005/pisets](https://github.com/bond005/pisets)
## Citation
If you use this model or the Pisets system in your research, please cite:
```bibtex
@article{bondarenko2026pisets,
title={Pisets: A Robust Speech Recognition System for Lectures and Interviews},
author={Ivan Bondarenko},
journal={arXiv preprint arXiv:2601.18415},
year={2026}
}
```