---
datasets:
- bond005/taiga_speech_v2
- bond005/podlodka_speech
- bond005/rulibrispeech
language:
- ru
license: apache-2.0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
- example_title: Нейронные сети - это хорошо!
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_ru.flac
- example_title: К сожалению, система распознавания речи не всегда стабильна, особенно
    в шумных условиях.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_with_noise.wav
- example_title: Мимо театра мальчик ходил довольно часто — белое, со взбитыми сливками,
    здание-торт.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/anna_matveeva_test.wav
model-index:
- name: Whisper Large V3 Russian Podlodka by Ivan Bondarenko
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Podlodka.io
      type: bond005/podlodka_speech
      args: ru
    metrics:
    - type: wer
      value: 20.91
      name: WER (with punctuation and capital letters)
    - type: wer
      value: 10.987
      name: WER (without punctuation)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - type: wer
      value: 9.795
      name: WER (without punctuation)
---

# Whisper Large V3 Russian Podlodka

This repository contains a fine-tuned Whisper Large V3 model for Russian speech recognition. It serves as the core transcription component of the **Pisets** system, specifically optimized for long audio recordings such as lectures and interviews.

The model was presented in the paper [Pisets: A Robust Speech Recognition System for Lectures and Interviews](https://huggingface.co/papers/2601.18415).
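
For standalone use outside the full Pisets system, the checkpoint can likely be loaded through the `transformers` speech recognition pipeline. The following is a minimal sketch rather than the author's reference code: the model ID is inferred from the widget URLs in this card, the helper name `transcribe` is hypothetical, and the 30-second chunking and the explicit language hint are assumptions that may need tuning.

```python
# Model ID inferred from the widget URLs in this card (assumption).
MODEL_ID = "bond005/whisper-large-v3-ru-podlodka"


def transcribe(audio_path: str, device: str = "cpu") -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so this module can be loaded without transformers
    # (and a backend such as PyTorch) being installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,  # assumption: process long recordings in 30 s windows
        device=device,
    )
    result = asr(
        audio_path,
        # Assumption: pin the language and task instead of relying on auto-detection.
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("test_sound_ru.flac")` downloads the weights on first use, and decoding from a file path requires ffmpeg to be available. For lectures and interviews, the full Pisets system described below is the intended route; this snippet only exercises the Whisper component on its own.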

## System Architecture

The Pisets system implements a three-component architecture to improve recognition accuracy while minimizing hallucinations:
1. **Wav2Vec2**: For primary recognition and segmentation.
2. **Audio Spectrogram Transformer (AST)**: For filtering non-speech segments.
3. **Whisper (this model)**: For the final high-quality transcription.
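
The control flow implied by the three components above can be sketched as follows. This is an illustration under stated assumptions, not the Pisets implementation: `run_three_stage`, `Segment`, and the three callables are hypothetical names, and the stand-in lambdas at the bottom replace the real Wav2Vec2, AST, and Whisper models with toy logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Audio = Sequence[float]   # mono samples
Span = Tuple[int, int]    # (start_sample, end_sample)


@dataclass
class Segment:
    start: int
    end: int
    text: str


def run_three_stage(
    audio: Audio,
    segment_fn: Callable[[Audio], List[Span]],
    is_speech_fn: Callable[[Audio], bool],
    transcribe_fn: Callable[[Audio], str],
) -> List[Segment]:
    """Chain segmentation, non-speech filtering, and final transcription."""
    results: List[Segment] = []
    for start, end in segment_fn(audio):      # stage 1: segmentation (Wav2Vec2 role)
        chunk = audio[start:end]
        if not is_speech_fn(chunk):           # stage 2: non-speech filter (AST role)
            continue                          # rejected chunks never reach stage 3
        text = transcribe_fn(chunk).strip()   # stage 3: transcription (Whisper role)
        if text:
            results.append(Segment(start, end, text))
    return results


# Toy stand-ins for the three models, used only to show the control flow.
audio = [0.0] * 4 + [0.5] * 4
segments = run_three_stage(
    audio,
    segment_fn=lambda a: [(0, 4), (4, 8)],
    is_speech_fn=lambda chunk: max(abs(x) for x in chunk) > 0.1,
    transcribe_fn=lambda chunk: "text",
)
```

In this sketch the silent first chunk is dropped by the filter, so only the second chunk is transcribed. Keeping non-speech audio away from the final transcriber is one plausible way such a design could reduce hallucinations; the actual Pisets logic lives in the GitHub repository linked below.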

## Implementation

The complete source code and instructions for using the system (including generation of SRT and DocX files) can be found in the GitHub repository:

**GitHub:** [https://github.com/bond005/pisets](https://github.com/bond005/pisets)

## Citation

If you use this model or the Pisets system in your research, please cite:

```bibtex
@article{bondarenko2026pisets,
  title={Pisets: A Robust Speech Recognition System for Lectures and Interviews},
  author={Ivan Bondarenko},
  journal={arXiv preprint arXiv:2601.18415},
  year={2026}
}
```