Initialize the project; model provided by the ModelHub XC community
Model: bond005/whisper-large-v3-ru-podlodka Source: Original Platform
---
datasets:
- bond005/taiga_speech_v2
- bond005/podlodka_speech
- bond005/rulibrispeech
language:
- ru
license: apache-2.0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
- example_title: Нейронные сети - это хорошо!
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_ru.flac
- example_title: К сожалению, система распознавания речи не всегда стабильна, особенно
    в шумных условиях.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_with_noise.wav
- example_title: Мимо театра мальчик ходил довольно часто — белое, со взбитыми сливками,
    здание-торт.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/anna_matveeva_test.wav
model-index:
- name: Whisper Large V3 Russian Podlodka by Ivan Bondarenko
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Podlodka.io
      type: bond005/podlodka_speech
      args: ru
    metrics:
    - type: wer
      value: 20.91
      name: WER (with punctuation and capital letters)
    - type: wer
      value: 10.987
      name: WER (without punctuation)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - type: wer
      value: 9.795
      name: WER (without punctuation)
---

# Whisper Large V3 Russian Podlodka

This repository contains a fine-tuned Whisper Large V3 model for Russian speech recognition. It serves as the core transcription component of the **Pisets** system, specifically optimized for long audio recordings such as lectures and interviews.

The model was presented in the paper [Pisets: A Robust Speech Recognition System for Lectures and Interviews](https://huggingface.co/papers/2601.18415).
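
Since the metadata lists `transformers` as the library and `automatic-speech-recognition` as the pipeline tag, the checkpoint can presumably be loaded on its own through the standard `transformers` ASR pipeline. The following is a minimal sketch rather than the authors' recommended setup: the helper names (`build_asr_kwargs`, `transcribe`), the 30-second chunk length, and the `generate_kwargs` values are assumptions chosen for illustration, and decoding a file path typically requires `ffmpeg` to be installed.

```python
from typing import Any, Dict

# Model identifier taken from this repository's metadata.
MODEL_ID = "bond005/whisper-large-v3-ru-podlodka"


def build_asr_kwargs(chunk_length_s: float = 30.0) -> Dict[str, Any]:
    """Collect arguments for transformers.pipeline.

    The chunk length is an assumed default for long recordings,
    not a value documented by the model authors.
    """
    return {
        "task": "automatic-speech-recognition",
        "model": MODEL_ID,
        "chunk_length_s": chunk_length_s,
    }


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so that build_asr_kwargs works without transformers installed.
    from transformers import pipeline

    asr = pipeline(**build_asr_kwargs())
    # Force Russian transcription instead of relying on language detection (assumption).
    result = asr(
        audio_path,
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("test_sound_ru.flac")` on a local copy of one of the widget samples would download the checkpoint on first use. This standalone usage bypasses the other Pisets components, so results on long recordings may differ from those of the full system.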

## System Architecture

The Pisets system implements a three-component architecture to improve recognition accuracy while minimizing hallucinations:

1. **Wav2Vec2**: For primary recognition and segmentation.
2. **Audio Spectrogram Transformer (AST)**: For filtering non-speech segments.
3. **Whisper (this model)**: For the final high-quality transcription.
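
The flow of the three components listed above can be sketched as a small orchestration function. This is only an illustration of the segment, filter, transcribe order described here, not the Pisets implementation: `Segment`, `run_three_stage`, and the three callables are hypothetical names, and the real models are replaced by whatever functions the caller passes in.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = Sequence[float]  # raw samples; the representation is an assumption


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str = ""


def run_three_stage(
    audio: Audio,
    segmenter: Callable[[Audio], List[Segment]],    # stands in for Wav2Vec2
    is_speech: Callable[[Audio, Segment], bool],    # stands in for AST
    transcriber: Callable[[Audio, Segment], str],   # stands in for Whisper
) -> List[Segment]:
    """Segment the audio, drop non-speech segments, transcribe the rest."""
    results: List[Segment] = []
    for seg in segmenter(audio):
        if not is_speech(audio, seg):
            # Skipping non-speech audio keeps it away from the final
            # recognizer, which is where hallucinations could arise.
            continue
        results.append(Segment(seg.start, seg.end, transcriber(audio, seg)))
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular model loading code; in practice each callable would wrap the corresponding model.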

## Implementation

The complete source code and instructions for using the system (including generation of SRT and DocX files) can be found in the GitHub repository:

**GitHub:** [https://github.com/bond005/pisets](https://github.com/bond005/pisets)
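
For readers who only need subtitles from timestamped segments, the SRT format itself is simple to produce. The sketch below is not taken from the Pisets code base; `format_timestamp` and `to_srt` are hypothetical helpers, and the input is assumed to be a list of `(start_seconds, end_seconds, text)` tuples.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

The output of `to_srt` can be written to a `.srt` file with UTF-8 encoding; for DocX output and the full workflow, refer to the repository instructions.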

## Citation

If you use this model or the Pisets system in your research, please cite:

```bibtex
@article{bondarenko2026pisets,
  title={Pisets: A Robust Speech Recognition System for Lectures and Interviews},
  author={Ivan Bondarenko},
  journal={arXiv preprint arXiv:2601.18415},
  year={2026}
}
```