94 lines
2.3 KiB
Markdown
94 lines
2.3 KiB
Markdown
|
|
---
|
||
|
|
license: cc-by-4.0
|
||
|
|
tags:
|
||
|
|
- audio
|
||
|
|
- automatic-speech-recognition
|
||
|
|
- hf-asr-leaderboard
|
||
|
|
language: et
|
||
|
|
model-index:
|
||
|
|
- name: xls-r-300m-et
|
||
|
|
results:
|
||
|
|
- task:
|
||
|
|
name: Automatic Speech Recognition
|
||
|
|
type: automatic-speech-recognition
|
||
|
|
dataset:
|
||
|
|
name: Common Voice
|
||
|
|
type: common_voice
|
||
|
|
args: et
|
||
|
|
metrics:
|
||
|
|
- name: Test WER
|
||
|
|
type: wer
|
||
|
|
value: 12.520395591222402
|
||
|
|
- name: Test CER
|
||
|
|
type: cer
|
||
|
|
value: 2.7091152438624897
|
||
|
|
- task:
|
||
|
|
name: Automatic Speech Recognition
|
||
|
|
type: automatic-speech-recognition
|
||
|
|
dataset:
|
||
|
|
name: Common Voice 8
|
||
|
|
type: mozilla-foundation/common_voice_8_0
|
||
|
|
args: et
|
||
|
|
metrics:
|
||
|
|
- name: Test WER
|
||
|
|
type: wer
|
||
|
|
value: 13.38447882323104
|
||
|
|
- name: Test CER
|
||
|
|
type: cer
|
||
|
|
value: 2.9816686199500255
|
||
|
|
---
|
||
|
|
|
||
|
|
|
||
|
|
# XLS-R-300m-ET
|
||
|
|
|
||
|
|
This is a XLS-R-300M model [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) finetuned on around 800 hours of diverse Estonian data.
|
||
|
|
|
||
|
|
## Model description
|
||
|
|
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech. It consists of only the CTC-based end-to-end model, no language model is currently provided.
|
||
|
|
|
||
|
|
## Intended uses & limitations
|
||
|
|
|
||
|
|
This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.
|
||
|
|
|
||
|
|
## How to use
|
||
|
|
|
||
|
|
|
||
|
|
TODO
|
||
|
|
|
||
|
|
#### Limitations and bias
|
||
|
|
|
||
|
|
Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:
|
||
|
|
* Speech containing technical and other domain-specific terms
|
||
|
|
* Children's speech
|
||
|
|
* Non-native speech
|
||
|
|
* Speech recorded under very noisy conditions or with a microphone far from the speaker
|
||
|
|
* Very spontaneous and overlapping speech
|
||
|
|
|
||
|
|
## Training data
|
||
|
|
Acoustic training data:
|
||
|
|
|
||
|
|
| Type | Amount (h) |
|
||
|
|
|-----------------------|:------:|
|
||
|
|
| Broadcast speech | 591 |
|
||
|
|
| Spontaneous speech | 53 |
|
||
|
|
| Elderly speech corpus | 53 |
|
||
|
|
| Talks, lectures | 49 |
|
||
|
|
| Parliament speeches | 31 |
|
||
|
|
| *Total* | *761* |
|
||
|
|
|
||
|
|
|
||
|
|
## Training procedure
|
||
|
|
|
||
|
|
Finetuned using Fairseq.
|
||
|
|
|
||
|
|
## Evaluation results
|
||
|
|
|
||
|
|
### WER
|
||
|
|
|
||
|
|
|Dataset | WER |
|
||
|
|
|---|---|
|
||
|
|
| jutusaated.devset | 7.9 |
|
||
|
|
| jutusaated.testset | 6.1 |
|
||
|
|
| Common Voice 6.1 | 12.5 |
|
||
|
|
| Common Voice 8.0 | 13.4 |
|