Initialize project; model provided by the ModelHub XC community
Model: TalTechNLP/xls-r-300m-et Source: Original Platform
27
.gitattributes
vendored
Normal file
@@ -0,0 +1,27 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
93
README.md
Normal file
@@ -0,0 +1,93 @@
---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: xls-r-300m-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice
      type: common_voice
      args: et
    metrics:
    - name: Test WER
      type: wer
      value: 12.520395591222402
    - name: Test CER
      type: cer
      value: 2.7091152438624897
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      args: et
    metrics:
    - name: Test WER
      type: wer
      value: 13.38447882323104
    - name: Test CER
      type: cer
      value: 2.9816686199500255
---

# XLS-R-300m-ET

This is an XLS-R-300M model ([facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m)) fine-tuned on around 800 hours of diverse Estonian data.

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech. It consists only of the CTC-based end-to-end model; no language model is currently provided.
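
Without a language model, a transcript is read directly off the CTC output. The following is a minimal sketch of greedy CTC decoding, assuming that the blank symbol is `<pad>` (id 0) and that `|` is the word delimiter, as the tokenizer files in this repository suggest. The function name `greedy_ctc_decode` and the toy vocabulary are illustrative only and are not part of the model's code.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame labels, drop blanks, and join tokens into text."""
    # 1) Merge runs of identical ids: CTC emits one label per frame,
    #    so a single character usually spans several consecutive frames.
    collapsed = [label for label, _ in groupby(frame_ids)]
    # 2) Remove the blank symbol (assumed here to be <pad>, id 0).
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3) Turn the word delimiter into spaces and normalise whitespace.
    text = "".join(tokens).replace(word_delimiter, " ")
    return " ".join(text.split())


# Toy subset of the vocabulary (ids taken from vocab.json).
toy_vocab = {0: "<pad>", 4: "|", 5: "e", 9: "t", 17: "r"}
frames = [9, 9, 0, 5, 0, 17, 17, 5, 4, 4, 0]
print(greedy_ctc_decode(frames, toy_vocab))  # tere
```

A blank between two identical labels keeps them apart, which is how doubled letters survive the collapsing step.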

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, and talks.

## How to use

TODO
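
Until official instructions are added, the sketch below shows one plausible way to run the model with the `transformers` library. It is not the authors' documented procedure. It assumes that the checkpoint can be loaded from the Hugging Face Hub under the id `TalTechNLP/xls-r-300m-et`, that `torch`, `torchaudio`, and `transformers` are installed, and that the input is resampled to 16 kHz to match `preprocessor_config.json`. The function name `transcribe` and the path `sample.wav` are hypothetical.

```python
MODEL_ID = "TalTechNLP/xls-r-300m-et"  # assumed Hub id, taken from the commit header
TARGET_SAMPLE_RATE = 16_000  # matches "sampling_rate" in preprocessor_config.json


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file with greedy CTC decoding (no language model)."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

    # torchaudio returns a (channels, samples) tensor and the file's sample rate.
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0)  # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, TARGET_SAMPLE_RATE
        )

    inputs = processor(
        waveform.numpy(), sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(
            inputs.input_values, attention_mask=inputs.attention_mask
        ).logits

    # Greedy decoding: pick the most likely token in every frame.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]


# Example call ("sample.wav" is a hypothetical path):
# print(transcribe("sample.wav"))
```

For more than a few files, it would make sense to load the processor and model once and reuse them, rather than reloading them on every call as this sketch does.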

#### Limitations and bias

Because this model was trained mostly on broadcast speech and texts from the web, it may have trouble correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 591        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| *Total*               | *761*      |

## Training procedure

Fine-tuned using Fairseq.

## Evaluation results

### WER

| Dataset | WER (%) |
|---|---|
| jutusaated.devset | 7.9 |
| jutusaated.testset | 6.1 |
| Common Voice 6.1 | 12.5 |
| Common Voice 8.0 | 13.4 |
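
The word error rates above are percentages. As a minimal sketch of how such a figure is computed, the function below divides the word-level edit distance by the number of reference words. This is not the evaluation script used for this model; text normalisation such as casing and punctuation handling is left out, and the name `word_error_rate` and the example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("tere tulemast koju", "tere tulemas koju")
print(f"{100 * wer:.1f}")  # 33.3
```

Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.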
107
config.json
Normal file
@@ -0,0 +1,107 @@
{
  "_name_or_path": "facebook/wav2vec2-xls-r-300m",
  "activation_dropout": 0.1,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 768,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": true,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.0,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.0,
  "hidden_act": "gelu",
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.0,
  "mask_feature_length": 64,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.25,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.75,
  "model_type": "wav2vec2",
  "num_adapter_layers": 3,
  "num_attention_heads": 16,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 24,
  "num_negatives": 100,
  "output_hidden_size": 1024,
  "pad_token_id": 0,
  "proj_codevector_dim": 768,
  "tdnn_dilation": [
    1,
    2,
    3,
    1,
    1
  ],
  "tdnn_dim": [
    512,
    512,
    512,
    512,
    1500
  ],
  "tdnn_kernel": [
    5,
    3,
    3,
    1,
    1
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.15.0",
  "use_weighted_layer_sum": false,
  "vocab_size": 197,
  "xvector_output_dim": 512
}
9
preprocessor_config.json
Normal file
@@ -0,0 +1,9 @@
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
3
pytorch_model.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3ebb8bbc9f56a42858c9b69ed4b7a1b308ae623d7f8f1fc38788965dc638b106
size 1262725489
1
special_tokens_map.json
Normal file
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
1
tokenizer_config.json
Normal file
@@ -0,0 +1 @@
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>", "do_lower_case": false, "word_delimiter_token": "|", "tokenizer_class": "Wav2Vec2CTCTokenizer"}
1
vocab.json
Normal file
@@ -0,0 +1 @@
{"<s>": 1, "<pad>": 0, "</s>": 2, "<unk>": 3, "|": 4, "e": 5, "a": 6, "i": 7, "s": 8, "t": 9, "l": 10, "n": 11, "u": 12, "k": 13, "o": 14, "m": 15, "d": 16, "r": 17, "v": 18, "g": 19, "j": 20, "h": 21, "p": 22, "ä": 23, "õ": 24, "ü": 25, "b": 26, "ö": 27, "-": 28, "E": 29, "f": 30, "M": 31, "T": 32, "S": 33, "K": 34, "A": 35, "L": 36, "R": 37, "V": 38, "P": 39, "N": 40, "I": 41, "H": 42, "J": 43, "c": 44, "O": 45, "B": 46, "U": 47, "y": 48, "C": 49, "G": 50, "š": 51, "Ü": 52, "D": 53, "'": 54, "w": 55, "F": 56, "ž": 57, "z": 58, "W": 59, "x": 60, "Y": 61, "Õ": 62, ".": 63, "Z": 64, "Ö": 65, "Š": 66, "Q": 67, "X": 68, "q": 69, "Ä": 70, "3": 71, "1": 72, "„": 73, "0": 74, "4": 75, "Ž": 76, "9": 77, "5": 78, "\"": 79, "2": 80, "O2": 81, "é": 82, "č": 83, "@": 84, "8": 85, "6": 86, "о": 87, "e2": 88, "а": 89, "l2": 90, "т": 91, "и": 92, "12": 93, "ˇ": 94, "л": 95, "н": 96, "ú": 97, "ē": 98, "V2": 99, "я": 100, "B2": 101, "с": 102, "S2": 103, "ç": 104, "ā": 105, "ć": 106, "ц": 107, "47": 108, "´": 109, "á": 110, "æ": 111, "ë": 112, "í": 113, "ø": 114, "ė": 115, "г": 116, "д": 117, "з": 118, "к": 119, "7": 120, "87": 121, "97": 122, "É": 123, "à": 124, "ð": 125, "ğ": 126, "́": 127, "К": 128, "е": 129, "р": 130, "у": 131, "+": 132, ",": 133, "02": 134, "37": 135, "U2": 136, "e´": 137, "g2": 138, "n´": 139, "è": 140, "ą": 141, "ī": 142, "ņ": 143, "ū": 144, "С": 145, "й": 146, "ф": 147, "ш": 148, "ы": 149, "э": 150, "/": 151, "07": 152, "72": 153, ">": 154, "A2": 155, "C2": 156, "F2": 157, "G7": 158, "K2": 159, "P2": 160, "a2": 161, "o2": 162, "r2": 163, "s2": 164, "y2": 165, "y´": 166, "§": 167, "Å": 168, "Ø": 169, "ß": 170, "å": 171, "ó": 172, "ô": 173, "þ": 174, "Ā": 175, "ķ": 176, "ł": 177, "Ś": 178, "ś": 179, "А": 180, "В": 181, "И": 182, "М": 183, "Р": 184, "У": 185, "б": 186, "в": 187, "м": 188, "п": 189, "х": 190, "ч": 191, "ь": 192, "ю": 193, "–": 194, "”": 195, "": 196}