Initialize project; model provided by the ModelHub XC community
Model: dbdmg/wav2vec2-xls-r-300m-italian-robust Source: Original Platform
29
.gitattributes
vendored
Normal file
@@ -0,0 +1,29 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
language_model/5gram-it-ds-ytsv2.arpa filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text
1
.gitignore
vendored
Normal file
@@ -0,0 +1 @@
checkpoint-*/
302
README.md
Normal file
@@ -0,0 +1,302 @@
---
language: it
license: apache-2.0
tags:
- automatic-speech-recognition
- generated_from_trainer
- hf-asr-leaderboard
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_7_0
base_model: facebook/wav2vec2-xls-r-300m
model-index:
- name: XLS-R-300m - Italian
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 7
      type: mozilla-foundation/common_voice_7_0
      args: it
    metrics:
    - type: wer
      value: 17.17
      name: Test WER
    - type: cer
      value: 4.27
      name: Test CER
    - type: wer
      value: 12.07
      name: Test WER (+LM)
    - type: cer
      value: 3.52
      name: Test CER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: it
    metrics:
    - type: wer
      value: 24.29
      name: Test WER
    - type: cer
      value: 8.1
      name: Test CER
    - type: wer
      value: 17.36
      name: Test WER (+LM)
    - type: cer
      value: 7.94
      name: Test CER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Test Data
      type: speech-recognition-community-v2/eval_data
      args: it
    metrics:
    - type: wer
      value: 33.66
      name: Test WER
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# wav2vec2-xls-r-300m-italian-robust

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the Italian splits of the following datasets:
- Mozilla Foundation Common Voice V7 dataset
- [LibriSpeech multilingual](http://www.openslr.org/94)
- [TED multilingual](https://www.openslr.org/100/)
- [Voxforge](http://www.voxforge.org/it/Downloads)
- [M-AILABS Speech Dataset](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)
- [EuroParl-ST](https://www.mllp.upv.es/europarl-st/)
- [EMOVO](http://voice.fub.it/activities/corpora/emovo/index.html)
- [MSPKA](http://www.mspkacorpus.it/)

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10.0
- mixed_precision_training: Native AMP

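
The `lr_scheduler_type: linear` and `lr_scheduler_warmup_steps: 500` settings describe a learning rate that ramps up linearly to 0.0003 over the first 500 steps and then decays linearly. The following is a minimal sketch of that shape, assuming the decay reaches zero at the final step. The function name and the `total_steps=72_500` figure are illustrative assumptions: the training log ends at step 72400, but the exact total is not stated.

```python
def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 3e-4,      # learning_rate from the list above
    warmup_steps: int = 500,    # lr_scheduler_warmup_steps from the list above
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: grow from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    TOTAL_STEPS = 72_500  # assumption, not taken from the model card
    for s in (0, 250, 500, 36_500, 72_500):
        print(s, linear_schedule_lr(s, TOTAL_STEPS))
```

Under these assumptions the peak rate is reached exactly at step 500, which is early compared with the roughly 7,250 steps that make up one epoch in the log.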
### Training results

| Training Loss | Epoch | Step | Validation Loss | Wer |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| No log | 0.06 | 400 | 0.7508 | 0.7354 |
| 2.3127 | 0.11 | 800 | 0.5888 | 0.5882 |
| 0.7256 | 0.17 | 1200 | 0.5121 | 0.5247 |
| 0.6692 | 0.22 | 1600 | 0.4774 | 0.5028 |
| 0.6384 | 0.28 | 2000 | 0.4832 | 0.4885 |
| 0.6384 | 0.33 | 2400 | 0.4410 | 0.4581 |
| 0.6199 | 0.39 | 2800 | 0.4160 | 0.4331 |
| 0.5972 | 0.44 | 3200 | 0.4136 | 0.4275 |
| 0.6048 | 0.5 | 3600 | 0.4362 | 0.4538 |
| 0.5627 | 0.55 | 4000 | 0.4313 | 0.4469 |
| 0.5627 | 0.61 | 4400 | 0.4425 | 0.4579 |
| 0.5855 | 0.66 | 4800 | 0.3859 | 0.4133 |
| 0.5702 | 0.72 | 5200 | 0.3974 | 0.4097 |
| 0.55 | 0.77 | 5600 | 0.3931 | 0.4134 |
| 0.5624 | 0.83 | 6000 | 0.3900 | 0.4126 |
| 0.5624 | 0.88 | 6400 | 0.3622 | 0.3899 |
| 0.5615 | 0.94 | 6800 | 0.3755 | 0.4067 |
| 0.5472 | 0.99 | 7200 | 0.3980 | 0.4284 |
| 0.5663 | 1.05 | 7600 | 0.3553 | 0.3782 |
| 0.5189 | 1.1 | 8000 | 0.3538 | 0.3726 |
| 0.5189 | 1.16 | 8400 | 0.3425 | 0.3624 |
| 0.518 | 1.21 | 8800 | 0.3431 | 0.3651 |
| 0.5399 | 1.27 | 9200 | 0.3442 | 0.3573 |
| 0.5303 | 1.32 | 9600 | 0.3241 | 0.3404 |
| 0.5043 | 1.38 | 10000 | 0.3175 | 0.3378 |
| 0.5043 | 1.43 | 10400 | 0.3265 | 0.3501 |
| 0.4968 | 1.49 | 10800 | 0.3539 | 0.3703 |
| 0.5102 | 1.54 | 11200 | 0.3323 | 0.3506 |
| 0.5008 | 1.6 | 11600 | 0.3188 | 0.3433 |
| 0.4996 | 1.65 | 12000 | 0.3162 | 0.3388 |
| 0.4996 | 1.71 | 12400 | 0.3353 | 0.3552 |
| 0.5007 | 1.76 | 12800 | 0.3152 | 0.3317 |
| 0.4956 | 1.82 | 13200 | 0.3207 | 0.3430 |
| 0.5205 | 1.87 | 13600 | 0.3239 | 0.3430 |
| 0.4829 | 1.93 | 14000 | 0.3134 | 0.3266 |
| 0.4829 | 1.98 | 14400 | 0.3039 | 0.3291 |
| 0.5251 | 2.04 | 14800 | 0.2944 | 0.3169 |
| 0.4872 | 2.09 | 15200 | 0.3061 | 0.3228 |
| 0.4805 | 2.15 | 15600 | 0.3034 | 0.3152 |
| 0.4949 | 2.2 | 16000 | 0.2896 | 0.3066 |
| 0.4949 | 2.26 | 16400 | 0.3059 | 0.3344 |
| 0.468 | 2.31 | 16800 | 0.2932 | 0.3111 |
| 0.4637 | 2.37 | 17200 | 0.2890 | 0.3074 |
| 0.4638 | 2.42 | 17600 | 0.2893 | 0.3112 |
| 0.4728 | 2.48 | 18000 | 0.2832 | 0.3013 |
| 0.4728 | 2.54 | 18400 | 0.2921 | 0.3065 |
| 0.456 | 2.59 | 18800 | 0.2961 | 0.3104 |
| 0.4628 | 2.65 | 19200 | 0.2886 | 0.3109 |
| 0.4534 | 2.7 | 19600 | 0.2828 | 0.3020 |
| 0.4578 | 2.76 | 20000 | 0.2805 | 0.3026 |
| 0.4578 | 2.81 | 20400 | 0.2796 | 0.2987 |
| 0.4702 | 2.87 | 20800 | 0.2748 | 0.2906 |
| 0.4487 | 2.92 | 21200 | 0.2819 | 0.3008 |
| 0.4411 | 2.98 | 21600 | 0.2722 | 0.2868 |
| 0.4631 | 3.03 | 22000 | 0.2814 | 0.2974 |
| 0.4631 | 3.09 | 22400 | 0.2762 | 0.2894 |
| 0.4591 | 3.14 | 22800 | 0.2802 | 0.2980 |
| 0.4349 | 3.2 | 23200 | 0.2748 | 0.2951 |
| 0.4339 | 3.25 | 23600 | 0.2792 | 0.2927 |
| 0.4254 | 3.31 | 24000 | 0.2712 | 0.2911 |
| 0.4254 | 3.36 | 24400 | 0.2719 | 0.2892 |
| 0.4317 | 3.42 | 24800 | 0.2686 | 0.2861 |
| 0.4282 | 3.47 | 25200 | 0.2632 | 0.2861 |
| 0.4262 | 3.53 | 25600 | 0.2633 | 0.2817 |
| 0.4162 | 3.58 | 26000 | 0.2561 | 0.2765 |
| 0.4162 | 3.64 | 26400 | 0.2613 | 0.2847 |
| 0.414 | 3.69 | 26800 | 0.2679 | 0.2824 |
| 0.4132 | 3.75 | 27200 | 0.2569 | 0.2813 |
| 0.405 | 3.8 | 27600 | 0.2589 | 0.2785 |
| 0.4128 | 3.86 | 28000 | 0.2611 | 0.2714 |
| 0.4128 | 3.91 | 28400 | 0.2548 | 0.2731 |
| 0.4174 | 3.97 | 28800 | 0.2574 | 0.2716 |
| 0.421 | 4.02 | 29200 | 0.2529 | 0.2700 |
| 0.4109 | 4.08 | 29600 | 0.2547 | 0.2682 |
| 0.4027 | 4.13 | 30000 | 0.2578 | 0.2758 |
| 0.4027 | 4.19 | 30400 | 0.2511 | 0.2715 |
| 0.4075 | 4.24 | 30800 | 0.2507 | 0.2601 |
| 0.3947 | 4.3 | 31200 | 0.2552 | 0.2711 |
| 0.4042 | 4.35 | 31600 | 0.2530 | 0.2695 |
| 0.3907 | 4.41 | 32000 | 0.2543 | 0.2738 |
| 0.3907 | 4.46 | 32400 | 0.2491 | 0.2629 |
| 0.3895 | 4.52 | 32800 | 0.2471 | 0.2611 |
| 0.3901 | 4.57 | 33200 | 0.2404 | 0.2559 |
| 0.3818 | 4.63 | 33600 | 0.2378 | 0.2583 |
| 0.3831 | 4.68 | 34000 | 0.2341 | 0.2499 |
| 0.3831 | 4.74 | 34400 | 0.2379 | 0.2560 |
| 0.3808 | 4.79 | 34800 | 0.2418 | 0.2553 |
| 0.4015 | 4.85 | 35200 | 0.2378 | 0.2565 |
| 0.407 | 4.9 | 35600 | 0.2375 | 0.2535 |
| 0.38 | 4.96 | 36000 | 0.2329 | 0.2451 |
| 0.38 | 5.02 | 36400 | 0.2541 | 0.2737 |
| 0.3753 | 5.07 | 36800 | 0.2475 | 0.2580 |
| 0.3701 | 5.13 | 37200 | 0.2356 | 0.2484 |
| 0.3627 | 5.18 | 37600 | 0.2422 | 0.2552 |
| 0.3652 | 5.24 | 38000 | 0.2353 | 0.2518 |
| 0.3652 | 5.29 | 38400 | 0.2328 | 0.2452 |
| 0.3667 | 5.35 | 38800 | 0.2358 | 0.2478 |
| 0.3711 | 5.4 | 39200 | 0.2340 | 0.2463 |
| 0.361 | 5.46 | 39600 | 0.2375 | 0.2452 |
| 0.3655 | 5.51 | 40000 | 0.2292 | 0.2387 |
| 0.3655 | 5.57 | 40400 | 0.2330 | 0.2432 |
| 0.3637 | 5.62 | 40800 | 0.2242 | 0.2396 |
| 0.3516 | 5.68 | 41200 | 0.2284 | 0.2394 |
| 0.3498 | 5.73 | 41600 | 0.2254 | 0.2343 |
| 0.3626 | 5.79 | 42000 | 0.2191 | 0.2318 |
| 0.3626 | 5.84 | 42400 | 0.2261 | 0.2399 |
| 0.3719 | 5.9 | 42800 | 0.2261 | 0.2411 |
| 0.3563 | 5.95 | 43200 | 0.2259 | 0.2416 |
| 0.3574 | 6.01 | 43600 | 0.2148 | 0.2249 |
| 0.3339 | 6.06 | 44000 | 0.2173 | 0.2237 |
| 0.3339 | 6.12 | 44400 | 0.2133 | 0.2238 |
| 0.3303 | 6.17 | 44800 | 0.2193 | 0.2297 |
| 0.331 | 6.23 | 45200 | 0.2122 | 0.2205 |
| 0.3372 | 6.28 | 45600 | 0.2083 | 0.2215 |
| 0.3427 | 6.34 | 46000 | 0.2079 | 0.2163 |
| 0.3427 | 6.39 | 46400 | 0.2072 | 0.2154 |
| 0.3215 | 6.45 | 46800 | 0.2067 | 0.2170 |
| 0.3246 | 6.5 | 47200 | 0.2089 | 0.2183 |
| 0.3217 | 6.56 | 47600 | 0.2030 | 0.2130 |
| 0.3309 | 6.61 | 48000 | 0.2020 | 0.2123 |
| 0.3309 | 6.67 | 48400 | 0.2054 | 0.2133 |
| 0.3343 | 6.72 | 48800 | 0.2013 | 0.2128 |
| 0.3213 | 6.78 | 49200 | 0.1971 | 0.2064 |
| 0.3145 | 6.83 | 49600 | 0.2029 | 0.2107 |
| 0.3274 | 6.89 | 50000 | 0.2038 | 0.2136 |
| 0.3274 | 6.94 | 50400 | 0.1991 | 0.2064 |
| 0.3202 | 7.0 | 50800 | 0.1970 | 0.2083 |
| 0.314 | 7.05 | 51200 | 0.1970 | 0.2035 |
| 0.3031 | 7.11 | 51600 | 0.1943 | 0.2053 |
| 0.3004 | 7.16 | 52000 | 0.1942 | 0.1985 |
| 0.3004 | 7.22 | 52400 | 0.1941 | 0.2003 |
| 0.3029 | 7.27 | 52800 | 0.1936 | 0.2008 |
| 0.2915 | 7.33 | 53200 | 0.1935 | 0.1995 |
| 0.3005 | 7.38 | 53600 | 0.1943 | 0.2032 |
| 0.2984 | 7.44 | 54000 | 0.1913 | 0.1978 |
| 0.2984 | 7.5 | 54400 | 0.1907 | 0.1965 |
| 0.2978 | 7.55 | 54800 | 0.1881 | 0.1958 |
| 0.2944 | 7.61 | 55200 | 0.1887 | 0.1966 |
| 0.3004 | 7.66 | 55600 | 0.1870 | 0.1930 |
| 0.3099 | 7.72 | 56000 | 0.1906 | 0.1976 |
| 0.3099 | 7.77 | 56400 | 0.1856 | 0.1939 |
| 0.2917 | 7.83 | 56800 | 0.1883 | 0.1961 |
| 0.2924 | 7.88 | 57200 | 0.1864 | 0.1930 |
| 0.3061 | 7.94 | 57600 | 0.1831 | 0.1872 |
| 0.2834 | 7.99 | 58000 | 0.1835 | 0.1896 |
| 0.2834 | 8.05 | 58400 | 0.1828 | 0.1875 |
| 0.2807 | 8.1 | 58800 | 0.1820 | 0.1874 |
| 0.2765 | 8.16 | 59200 | 0.1807 | 0.1869 |
| 0.2737 | 8.21 | 59600 | 0.1810 | 0.1848 |
| 0.2722 | 8.27 | 60000 | 0.1795 | 0.1829 |
| 0.2722 | 8.32 | 60400 | 0.1785 | 0.1826 |
| 0.272 | 8.38 | 60800 | 0.1802 | 0.1836 |
| 0.268 | 8.43 | 61200 | 0.1771 | 0.1813 |
| 0.2695 | 8.49 | 61600 | 0.1773 | 0.1821 |
| 0.2686 | 8.54 | 62000 | 0.1756 | 0.1814 |
| 0.2686 | 8.6 | 62400 | 0.1740 | 0.1770 |
| 0.2687 | 8.65 | 62800 | 0.1748 | 0.1769 |
| 0.2686 | 8.71 | 63200 | 0.1734 | 0.1766 |
| 0.2683 | 8.76 | 63600 | 0.1722 | 0.1759 |
| 0.2686 | 8.82 | 64000 | 0.1719 | 0.1760 |
| 0.2686 | 8.87 | 64400 | 0.1720 | 0.1743 |
| 0.2626 | 8.93 | 64800 | 0.1696 | 0.1742 |
| 0.2587 | 8.98 | 65200 | 0.1690 | 0.1718 |
| 0.2554 | 9.04 | 65600 | 0.1704 | 0.1722 |
| 0.2537 | 9.09 | 66000 | 0.1702 | 0.1721 |
| 0.2537 | 9.15 | 66400 | 0.1696 | 0.1717 |
| 0.2511 | 9.2 | 66800 | 0.1685 | 0.1701 |
| 0.2473 | 9.26 | 67200 | 0.1696 | 0.1704 |
| 0.2458 | 9.31 | 67600 | 0.1686 | 0.1698 |
| 0.2476 | 9.37 | 68000 | 0.1675 | 0.1687 |
| 0.2476 | 9.42 | 68400 | 0.1659 | 0.1673 |
| 0.2463 | 9.48 | 68800 | 0.1664 | 0.1674 |
| 0.2481 | 9.53 | 69200 | 0.1661 | 0.1670 |
| 0.2411 | 9.59 | 69600 | 0.1658 | 0.1663 |
| 0.2445 | 9.64 | 70000 | 0.1652 | 0.1660 |
| 0.2445 | 9.7 | 70400 | 0.1646 | 0.1654 |
| 0.2407 | 9.75 | 70800 | 0.1646 | 0.1641 |
| 0.2483 | 9.81 | 71200 | 0.1641 | 0.1641 |
| 0.245 | 9.86 | 71600 | 0.1635 | 0.1643 |
| 0.2402 | 9.92 | 72000 | 0.1638 | 0.1634 |
| 0.2402 | 9.98 | 72400 | 0.1633 | 0.1636 |

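
The `Wer` column appears to be reported as a fraction (0.1636 is about 16.4%), whereas the metadata block at the top of the card gives percentages. As a reminder of what the number measures, here is a minimal sketch of word error rate for a single reference/prediction pair: the word-level edit distance divided by the number of reference words. The function name and the Italian example sentences are illustrative; the evaluation script itself relies on the `wer` metric from the `datasets` library, which aggregates over the whole corpus rather than averaging per sentence.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the prediction.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from prediction
            insertion = cur[j - 1] + 1  # extra word in prediction
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (gatto -> gato) and one deletion (sul) over 5 words.
    print(word_error_rate("il gatto dorme sul divano", "il gato dorme divano"))
```

Because insertions count as errors, the value can exceed 1.0 when the prediction is much longer than the reference.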
### Framework versions

- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0
1
added_tokens.json
Normal file
@@ -0,0 +1 @@
{"<s>": 40, "</s>": 41}
1
alphabet.json
Normal file
@@ -0,0 +1 @@
{"labels": [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "\u00e0", "\u00e8", "\u00e9", "\u00ed", "\u00f2", "\u00f3", "\u00fa", "\u0127", "\u02b9", "\u0307", "\u044a", "\u2047", "", "<s>", "</s>"], "is_bpe": false}
107
config.json
Normal file
@@ -0,0 +1,107 @@
{
  "_name_or_path": "facebook/wav2vec2-xls-r-300m",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 768,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": true,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": true,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.0,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.0,
  "hidden_act": "gelu",
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.0,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
  "model_type": "wav2vec2",
  "num_adapter_layers": 3,
  "num_attention_heads": 16,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 24,
  "num_negatives": 100,
  "output_hidden_size": 1024,
  "pad_token_id": 39,
  "proj_codevector_dim": 768,
  "tdnn_dilation": [
    1,
    2,
    3,
    1,
    1
  ],
  "tdnn_dim": [
    512,
    512,
    512,
    512,
    1500
  ],
  "tdnn_kernel": [
    5,
    3,
    3,
    1,
    1
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.17.0.dev0",
  "use_weighted_layer_sum": false,
  "vocab_size": 42,
  "xvector_output_dim": 512
}
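
The `conv_kernel` and `conv_stride` lists in `config.json` determine how many feature frames the convolutional encoder emits for a given number of audio samples. A minimal sketch of that arithmetic follows, assuming unpadded 1-D convolutions, so each layer maps a length `n` to `(n - kernel) // stride + 1`. The function names are illustrative, and the 16000 Hz rate is taken from `preprocessor_config.json`.

```python
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]  # "conv_kernel" in config.json
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]   # "conv_stride" in config.json
SAMPLING_RATE = 16_000                # "sampling_rate" in preprocessor_config.json


def num_output_frames(num_samples: int) -> int:
    """Frames produced by the conv stack, assuming no padding in any layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


def total_stride_and_receptive_field() -> tuple:
    """Hop size and receptive field of one output frame, both in samples."""
    receptive_field, jump = 1, 1
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return jump, receptive_field


if __name__ == "__main__":
    hop, field = total_stride_and_receptive_field()
    print("hop (ms):", 1000 * hop / SAMPLING_RATE)
    print("receptive field (ms):", 1000 * field / SAMPLING_RATE)
    print("frames for 1 s of audio:", num_output_frames(SAMPLING_RATE))
```

Under these assumptions one second of 16 kHz audio yields 49 frames, each covering 25 ms with a 20 ms hop, which is the time resolution at which the CTC head predicts characters.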
244
eval.py
Normal file
@@ -0,0 +1,244 @@
#!/usr/bin/env python3
import argparse
import re
from typing import Dict

import torch
# normalize_string is a project helper, presumably shipped with the
# robust-speech-challenge repository listed in requirements.txt
from src.data.normalization import normalize_string
from datasets import Audio, Dataset, load_dataset, load_metric

from transformers import (
    AutoFeatureExtractor,
    pipeline,
    AutoTokenizer,
    Wav2Vec2Processor,
    Wav2Vec2ProcessorWithLM,
    Wav2Vec2ForCTC,
    AutoConfig,
)


def log_results(result: Dataset, args: Dict[str, str]):
    """DO NOT CHANGE. This function computes and logs the result metrics."""

    log_outputs = args.log_outputs
    dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])

    # load metric
    wer = load_metric("wer")
    cer = load_metric("cer")

    # compute metrics
    wer_result = wer.compute(
        references=result["target"], predictions=result["prediction"]
    )
    cer_result = cer.compute(
        references=result["target"], predictions=result["prediction"]
    )

    # print & log results
    result_str = f"WER: {wer_result}\n" f"CER: {cer_result}"
    print(result_str)

    with open(f"{dataset_id}_eval_results.txt", "w") as f:
        f.write(result_str)

    # log all results in text file. Possibly interesting for analysis
    if log_outputs is not None:
        pred_file = f"log_{dataset_id}_predictions.txt"
        target_file = f"log_{dataset_id}_targets.txt"

        with open(pred_file, "w") as p, open(target_file, "w") as t:

            # mapping function to write output
            def write_to_file(batch, i):
                p.write(f"{i}" + "\n")
                p.write(batch["prediction"] + "\n")
                t.write(f"{i}" + "\n")
                t.write(batch["target"] + "\n")

            result.map(write_to_file, with_indices=True)


def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
    """DO ADAPT FOR YOUR USE CASE. this function normalizes the target text."""
    text = normalize_string(text)
    text = text.lower() if to_lower else text.upper()

    # replace characters outside the model vocabulary, then collapse whitespace
    text = re.sub(invalid_chars_regex, " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    return text


def main(args):
    # load dataset
    dataset = load_dataset(
        args.dataset, args.config, split=args.split, use_auth_token=True
    )

    # for testing: only process the first ten examples
    # dataset = dataset.select(range(10))

    # load processor
    # feature_extractor = AutoFeatureExtractor.from_pretrained(args.model_id)
    # sampling_rate = feature_extractor.sampling_rate

    if args.ctcdecode:
        processor = Wav2Vec2ProcessorWithLM.from_pretrained(args.model_id)
        decoder = processor.decoder
    else:
        processor = Wav2Vec2Processor.from_pretrained(args.model_id)
        decoder = None

    feature_extractor = processor.feature_extractor
    tokenizer = processor.tokenizer
    sampling_rate = feature_extractor.sampling_rate

    config = AutoConfig.from_pretrained(args.model_id)
    model = Wav2Vec2ForCTC.from_pretrained(args.model_id)

    # resample audio
    dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

    # load eval pipeline
    if args.device is None:
        args.device = 0 if torch.cuda.is_available() else -1

    asr = pipeline(
        "automatic-speech-recognition",
        model=model,
        config=config,
        feature_extractor=feature_extractor,
        decoder=decoder,
        tokenizer=tokenizer,
        device=args.device,
    )

    # build normalizer config
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    tokens = [
        x for x in tokenizer.convert_ids_to_tokens(range(0, tokenizer.vocab_size))
    ]
    special_tokens = [
        tokenizer.pad_token,
        tokenizer.word_delimiter_token,
        tokenizer.unk_token,
        tokenizer.bos_token,
        tokenizer.eos_token,
    ]
    non_special_tokens = [x for x in tokens if x not in special_tokens]
    # raw f-string so that \s reaches the regex engine unchanged
    invalid_chars_regex = rf"[^\s{re.escape(''.join(set(non_special_tokens)))}]"
    normalize_to_lower = False
    for token in non_special_tokens:
        if token.isalpha() and token.islower():
            normalize_to_lower = True
            break

    # map function to decode audio
    def map_to_pred(
        batch,
        args=args,
        asr=asr,
        invalid_chars_regex=invalid_chars_regex,
        normalize_to_lower=normalize_to_lower,
    ):
        prediction = asr(
            batch["audio"]["array"],
            chunk_length_s=args.chunk_length_s,
            stride_length_s=args.stride_length_s,
            # decoder_kwargs={"beam_width": args.beam_width},
        )

        batch["prediction"] = prediction["text"]
        batch["target"] = normalize_text(
            batch["sentence"], invalid_chars_regex, normalize_to_lower
        )
        return batch

    # alternative decoding path, currently unused; it passes logits to
    # batch_decode, so it only applies when --ctcdecode loads the LM processor
    def map_and_decode(batch):
        inputs = processor(
            batch["audio"]["array"],
            sampling_rate=batch["audio"]["sampling_rate"],
            return_tensors="pt",
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        # batch_decode returns one text per batch item; a single item is decoded here
        transcription = processor.batch_decode(logits.numpy()).text[0]
        batch["prediction"] = transcription
        batch["target"] = normalize_text(
            batch["sentence"], invalid_chars_regex, normalize_to_lower
        )
        return batch

    # transcription = .lower()
    # run inference on all examples
    result = dataset.map(map_to_pred, remove_columns=dataset.column_names)

    # compute and log_results
    # do not change function below
    log_results(result, args)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--model_id",
        type=str,
        required=True,
        help="Model identifier. Should be loadable with 🤗 Transformers",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        required=True,
        help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets",
    )
    parser.add_argument(
        "--config",
        type=str,
        required=True,
        help="Config of the dataset. *E.g.* `'en'` for Common Voice",
    )
    parser.add_argument(
        "--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`"
    )
    parser.add_argument(
        "--chunk_length_s",
        type=float,
        default=None,
        help="Chunk length in seconds. Not set by default.",
    )
    parser.add_argument(
        "--stride_length_s",
        type=float,
        default=None,
        help="Stride of the audio chunks in seconds. Not set by default.",
    )
    parser.add_argument(
        "--log_outputs",
        action="store_true",
        help="If defined, write outputs to log file for analysis.",
    )
    parser.add_argument(
        "--ctcdecode",
        action="store_true",
        help="Apply the ctc decoder to the output (only if present in the model card).",
    )
    parser.add_argument(
        "--device",
        type=int,
        default=None,
        help="The device to run the pipeline on. -1 for CPU, 0 for the first GPU and so on. "
        "If unset, the first GPU is used when available, otherwise the CPU.",
    )
    parser.add_argument(
        "--beam_width",
        type=int,
        default=1,
        help="Beam width used by the pyctc decoder.",
    )
    args = parser.parse_args()

    main(args)
3
language_model/5gram-it-ds-ytsv2.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5aad288e91a4bd1ba1b377aa19fceb77109647f84cac236bb757bc83685b609b
size 868657580
1
language_model/attrs.json
Normal file
@@ -0,0 +1 @@
{"alpha": 0.5, "beta": 1.5, "unk_score_offset": -10.0, "score_boundary": true}
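
The `alpha` and `beta` values in `attrs.json` weight the n-gram language model during beam-search decoding. The sketch below shows the general idea of shallow fusion under a simplifying assumption: a hypothesis is scored as its acoustic log-probability plus `alpha` times its language-model log-probability plus `beta` per word. It is not the decoder library's actual implementation, it ignores `unk_score_offset` and `score_boundary`, and the function name and example scores are made up for illustration.

```python
ALPHA = 0.5  # "alpha" in language_model/attrs.json: weight of the LM score
BETA = 1.5   # "beta" in language_model/attrs.json: bonus per decoded word


def fused_score(acoustic_logp: float, lm_logp: float, num_words: int) -> float:
    """Simplified shallow-fusion score for one hypothesis (higher is better)."""
    return acoustic_logp + ALPHA * lm_logp + BETA * num_words


if __name__ == "__main__":
    # Hypothetical scores for two competing two-word hypotheses.
    plausible = fused_score(acoustic_logp=-4.0, lm_logp=-6.0, num_words=2)
    implausible = fused_score(acoustic_logp=-3.5, lm_logp=-12.0, num_words=2)
    print(plausible, implausible)
```

In this toy case the second hypothesis has the better acoustic score, yet the first one wins after fusion because the language model finds it far more likely. The word bonus offsets the tendency of the language-model term to favor shorter outputs.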
554846
language_model/unigrams.txt
Normal file
File diff suppressed because it is too large
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:30002063dc6a3652881ed7b3540b3a6e49a6297510db2a58c6013a2874b53c7a
size 1261979632
10
preprocessor_config.json
Normal file
@@ -0,0 +1,10 @@
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "processor_class": "Wav2Vec2ProcessorWithLM",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
3
pytorch_model.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5f3a687c6f0635dbdf3df5dfae72a125eed8f1bcd0158d317d858c929f094bbe
size 1262095857
28
requirements.txt
Normal file
@@ -0,0 +1,28 @@
# external requirements
git+https://github.com/dbdmg/robust-speech-challenge.git
click
Sphinx
coverage
awscli
flake8
python-dotenv>=0.5.1
comet_ml

# audio data augmentations
torch
git+https://github.com/MorenoLaQuatra/torch-audiomentations.git
librosa
pysrt
num2words

# deep deep learning
transformers
datasets>=1.18.3
jiwer

# pyctcdecode
pypi-kenlm
pandas
pyctcdecode
pydub
soundfile
1
special_tokens_map.json
Normal file
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "pad_token": "[PAD]", "additional_special_tokens": [{"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}]}
1
tokenizer_config.json
Normal file
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "[PAD]", "do_lower_case": false, "word_delimiter_token": "|", "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "dbdmg/wav2vec2-xls-r-300m-italian-robust", "tokenizer_class": "Wav2Vec2CTCTokenizer"}
3
training_args.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:40d3586c5f771c6045940b979e267119454a67e177da2a6521e0d8aa293a8fb7
size 3183
1
vocab.json
Normal file
@@ -0,0 +1 @@
{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8, "i": 9, "j": 10, "k": 11, "l": 12, "m": 13, "n": 14, "o": 15, "p": 16, "q": 17, "r": 18, "s": 19, "t": 20, "u": 21, "v": 22, "w": 23, "x": 24, "y": 25, "z": 26, "à": 27, "è": 28, "é": 29, "í": 30, "ò": 31, "ó": 32, "ú": 33, "ħ": 34, "ʹ": 35, "̇": 36, "ъ": 37, "|": 0, "[UNK]": 38, "[PAD]": 39}
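
To illustrate how the ids in `vocab.json` turn into text when no language model is used, here is a minimal sketch of greedy CTC decoding: take the most likely id per frame, collapse consecutive repeats, drop the blank, and turn the word delimiter `|` into a space. It assumes the blank symbol is `[PAD]` (id 39, matching `pad_token_id` in `config.json`). The lookup table covers only `a` to `z` plus the special tokens, and the function name and frame sequence are invented for the example.

```python
import string

# Subset of vocab.json, inverted to id -> token (accented letters omitted for brevity).
ID_TO_TOKEN = {0: "|", 38: "[UNK]", 39: "[PAD]"}
ID_TO_TOKEN.update({i + 1: ch for i, ch in enumerate(string.ascii_lowercase)})

BLANK_ID = 39         # assumption: the CTC blank is the [PAD] token
WORD_DELIMITER = "|"  # "word_delimiter_token" in tokenizer_config.json


def ctc_greedy_decode(frame_ids: list) -> str:
    """Collapse repeated ids, remove blanks, and map ids to characters."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # A repeated id only counts again if a different id (often blank) intervened.
        if token_id != previous and token_id != BLANK_ID:
            chars.append(ID_TO_TOKEN[token_id])
        previous = token_id
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


if __name__ == "__main__":
    # Hypothetical per-frame argmax ids spelling "ciao mondo".
    frames = [39, 3, 3, 39, 9, 1, 1, 39, 15, 0, 0, 13, 39, 15, 14, 4, 15]
    print(ctc_greedy_decode(frames))
```

The blank is what allows double letters: in a word such as "gatto" the two `t` frames must be separated by a blank frame, otherwise they would collapse into one.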