Initialize project; model provided by the ModelHub XC community

Model: dbdmg/wav2vec2-xls-r-300m-italian-robust Source: Original Platform
2026-05-20 23:24:23 +08:00
commit 565fe710f5
18 changed files with 555585 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,29 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text 
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+language_model/5gram-it-ds-ytsv2.arpa filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+checkpoint-*/
--- a/README.md
+++ b/README.md
@@ -0,0 +1,302 @@
+---
+language: it
+license: apache-2.0
+tags:
+- automatic-speech-recognition
+- generated_from_trainer
+- hf-asr-leaderboard
+- robust-speech-event
+datasets:
+- mozilla-foundation/common_voice_7_0
+base_model: facebook/wav2vec2-xls-r-300m
+model-index:
+- name: XLS-R-300m - Italian
+  results:
+  - task:
+      type: automatic-speech-recognition
+      name: Automatic Speech Recognition
+    dataset:
+      name: Common Voice 7
+      type: mozilla-foundation/common_voice_7_0
+      args: it
+    metrics:
+    - type: wer
+      value: 17.17
+      name: Test WER
+    - type: cer
+      value: 4.27
+      name: Test CER
+    - type: wer
+      value: 12.07
+      name: Test WER (+LM)
+    - type: cer
+      value: 3.52
+      name: Test CER (+LM)
+  - task:
+      type: automatic-speech-recognition
+      name: Automatic Speech Recognition
+    dataset:
+      name: Robust Speech Event - Dev Data
+      type: speech-recognition-community-v2/dev_data
+      args: it
+    metrics:
+    - type: wer
+      value: 24.29
+      name: Test WER
+    - type: cer
+      value: 8.1
+      name: Test CER
+    - type: wer
+      value: 17.36
+      name: Test WER (+LM)
+    - type: cer
+      value: 7.94
+      name: Test CER (+LM)
+  - task:
+      type: automatic-speech-recognition
+      name: Automatic Speech Recognition
+    dataset:
+      name: Robust Speech Event - Test Data
+      type: speech-recognition-community-v2/eval_data
+      args: it
+    metrics:
+    - type: wer
+      value: 33.66
+      name: Test WER
+---
+
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+
+# wav2vec2-xls-r-300m-italian-robust
+
+This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the Italian splits of the following datasets:
+- Mozilla Foundation Common Voice V7 dataset
+- [LibriSpeech multilingual](http://www.openslr.org/94)
+- [TED multilingual](https://www.openslr.org/100/)
+- [Voxforge](http://www.voxforge.org/it/Downloads)
+- [M-AILABS Speech Dataset](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) 
+- [EuroParl-ST](https://www.mllp.upv.es/europarl-st/)
+- [EMOVO](http://voice.fub.it/activities/corpora/emovo/index.html) 
+- [MSPKA](http://www.mspkacorpus.it/) 
+
+## Model description
+
+More information needed
+
+## Intended uses & limitations
+
+More information needed
+
+## Training and evaluation data
+
+More information needed
+
+## Training procedure
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 0.0003
+- train_batch_size: 32
+- eval_batch_size: 8
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_steps: 500
+- num_epochs: 10.0
+- mixed_precision_training: Native AMP
+
+### Training results
+
+| Training Loss | Epoch | Step  | Validation Loss | Wer    |
+|:-------------:|:-----:|:-----:|:---------------:|:------:|
+| No log        | 0.06  | 400   | 0.7508          | 0.7354 |
+| 2.3127        | 0.11  | 800   | 0.5888          | 0.5882 |
+| 0.7256        | 0.17  | 1200  | 0.5121          | 0.5247 |
+| 0.6692        | 0.22  | 1600  | 0.4774          | 0.5028 |
+| 0.6384        | 0.28  | 2000  | 0.4832          | 0.4885 |
+| 0.6384        | 0.33  | 2400  | 0.4410          | 0.4581 |
+| 0.6199        | 0.39  | 2800  | 0.4160          | 0.4331 |
+| 0.5972        | 0.44  | 3200  | 0.4136          | 0.4275 |
+| 0.6048        | 0.5   | 3600  | 0.4362          | 0.4538 |
+| 0.5627        | 0.55  | 4000  | 0.4313          | 0.4469 |
+| 0.5627        | 0.61  | 4400  | 0.4425          | 0.4579 |
+| 0.5855        | 0.66  | 4800  | 0.3859          | 0.4133 |
+| 0.5702        | 0.72  | 5200  | 0.3974          | 0.4097 |
+| 0.55          | 0.77  | 5600  | 0.3931          | 0.4134 |
+| 0.5624        | 0.83  | 6000  | 0.3900          | 0.4126 |
+| 0.5624        | 0.88  | 6400  | 0.3622          | 0.3899 |
+| 0.5615        | 0.94  | 6800  | 0.3755          | 0.4067 |
+| 0.5472        | 0.99  | 7200  | 0.3980          | 0.4284 |
+| 0.5663        | 1.05  | 7600  | 0.3553          | 0.3782 |
+| 0.5189        | 1.1   | 8000  | 0.3538          | 0.3726 |
+| 0.5189        | 1.16  | 8400  | 0.3425          | 0.3624 |
+| 0.518         | 1.21  | 8800  | 0.3431          | 0.3651 |
+| 0.5399        | 1.27  | 9200  | 0.3442          | 0.3573 |
+| 0.5303        | 1.32  | 9600  | 0.3241          | 0.3404 |
+| 0.5043        | 1.38  | 10000 | 0.3175          | 0.3378 |
+| 0.5043        | 1.43  | 10400 | 0.3265          | 0.3501 |
+| 0.4968        | 1.49  | 10800 | 0.3539          | 0.3703 |
+| 0.5102        | 1.54  | 11200 | 0.3323          | 0.3506 |
+| 0.5008        | 1.6   | 11600 | 0.3188          | 0.3433 |
+| 0.4996        | 1.65  | 12000 | 0.3162          | 0.3388 |
+| 0.4996        | 1.71  | 12400 | 0.3353          | 0.3552 |
+| 0.5007        | 1.76  | 12800 | 0.3152          | 0.3317 |
+| 0.4956        | 1.82  | 13200 | 0.3207          | 0.3430 |
+| 0.5205        | 1.87  | 13600 | 0.3239          | 0.3430 |
+| 0.4829        | 1.93  | 14000 | 0.3134          | 0.3266 |
+| 0.4829        | 1.98  | 14400 | 0.3039          | 0.3291 |
+| 0.5251        | 2.04  | 14800 | 0.2944          | 0.3169 |
+| 0.4872        | 2.09  | 15200 | 0.3061          | 0.3228 |
+| 0.4805        | 2.15  | 15600 | 0.3034          | 0.3152 |
+| 0.4949        | 2.2   | 16000 | 0.2896          | 0.3066 |
+| 0.4949        | 2.26  | 16400 | 0.3059          | 0.3344 |
+| 0.468         | 2.31  | 16800 | 0.2932          | 0.3111 |
+| 0.4637        | 2.37  | 17200 | 0.2890          | 0.3074 |
+| 0.4638        | 2.42  | 17600 | 0.2893          | 0.3112 |
+| 0.4728        | 2.48  | 18000 | 0.2832          | 0.3013 |
+| 0.4728        | 2.54  | 18400 | 0.2921          | 0.3065 |
+| 0.456         | 2.59  | 18800 | 0.2961          | 0.3104 |
+| 0.4628        | 2.65  | 19200 | 0.2886          | 0.3109 |
+| 0.4534        | 2.7   | 19600 | 0.2828          | 0.3020 |
+| 0.4578        | 2.76  | 20000 | 0.2805          | 0.3026 |
+| 0.4578        | 2.81  | 20400 | 0.2796          | 0.2987 |
+| 0.4702        | 2.87  | 20800 | 0.2748          | 0.2906 |
+| 0.4487        | 2.92  | 21200 | 0.2819          | 0.3008 |
+| 0.4411        | 2.98  | 21600 | 0.2722          | 0.2868 |
+| 0.4631        | 3.03  | 22000 | 0.2814          | 0.2974 |
+| 0.4631        | 3.09  | 22400 | 0.2762          | 0.2894 |
+| 0.4591        | 3.14  | 22800 | 0.2802          | 0.2980 |
+| 0.4349        | 3.2   | 23200 | 0.2748          | 0.2951 |
+| 0.4339        | 3.25  | 23600 | 0.2792          | 0.2927 |
+| 0.4254        | 3.31  | 24000 | 0.2712          | 0.2911 |
+| 0.4254        | 3.36  | 24400 | 0.2719          | 0.2892 |
+| 0.4317        | 3.42  | 24800 | 0.2686          | 0.2861 |
+| 0.4282        | 3.47  | 25200 | 0.2632          | 0.2861 |
+| 0.4262        | 3.53  | 25600 | 0.2633          | 0.2817 |
+| 0.4162        | 3.58  | 26000 | 0.2561          | 0.2765 |
+| 0.4162        | 3.64  | 26400 | 0.2613          | 0.2847 |
+| 0.414         | 3.69  | 26800 | 0.2679          | 0.2824 |
+| 0.4132        | 3.75  | 27200 | 0.2569          | 0.2813 |
+| 0.405         | 3.8   | 27600 | 0.2589          | 0.2785 |
+| 0.4128        | 3.86  | 28000 | 0.2611          | 0.2714 |
+| 0.4128        | 3.91  | 28400 | 0.2548          | 0.2731 |
+| 0.4174        | 3.97  | 28800 | 0.2574          | 0.2716 |
+| 0.421         | 4.02  | 29200 | 0.2529          | 0.2700 |
+| 0.4109        | 4.08  | 29600 | 0.2547          | 0.2682 |
+| 0.4027        | 4.13  | 30000 | 0.2578          | 0.2758 |
+| 0.4027        | 4.19  | 30400 | 0.2511          | 0.2715 |
+| 0.4075        | 4.24  | 30800 | 0.2507          | 0.2601 |
+| 0.3947        | 4.3   | 31200 | 0.2552          | 0.2711 |
+| 0.4042        | 4.35  | 31600 | 0.2530          | 0.2695 |
+| 0.3907        | 4.41  | 32000 | 0.2543          | 0.2738 |
+| 0.3907        | 4.46  | 32400 | 0.2491          | 0.2629 |
+| 0.3895        | 4.52  | 32800 | 0.2471          | 0.2611 |
+| 0.3901        | 4.57  | 33200 | 0.2404          | 0.2559 |
+| 0.3818        | 4.63  | 33600 | 0.2378          | 0.2583 |
+| 0.3831        | 4.68  | 34000 | 0.2341          | 0.2499 |
+| 0.3831        | 4.74  | 34400 | 0.2379          | 0.2560 |
+| 0.3808        | 4.79  | 34800 | 0.2418          | 0.2553 |
+| 0.4015        | 4.85  | 35200 | 0.2378          | 0.2565 |
+| 0.407         | 4.9   | 35600 | 0.2375          | 0.2535 |
+| 0.38          | 4.96  | 36000 | 0.2329          | 0.2451 |
+| 0.38          | 5.02  | 36400 | 0.2541          | 0.2737 |
+| 0.3753        | 5.07  | 36800 | 0.2475          | 0.2580 |
+| 0.3701        | 5.13  | 37200 | 0.2356          | 0.2484 |
+| 0.3627        | 5.18  | 37600 | 0.2422          | 0.2552 |
+| 0.3652        | 5.24  | 38000 | 0.2353          | 0.2518 |
+| 0.3652        | 5.29  | 38400 | 0.2328          | 0.2452 |
+| 0.3667        | 5.35  | 38800 | 0.2358          | 0.2478 |
+| 0.3711        | 5.4   | 39200 | 0.2340          | 0.2463 |
+| 0.361         | 5.46  | 39600 | 0.2375          | 0.2452 |
+| 0.3655        | 5.51  | 40000 | 0.2292          | 0.2387 |
+| 0.3655        | 5.57  | 40400 | 0.2330          | 0.2432 |
+| 0.3637        | 5.62  | 40800 | 0.2242          | 0.2396 |
+| 0.3516        | 5.68  | 41200 | 0.2284          | 0.2394 |
+| 0.3498        | 5.73  | 41600 | 0.2254          | 0.2343 |
+| 0.3626        | 5.79  | 42000 | 0.2191          | 0.2318 |
+| 0.3626        | 5.84  | 42400 | 0.2261          | 0.2399 |
+| 0.3719        | 5.9   | 42800 | 0.2261          | 0.2411 |
+| 0.3563        | 5.95  | 43200 | 0.2259          | 0.2416 |
+| 0.3574        | 6.01  | 43600 | 0.2148          | 0.2249 |
+| 0.3339        | 6.06  | 44000 | 0.2173          | 0.2237 |
+| 0.3339        | 6.12  | 44400 | 0.2133          | 0.2238 |
+| 0.3303        | 6.17  | 44800 | 0.2193          | 0.2297 |
+| 0.331         | 6.23  | 45200 | 0.2122          | 0.2205 |
+| 0.3372        | 6.28  | 45600 | 0.2083          | 0.2215 |
+| 0.3427        | 6.34  | 46000 | 0.2079          | 0.2163 |
+| 0.3427        | 6.39  | 46400 | 0.2072          | 0.2154 |
+| 0.3215        | 6.45  | 46800 | 0.2067          | 0.2170 |
+| 0.3246        | 6.5   | 47200 | 0.2089          | 0.2183 |
+| 0.3217        | 6.56  | 47600 | 0.2030          | 0.2130 |
+| 0.3309        | 6.61  | 48000 | 0.2020          | 0.2123 |
+| 0.3309        | 6.67  | 48400 | 0.2054          | 0.2133 |
+| 0.3343        | 6.72  | 48800 | 0.2013          | 0.2128 |
+| 0.3213        | 6.78  | 49200 | 0.1971          | 0.2064 |
+| 0.3145        | 6.83  | 49600 | 0.2029          | 0.2107 |
+| 0.3274        | 6.89  | 50000 | 0.2038          | 0.2136 |
+| 0.3274        | 6.94  | 50400 | 0.1991          | 0.2064 |
+| 0.3202        | 7.0   | 50800 | 0.1970          | 0.2083 |
+| 0.314         | 7.05  | 51200 | 0.1970          | 0.2035 |
+| 0.3031        | 7.11  | 51600 | 0.1943          | 0.2053 |
+| 0.3004        | 7.16  | 52000 | 0.1942          | 0.1985 |
+| 0.3004        | 7.22  | 52400 | 0.1941          | 0.2003 |
+| 0.3029        | 7.27  | 52800 | 0.1936          | 0.2008 |
+| 0.2915        | 7.33  | 53200 | 0.1935          | 0.1995 |
+| 0.3005        | 7.38  | 53600 | 0.1943          | 0.2032 |
+| 0.2984        | 7.44  | 54000 | 0.1913          | 0.1978 |
+| 0.2984        | 7.5   | 54400 | 0.1907          | 0.1965 |
+| 0.2978        | 7.55  | 54800 | 0.1881          | 0.1958 |
+| 0.2944        | 7.61  | 55200 | 0.1887          | 0.1966 |
+| 0.3004        | 7.66  | 55600 | 0.1870          | 0.1930 |
+| 0.3099        | 7.72  | 56000 | 0.1906          | 0.1976 |
+| 0.3099        | 7.77  | 56400 | 0.1856          | 0.1939 |
+| 0.2917        | 7.83  | 56800 | 0.1883          | 0.1961 |
+| 0.2924        | 7.88  | 57200 | 0.1864          | 0.1930 |
+| 0.3061        | 7.94  | 57600 | 0.1831          | 0.1872 |
+| 0.2834        | 7.99  | 58000 | 0.1835          | 0.1896 |
+| 0.2834        | 8.05  | 58400 | 0.1828          | 0.1875 |
+| 0.2807        | 8.1   | 58800 | 0.1820          | 0.1874 |
+| 0.2765        | 8.16  | 59200 | 0.1807          | 0.1869 |
+| 0.2737        | 8.21  | 59600 | 0.1810          | 0.1848 |
+| 0.2722        | 8.27  | 60000 | 0.1795          | 0.1829 |
+| 0.2722        | 8.32  | 60400 | 0.1785          | 0.1826 |
+| 0.272         | 8.38  | 60800 | 0.1802          | 0.1836 |
+| 0.268         | 8.43  | 61200 | 0.1771          | 0.1813 |
+| 0.2695        | 8.49  | 61600 | 0.1773          | 0.1821 |
+| 0.2686        | 8.54  | 62000 | 0.1756          | 0.1814 |
+| 0.2686        | 8.6   | 62400 | 0.1740          | 0.1770 |
+| 0.2687        | 8.65  | 62800 | 0.1748          | 0.1769 |
+| 0.2686        | 8.71  | 63200 | 0.1734          | 0.1766 |
+| 0.2683        | 8.76  | 63600 | 0.1722          | 0.1759 |
+| 0.2686        | 8.82  | 64000 | 0.1719          | 0.1760 |
+| 0.2686        | 8.87  | 64400 | 0.1720          | 0.1743 |
+| 0.2626        | 8.93  | 64800 | 0.1696          | 0.1742 |
+| 0.2587        | 8.98  | 65200 | 0.1690          | 0.1718 |
+| 0.2554        | 9.04  | 65600 | 0.1704          | 0.1722 |
+| 0.2537        | 9.09  | 66000 | 0.1702          | 0.1721 |
+| 0.2537        | 9.15  | 66400 | 0.1696          | 0.1717 |
+| 0.2511        | 9.2   | 66800 | 0.1685          | 0.1701 |
+| 0.2473        | 9.26  | 67200 | 0.1696          | 0.1704 |
+| 0.2458        | 9.31  | 67600 | 0.1686          | 0.1698 |
+| 0.2476        | 9.37  | 68000 | 0.1675          | 0.1687 |
+| 0.2476        | 9.42  | 68400 | 0.1659          | 0.1673 |
+| 0.2463        | 9.48  | 68800 | 0.1664          | 0.1674 |
+| 0.2481        | 9.53  | 69200 | 0.1661          | 0.1670 |
+| 0.2411        | 9.59  | 69600 | 0.1658          | 0.1663 |
+| 0.2445        | 9.64  | 70000 | 0.1652          | 0.1660 |
+| 0.2445        | 9.7   | 70400 | 0.1646          | 0.1654 |
+| 0.2407        | 9.75  | 70800 | 0.1646          | 0.1641 |
+| 0.2483        | 9.81  | 71200 | 0.1641          | 0.1641 |
+| 0.245         | 9.86  | 71600 | 0.1635          | 0.1643 |
+| 0.2402        | 9.92  | 72000 | 0.1638          | 0.1634 |
+| 0.2402        | 9.98  | 72400 | 0.1633          | 0.1636 |
+
+
+### Framework versions
+
+- Transformers 4.17.0.dev0
+- Pytorch 1.10.2+cu102
+- Datasets 1.18.3
+- Tokenizers 0.11.0
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -0,0 +1 @@
+{"<s>": 40, "</s>": 41}
--- a/alphabet.json
+++ b/alphabet.json
@@ -0,0 +1 @@
+{"labels": [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "\u00e0", "\u00e8", "\u00e9", "\u00ed", "\u00f2", "\u00f3", "\u00fa", "\u0127", "\u02b9", "\u0307", "\u044a", "\u2047", "", "<s>", "</s>"], "is_bpe": false}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,107 @@
+{
+  "_name_or_path": "facebook/wav2vec2-xls-r-300m",
+  "activation_dropout": 0.0,
+  "adapter_kernel_size": 3,
+  "adapter_stride": 2,
+  "add_adapter": false,
+  "apply_spec_augment": true,
+  "architectures": [
+    "Wav2Vec2ForCTC"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "classifier_proj_size": 256,
+  "codevector_dim": 768,
+  "contrastive_logits_temperature": 0.1,
+  "conv_bias": true,
+  "conv_dim": [
+    512,
+    512,
+    512,
+    512,
+    512,
+    512,
+    512
+  ],
+  "conv_kernel": [
+    10,
+    3,
+    3,
+    3,
+    3,
+    2,
+    2
+  ],
+  "conv_stride": [
+    5,
+    2,
+    2,
+    2,
+    2,
+    2,
+    2
+  ],
+  "ctc_loss_reduction": "mean",
+  "ctc_zero_infinity": true,
+  "diversity_loss_weight": 0.1,
+  "do_stable_layer_norm": true,
+  "eos_token_id": 2,
+  "feat_extract_activation": "gelu",
+  "feat_extract_dropout": 0.0,
+  "feat_extract_norm": "layer",
+  "feat_proj_dropout": 0.0,
+  "feat_quantizer_dropout": 0.0,
+  "final_dropout": 0.0,
+  "hidden_act": "gelu",
+  "hidden_dropout": 0.0,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "layerdrop": 0.0,
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_prob": 0.05,
+  "model_type": "wav2vec2",
+  "num_adapter_layers": 3,
+  "num_attention_heads": 16,
+  "num_codevector_groups": 2,
+  "num_codevectors_per_group": 320,
+  "num_conv_pos_embedding_groups": 16,
+  "num_conv_pos_embeddings": 128,
+  "num_feat_extract_layers": 7,
+  "num_hidden_layers": 24,
+  "num_negatives": 100,
+  "output_hidden_size": 1024,
+  "pad_token_id": 39,
+  "proj_codevector_dim": 768,
+  "tdnn_dilation": [
+    1,
+    2,
+    3,
+    1,
+    1
+  ],
+  "tdnn_dim": [
+    512,
+    512,
+    512,
+    512,
+    1500
+  ],
+  "tdnn_kernel": [
+    5,
+    3,
+    3,
+    1,
+    1
+  ],
+  "torch_dtype": "float32",
+  "transformers_version": "4.17.0.dev0",
+  "use_weighted_layer_sum": false,
+  "vocab_size": 42,
+  "xvector_output_dim": 512
+}
--- a/eval.py
+++ b/eval.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python3
+import argparse
+import re
+from typing import Dict
+from sklearn import feature_extraction
+
+import torch
+from src.data.normalization import normalize_string
+from datasets import Audio, Dataset, load_dataset, load_metric
+
+from transformers import (
+    AutoFeatureExtractor,
+    pipeline,
+    AutoTokenizer,
+    Wav2Vec2Processor,
+    Wav2Vec2ProcessorWithLM,
+    Wav2Vec2ForCTC,
+    AutoConfig,
+)
+
+
+def log_results(result: Dataset, args: argparse.Namespace):
+    """DO NOT CHANGE. This function computes and logs the result metrics."""
+
+    log_outputs = args.log_outputs
+    dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
+
+    # load metric
+    wer = load_metric("wer")
+    cer = load_metric("cer")
+
+    # compute metrics
+    wer_result = wer.compute(
+        references=result["target"], predictions=result["prediction"]
+    )
+    cer_result = cer.compute(
+        references=result["target"], predictions=result["prediction"]
+    )
+
+    # print & log results
+    result_str = f"WER: {wer_result}\n" f"CER: {cer_result}"
+    print(result_str)
+
+    with open(f"{dataset_id}_eval_results.txt", "w") as f:
+        f.write(result_str)
+
+    # log all results in text file. Possibly interesting for analysis
+    if log_outputs is not None:
+        pred_file = f"log_{dataset_id}_predictions.txt"
+        target_file = f"log_{dataset_id}_targets.txt"
+
+        with open(pred_file, "w") as p, open(target_file, "w") as t:
+
+            # mapping function to write output
+            def write_to_file(batch, i):
+                p.write(f"{i}" + "\n")
+                p.write(batch["prediction"] + "\n")
+                t.write(f"{i}" + "\n")
+                t.write(batch["target"] + "\n")
+
+            result.map(write_to_file, with_indices=True)
+
+
+def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
+    """DO ADAPT FOR YOUR USE CASE. this function normalizes the target text."""
+    text = normalize_string(text)
+    text = text.lower() if to_lower else text.upper()
+
+    text = re.sub(invalid_chars_regex, " ", text)
+    text = re.sub(r"\s+", " ", text).strip()
+
+    return text
+
+
+def main(args):
+    # load dataset
+    dataset = load_dataset(
+        args.dataset, args.config, split=args.split, use_auth_token=True
+    )
+
+    # for testing: only process the first ten examples as a test
+    # dataset = dataset.select(range(10))
+
+    # load processor
+    # feature_extractor = AutoFeatureExtractor.from_pretrained(args.model_id)
+    # sampling_rate = feature_extractor.sampling_rate
+
+    if args.ctcdecode:
+        processor = Wav2Vec2ProcessorWithLM.from_pretrained(args.model_id)
+        decoder = processor.decoder
+    else:
+        processor = Wav2Vec2Processor.from_pretrained(args.model_id)
+        decoder = None
+
+    feature_extractor = processor.feature_extractor
+    tokenizer = processor.tokenizer
+    sampling_rate = feature_extractor.sampling_rate
+
+    config = AutoConfig.from_pretrained(args.model_id)
+    model = Wav2Vec2ForCTC.from_pretrained(args.model_id)
+
+    # resample audio
+    dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
+
+    # load eval pipeline
+    if args.device is None:
+        args.device = 0 if torch.cuda.is_available() else -1
+
+    asr = pipeline(
+        "automatic-speech-recognition",
+        model=model,
+        config=config,
+        feature_extractor=feature_extractor,
+        decoder=decoder,
+        tokenizer=tokenizer,
+        device=args.device,
+    )
+
+    # build normalizer config
+    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+    tokens = [
+        x for x in tokenizer.convert_ids_to_tokens(range(0, tokenizer.vocab_size))
+    ]
+    special_tokens = [
+        tokenizer.pad_token,
+        tokenizer.word_delimiter_token,
+        tokenizer.unk_token,
+        tokenizer.bos_token,
+        tokenizer.eos_token,
+    ]
+    non_special_tokens = [x for x in tokens if x not in special_tokens]
+    invalid_chars_regex = rf"[^\s{re.escape(''.join(set(non_special_tokens)))}]"
+    normalize_to_lower = False
+    for token in non_special_tokens:
+        if token.isalpha() and token.islower():
+            normalize_to_lower = True
+            break
+
+    # map function to decode audio
+    def map_to_pred(
+        batch,
+        args=args,
+        asr=asr,
+        invalid_chars_regex=invalid_chars_regex,
+        normalize_to_lower=normalize_to_lower,
+    ):
+        prediction = asr(
+            batch["audio"]["array"],
+            chunk_length_s=args.chunk_length_s,
+            stride_length_s=args.stride_length_s,
+            #decoder_kwargs={"beam_width": args.beam_width},
+        )
+
+        batch["prediction"] = prediction["text"]
+        batch["target"] = normalize_text(
+            batch["sentence"], invalid_chars_regex, normalize_to_lower
+        )
+        return batch
+
+    def map_and_decode(batch):
+        inputs = processor(
+            batch["audio"]["array"],
+            sampling_rate=batch["audio"]["sampling_rate"],
+            return_tensors="pt",
+        )
+        with torch.no_grad():
+            logits = model(**inputs).logits
+        transcription = processor.batch_decode(logits.numpy()).text[0]  # one example per call; needs Wav2Vec2ProcessorWithLM
+        batch["prediction"] = transcription
+        batch["target"] = normalize_text(
+            batch["sentence"], invalid_chars_regex, normalize_to_lower
+        )
+        return batch
+
+    #         transcription = .lower()
+    # run inference on all examples
+    result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
+
+    # compute and log_results
+    # do not change function below
+    log_results(result, args)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument(
+        "--model_id",
+        type=str,
+        required=True,
+        help="Model identifier. Should be loadable with 🤗 Transformers",
+    )
+    parser.add_argument(
+        "--dataset",
+        type=str,
+        required=True,
+        help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets",
+    )
+    parser.add_argument(
+        "--config",
+        type=str,
+        required=True,
+        help="Config of the dataset. *E.g.* `'en'`  for Common Voice",
+    )
+    parser.add_argument(
+        "--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`"
+    )
+    parser.add_argument(
+        "--chunk_length_s",
+        type=float,
+        default=None,
+        help="Chunk length in seconds. Defaults to None, which disables chunking.",
+    )
+    parser.add_argument(
+        "--stride_length_s",
+        type=float,
+        default=None,
+        help="Stride of the audio chunks in seconds. Defaults to None, which leaves the stride to the pipeline.",
+    )
+    parser.add_argument(
+        "--log_outputs",
+        action="store_true",
+        help="If defined, write outputs to log file for analysis.",
+    )
+    parser.add_argument(
+        "--ctcdecode",
+        action="store_true",
+        help="Apply the ctc decoder to the output (only if present in the model card).",
+    )
+    parser.add_argument(
+        "--device",
+        type=int,
+        default=None,
+        help="The device to run the pipeline on. -1 for CPU, 0 for the first GPU and so on. Defaults to the first GPU if available, otherwise CPU.",
+    )
+    parser.add_argument(
+        "--beam_width",
+        type=int,
+        default=1,
+        help="Beam width used by the pyctc decoder.",
+    )
+    args = parser.parse_args()
+
+    main(args)
--- a/language_model/5gram-it-ds-ytsv2.bin
+++ b/language_model/5gram-it-ds-ytsv2.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5aad288e91a4bd1ba1b377aa19fceb77109647f84cac236bb757bc83685b609b
+size 868657580
--- a/language_model/attrs.json
+++ b/language_model/attrs.json
@@ -0,0 +1 @@
+{"alpha": 0.5, "beta": 1.5, "unk_score_offset": -10.0, "score_boundary": true}
--- a/language_model/unigrams.txt
+++ b/language_model/unigrams.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:30002063dc6a3652881ed7b3540b3a6e49a6297510db2a58c6013a2874b53c7a
+size 1261979632
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@@ -0,0 +1,10 @@
+{
+  "do_normalize": true,
+  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+  "processor_class": "Wav2Vec2ProcessorWithLM",
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0,
+  "return_attention_mask": true,
+  "sampling_rate": 16000
+}
--- a/pytorch_model.bin
+++ b/pytorch_model.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5f3a687c6f0635dbdf3df5dfae72a125eed8f1bcd0158d317d858c929f094bbe
+size 1262095857
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,28 @@
+# external requirements
+git+https://github.com/dbdmg/robust-speech-challenge.git
+click
+Sphinx
+coverage
+awscli
+flake8
+python-dotenv>=0.5.1
+comet_ml
+
+# audio data augmentations
+torch
+git+https://github.com/MorenoLaQuatra/torch-audiomentations.git
+librosa
+pysrt
+num2words
+
+# deep learning
+transformers
+datasets>=1.18.3
+jiwer
+
+# pyctcdecode
+pypi-kenlm
+pandas
+pyctcdecode
+pydub
+soundfile 
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1 @@
+{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "pad_token": "[PAD]", "additional_special_tokens": [{"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}]}
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "[PAD]", "do_lower_case": false, "word_delimiter_token": "|", "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "dbdmg/wav2vec2-xls-r-300m-italian-robust", "tokenizer_class": "Wav2Vec2CTCTokenizer"}
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:40d3586c5f771c6045940b979e267119454a67e177da2a6521e0d8aa293a8fb7
+size 3183
--- a/vocab.json
+++ b/vocab.json
@@ -0,0 +1 @@
+{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8, "i": 9, "j": 10, "k": 11, "l": 12, "m": 13, "n": 14, "o": 15, "p": 16, "q": 17, "r": 18, "s": 19, "t": 20, "u": 21, "v": 22, "w": 23, "x": 24, "y": 25, "z": 26, "à": 27, "è": 28, "é": 29, "í": 30, "ò": 31, "ó": 32, "ú": 33, "ħ": 34, "ʹ": 35, "̇": 36, "ъ": 37, "|": 0, "[UNK]": 38, "[PAD]": 39}
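Note on the vocabulary above: as an illustration of how `vocab.json` is used at inference time, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes that `[PAD]` (id 39, matching `pad_token_id` in `config.json`) acts as the CTC blank and that `|` is the word delimiter declared in `tokenizer_config.json`. The function name `greedy_ctc_decode` and the example frame ids are hypothetical. This is not the decoding code shipped in this commit, which goes through `transformers` and, with `--ctcdecode`, through the 5-gram language model.

```python
# Subset of vocab.json from this commit; only the entries used in the demo.
VOCAB = {"|": 0, "a": 1, "c": 3, "i": 9, "o": 15, "[UNK]": 38, "[PAD]": 39}
ID_TO_TOKEN = {idx: token for token, idx in VOCAB.items()}

BLANK_ID = VOCAB["[PAD]"]  # assumption: the pad token doubles as the CTC blank
WORD_DELIMITER = "|"       # from tokenizer_config.json


def greedy_ctc_decode(frame_ids):
    """Collapse consecutive repeats, drop blanks, turn '|' into spaces."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != BLANK_ID:
            chars.append(ID_TO_TOKEN[idx])
        previous = idx
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Hypothetical per-frame argmax ids, as they might come out of the model.
frames = [39, 3, 3, 39, 9, 1, 1, 15, 39, 39, 0, 3, 39, 9, 1, 15]
decoded = greedy_ctc_decode(frames)
print(decoded)
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters: `[1, 39, 1]` decodes to `aa`, while `[1, 1]` decodes to `a`.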
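Note on `eval.py`: the script normalizes each reference sentence against the tokenizer vocabulary and then reports WER and CER. The sketch below reproduces that flow in plain Python under a few assumptions: it skips the project-specific `normalize_string` helper (its source is not part of this commit), it uses only a subset of the vocabulary, and it replaces `load_metric("wer")` and `load_metric("cer")` with a hand-written edit distance aggregated over the whole corpus. The names `edit_distance` and `error_rate` and the example sentences are hypothetical.

```python
import re

# Subset of the non-special tokens in vocab.json (assumption: enough for the demo).
TOKENS = list("abcdefghijklmnopqrstuvwxyzàèéíòóú")
# Same construction as in eval.py: anything that is neither whitespace
# nor a vocabulary character counts as invalid.
INVALID_CHARS = rf"[^\s{re.escape(''.join(sorted(set(TOKENS))))}]"


def normalize_text(text: str) -> str:
    """Lowercase, replace out-of-vocabulary characters, squeeze whitespace."""
    text = text.lower()
    text = re.sub(INVALID_CHARS, " ", text)
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (r != h))
            )
        previous = current
    return previous[-1]


def error_rate(references, predictions, split_words: bool) -> float:
    """Total edit operations divided by total reference length."""
    unit = (lambda s: s.split()) if split_words else list
    errors = sum(edit_distance(unit(r), unit(p)) for r, p in zip(references, predictions))
    total = sum(len(unit(r)) for r in references)
    return errors / total


targets = [normalize_text("Il gatto dorme."), normalize_text("Buongiorno a tutti!")]
predictions = ["il gato dorme", "buongiorno tutti"]
wer = error_rate(targets, predictions, split_words=True)
cer = error_rate(targets, predictions, split_words=False)
print(f"WER: {wer:.4f}  CER: {cer:.4f}")
```

Because errors are summed before dividing, long utterances weigh more than short ones. The README reports these metrics as percentages, whereas this sketch returns fractions.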