Initialize project; model provided by the ModelHub XC community

Model: bofenghuang/asr-wav2vec2-ctc-french Source: Original Platform
2026-05-21 11:36:18 +08:00
commit ddba1294a7
60 changed files with 519273 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,34 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,217 @@
+---
+license: apache-2.0
+language: fr
+library_name: transformers
+thumbnail: null
+tags:
+- automatic-speech-recognition
+- hf-asr-leaderboard
+- robust-speech-event
+- CTC
+- Wav2vec2
+datasets:
+- common_voice
+- mozilla-foundation/common_voice_11_0
+- facebook/multilingual_librispeech
+- facebook/voxpopuli
+- gigant/african_accented_french
+metrics:
+- wer
+model-index:
+- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
+  results:
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice 11.0
+      type: mozilla-foundation/common_voice_11_0
+      args: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 11.44
+    - name: Test WER (+LM)
+      type: wer
+      value: 9.66
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Multilingual LibriSpeech (MLS)
+      type: facebook/multilingual_librispeech
+      args: french
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 5.93
+    - name: Test WER (+LM)
+      type: wer
+      value: 5.13
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: VoxPopuli
+      type: facebook/voxpopuli
+      args: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 9.33
+    - name: Test WER (+LM)
+      type: wer
+      value: 8.51
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: African Accented French
+      type: gigant/african_accented_french
+      args: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 16.22
+    - name: Test WER (+LM)
+      type: wer
+      value: 15.39
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Robust Speech Event - Dev Data
+      type: speech-recognition-community-v2/dev_data
+      args: fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 16.56
+    - name: Test WER (+LM)
+      type: wer
+      value: 12.96
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Fleurs
+      type: google/fleurs
+      args: fr_fr
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 10.10
+    - name: Test WER (+LM)
+      type: wer
+      value: 8.84
+---
+
+# Fine-tuned wav2vec2-FR-7K-large model for ASR in French
+
+<style>
+img {
+ display: inline;
+}
+</style>
+
+![Model architecture](https://img.shields.io/badge/Model_Architecture-Wav2Vec2--CTC-lightgrey)
+![Model size](https://img.shields.io/badge/Params-315M-lightgrey)
+![Language](https://img.shields.io/badge/Language-French-lightgrey)
+
+This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large), trained on a composite dataset comprising over 2,200 hours of French speech audio, using the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is also sampled at 16 kHz.
+
+## Usage
+
+1. To use on a local audio file with the language model
+
+```python
+import torch
+import torchaudio
+
+from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
+
+device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+
+model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
+processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
+model_sample_rate = processor_with_lm.feature_extractor.sampling_rate
+
+wav_path = "example.wav"  # path to your audio file
+waveform, sample_rate = torchaudio.load(wav_path)
+waveform = waveform.squeeze(axis=0)  # drop the channel dimension (assumes mono audio)
+
+# resample
+if sample_rate != model_sample_rate:
+    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
+    waveform = resampler(waveform)
+
+# normalize
+input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
+
+with torch.inference_mode():
+    logits = model(input_dict.input_values.to(device)).logits
+
+predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
+```
+
+2. To use on a local audio file without the language model
+
+```python
+import torch
+import torchaudio
+
+from transformers import AutoModelForCTC, Wav2Vec2Processor
+
+device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+
+model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
+processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
+model_sample_rate = processor.feature_extractor.sampling_rate
+
+wav_path = "example.wav"  # path to your audio file
+waveform, sample_rate = torchaudio.load(wav_path)
+waveform = waveform.squeeze(axis=0)  # drop the channel dimension (assumes mono audio)
+
+# resample
+if sample_rate != model_sample_rate:
+    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
+    waveform = resampler(waveform)
+
+# normalize
+input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
+
+with torch.inference_mode():
+    logits = model(input_dict.input_values.to(device)).logits
+
+# decode
+predicted_ids = torch.argmax(logits, dim=-1)
+predicted_sentence = processor.batch_decode(predicted_ids)[0]
+```
+
+## Evaluation
+
+1. To evaluate on `mozilla-foundation/common_voice_11_0`
+
+```bash
+python eval.py \
+  --model_id "bhuang/asr-wav2vec2-french" \
+  --dataset "mozilla-foundation/common_voice_11_0" \
+  --config "fr" \
+  --split "test" \
+  --log_outputs \
+  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
+```
+
+2. To evaluate on `speech-recognition-community-v2/dev_data`
+
+```bash
+python eval.py \
+  --model_id "bhuang/asr-wav2vec2-french" \
+  --dataset "speech-recognition-community-v2/dev_data" \
+  --config "fr" \
+  --split "validation" \
+  --chunk_length_s 30.0 \
+  --stride_length_s 5.0 \
+  --log_outputs \
+  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
+```
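The second evaluation command above passes `--chunk_length_s 30.0` and `--stride_length_s 5.0`. As a rough illustration of what chunked inference with overlapping strides means, the sketch below computes chunk boundaries for a long recording. It assumes the stride is applied on both sides of every interior chunk, so consecutive chunks advance by `chunk_length - 2 * stride`. The function name `chunk_bounds` and its return layout are hypothetical; they are not part of `eval.py` or of any library.

```python
def chunk_bounds(n_samples, sampling_rate=16000, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end, left_stride, right_stride) tuples, all in samples.

    Assumption: the stride is applied on both sides of each chunk, except at
    the very beginning and very end of the recording.
    """
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must be larger than twice the stride")

    bounds = []
    for start in range(0, n_samples, step):
        end = min(start + chunk_len, n_samples)
        left = 0 if start == 0 else stride
        right = 0 if end == n_samples else stride
        bounds.append((start, end, left, right))
        if end == n_samples:
            break  # the last chunk already reaches the end of the audio
    return bounds


# A 70-second recording, expressed in seconds by using a sampling rate of 1.
for start, end, left, right in chunk_bounds(70, sampling_rate=1):
    print(f"chunk {start:>2}-{end:>2} s, keep {start + left:>2}-{end - right:>2} s")
```

Only the region between the strides of each chunk is kept, so the kept regions tile the recording without gaps while every frame is decoded with some surrounding context.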
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -0,0 +1,4 @@
+{
+  "</s>": 49,
+  "<s>": 48
+}
--- a/alphabet.json
+++ b/alphabet.json
@@ -0,0 +1 @@
+{"labels": [" ", "'", "-", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "\u00e0", "\u00e2", "\u00e4", "\u00e7", "\u00e8", "\u00e9", "\u00ea", "\u00eb", "\u00ee", "\u00ef", "\u00f1", "\u00f4", "\u00f6", "\u00f9", "\u00fb", "\u00fc", "\u00ff", "\u2047", "", "<s>", "</s>"], "is_bpe": false}
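For reference, a minimal sketch of greedy CTC decoding over this label set: consecutive repeated frame predictions are merged and blank frames are dropped. It assumes index 47 (the empty-string label, which matches `pad_token_id` in `config.json`) acts as the CTC blank. `LABELS` and `ctc_greedy_decode` are illustrative names; the real tokenizer additionally handles word delimiters and special tokens, which this sketch ignores.

```python
# Label list rebuilt from alphabet.json above (50 entries).
LABELS = (
    [" ", "'", "-"]
    + list("abcdefghijklmnopqrstuvwxyz")
    + list("àâäçèéêëîïñôöùûüÿ")
    + ["\u2047", "", "<s>", "</s>"]
)
BLANK_ID = 47  # assumption: the empty-string label is the CTC blank


def ctc_greedy_decode(frame_ids, labels=LABELS, blank_id=BLANK_ID):
    """Collapse a per-frame argmax sequence into text."""
    pieces = []
    previous = None
    for idx in frame_ids:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            pieces.append(labels[idx])
        previous = idx
    return "".join(pieces)


# "elle": the blank between the two "l" frames keeps the double letter.
print(ctc_greedy_decode([7, 14, 14, 47, 14, 7]))
```

The blank symbol is what lets CTC represent doubled letters: without a blank between them, the two `l` frames would be merged into one.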
--- a/config.json
+++ b/config.json
@@ -0,0 +1,115 @@
+{
+  "_name_or_path": "LeBenchmark/wav2vec2-FR-7K-large",
+  "activation_dropout": 0.05,
+  "adapter_kernel_size": 3,
+  "adapter_stride": 2,
+  "add_adapter": false,
+  "apply_spec_augment": true,
+  "architectures": [
+    "Wav2Vec2ForCTC"
+  ],
+  "attention_dropout": 0.05,
+  "bos_token_id": 1,
+  "classifier_proj_size": 256,
+  "codevector_dim": 256,
+  "contrastive_logits_temperature": 0.1,
+  "conv_bias": true,
+  "conv_dim": [
+    512,
+    512,
+    512,
+    512,
+    512,
+    512,
+    512
+  ],
+  "conv_kernel": [
+    10,
+    3,
+    3,
+    3,
+    3,
+    2,
+    2
+  ],
+  "conv_stride": [
+    5,
+    2,
+    2,
+    2,
+    2,
+    2,
+    2
+  ],
+  "ctc_loss_reduction": "mean",
+  "ctc_zero_infinity": true,
+  "diversity_loss_weight": 0.1,
+  "do_stable_layer_norm": true,
+  "eos_token_id": 2,
+  "feat_extract_activation": "gelu",
+  "feat_extract_dropout": 0.0,
+  "feat_extract_norm": "layer",
+  "feat_proj_dropout": 0.05,
+  "feat_quantizer_dropout": 0.0,
+  "final_dropout": 0.05,
+  "hidden_act": "gelu",
+  "hidden_dropout": 0.05,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "layerdrop": 0.1,
+  "mask_channel_length": 10,
+  "mask_channel_min_space": 1,
+  "mask_channel_other": 0.0,
+  "mask_channel_prob": 0.0,
+  "mask_channel_selection": "static",
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_min_space": 1,
+  "mask_time_other": 0.0,
+  "mask_time_prob": 0.05,
+  "mask_time_selection": "static",
+  "model_type": "wav2vec2",
+  "num_adapter_layers": 3,
+  "num_attention_heads": 16,
+  "num_codevector_groups": 2,
+  "num_codevectors_per_group": 320,
+  "num_conv_pos_embedding_groups": 16,
+  "num_conv_pos_embeddings": 128,
+  "num_feat_extract_layers": 7,
+  "num_hidden_layers": 24,
+  "num_negatives": 100,
+  "output_hidden_size": 1024,
+  "pad_token_id": 47,
+  "proj_codevector_dim": 256,
+  "tdnn_dilation": [
+    1,
+    2,
+    3,
+    1,
+    1
+  ],
+  "tdnn_dim": [
+    512,
+    512,
+    512,
+    512,
+    1500
+  ],
+  "tdnn_kernel": [
+    5,
+    3,
+    3,
+    1,
+    1
+  ],
+  "torch_dtype": "float32",
+  "transformers_version": "4.25.0.dev0",
+  "use_weighted_layer_sum": false,
+  "vocab_size": 50,
+  "xvector_output_dim": 512
+}
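The `conv_kernel` and `conv_stride` lists above determine how many output frames the feature encoder produces per second of audio. The sketch below works this out, assuming unpadded 1-D convolutions and the 16 kHz sampling rate stated in the README; the function names are illustrative only.

```python
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]  # "conv_kernel" from config.json
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]   # "conv_stride" from config.json
SAMPLING_RATE = 16_000                # from the README (16 kHz input)


def num_output_frames(n_samples):
    """Frames left after the convolution stack (assumes no padding)."""
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        n_samples = (n_samples - kernel) // stride + 1
    return n_samples


def receptive_field_and_hop():
    """Receptive field and hop size of one output frame, in input samples."""
    receptive_field, hop = 1, 1
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        receptive_field += (kernel - 1) * hop
        hop *= stride
    return receptive_field, hop


rf, hop = receptive_field_and_hop()
print(f"receptive field: {rf} samples ({1000 * rf / SAMPLING_RATE:.0f} ms)")
print(f"hop: {hop} samples ({1000 * hop / SAMPLING_RATE:.0f} ms)")
print(f"frames for 1 s of audio: {num_output_frames(SAMPLING_RATE)}")
```

Under these assumptions each output frame covers 400 samples (25 ms) and frames are spaced 320 samples (20 ms) apart, so one second of audio yields 49 frames.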
--- a/eval.py
+++ b/eval.py
@@ -0,0 +1,181 @@
+#!/usr/bin/env python
+
+import argparse
+import re
+from typing import Dict
+
+import torch
+from datasets import Audio, Dataset, load_dataset, load_metric
+
+from transformers import (
+    AutoConfig,
+    AutoFeatureExtractor,
+    AutoModelForCTC,
+    AutoTokenizer,
+    Wav2Vec2Processor,
+    Wav2Vec2ProcessorWithLM,
+    pipeline,
+)
+
+
+def log_results(result: Dataset, args: Dict[str, str]):
+    """ DO NOT CHANGE. This function computes and logs the result metrics. """
+
+    log_outputs = args.log_outputs
+    dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
+
+    # load metric
+    wer = load_metric("wer")
+    cer = load_metric("cer")
+
+    # compute metrics
+    wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
+    cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
+
+    # print & log results
+    result_str = f"WER: {wer_result}\n" f"CER: {cer_result}"
+    print(result_str)
+
+    with open(f"{dataset_id}_eval_results.txt", "w") as f:
+        f.write(result_str)
+
+    # log all results in text file. Possibly interesting for analysis
+    if log_outputs is not None:
+        pred_file = f"log_{dataset_id}_predictions.txt"
+        target_file = f"log_{dataset_id}_targets.txt"
+
+        with open(pred_file, "w") as p, open(target_file, "w") as t:
+
+            # mapping function to write output
+            def write_to_file(batch, i):
+                p.write(f"{i}" + "\n")
+                p.write(batch["prediction"] + "\n")
+                t.write(f"{i}" + "\n")
+                t.write(batch["target"] + "\n")
+
+            result.map(write_to_file, with_indices=True)
+
+
+def normalize_text(text: str, invalid_chars_regex: str) -> str:
+    """ DO ADAPT FOR YOUR USE CASE. this function normalizes the target text. """
+
+    text = text.lower()
+    text = re.sub(r"’|´|′|ʼ|‘|ʻ|`", "'", text)
+    text = re.sub(invalid_chars_regex, " ", text)
+    text = re.sub(r"\s+", " ", text).strip()
+
+    return text
+
+
+def main(args):
+    # load dataset
+    dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
+
+    # for testing: only process the first ten examples
+    # dataset = dataset.select(range(10))
+
+    # load processor
+    if args.greedy:
+        processor = Wav2Vec2Processor.from_pretrained(args.model_id)
+        decoder = None
+    else:
+        processor = Wav2Vec2ProcessorWithLM.from_pretrained(args.model_id)
+        decoder = processor.decoder
+
+    feature_extractor = processor.feature_extractor
+    tokenizer = processor.tokenizer
+    sampling_rate = feature_extractor.sampling_rate
+
+    # resample audio
+    dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
+
+    # load eval pipeline
+    if args.device is None:
+        args.device = 0 if torch.cuda.is_available() else -1
+
+    config = AutoConfig.from_pretrained(args.model_id)
+    model = AutoModelForCTC.from_pretrained(args.model_id)
+
+    # asr = pipeline("automatic-speech-recognition", model=args.model_id, device=args.device)
+    asr = pipeline(
+        "automatic-speech-recognition",
+        config=config,
+        model=model,
+        tokenizer=tokenizer,
+        feature_extractor=feature_extractor,
+        decoder=decoder,
+        device=args.device,
+    )
+
+    # build normalizer config
+    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+    tokens = [x for x in tokenizer.convert_ids_to_tokens(range(0, tokenizer.vocab_size))]
+    special_tokens = [
+        tokenizer.pad_token,
+        tokenizer.word_delimiter_token,
+        tokenizer.unk_token,
+        tokenizer.bos_token,
+        tokenizer.eos_token,
+    ]
+    non_special_tokens = [x for x in tokens if x not in special_tokens]
+    invalid_chars_regex = rf"[^\s{re.escape(''.join(set(non_special_tokens)))}]"
+    # normalize_to_lower = False
+    # for token in non_special_tokens:
+    #     if token.isalpha() and token.islower():
+    #         normalize_to_lower = True
+    #         break
+
+    # map function to decode audio
+    def map_to_pred(batch):
+        prediction = asr(batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s)
+
+        batch["prediction"] = prediction["text"]
+        batch["target"] = normalize_text(batch["sentence"], invalid_chars_regex)
+        return batch
+
+    # run inference on all examples
+    result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
+
+    # filtering out empty targets
+    result = result.filter(lambda example: example["target"] != "")
+
+    # compute and log_results
+    # do not change function below
+    log_results(result, args)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers")
+    parser.add_argument(
+        "--dataset",
+        type=str,
+        required=True,
+        help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets",
+    )
+    parser.add_argument("--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'`  for Common Voice")
+    parser.add_argument("--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`")
+    parser.add_argument(
+        "--chunk_length_s",
+        type=float,
+        default=None,
+        help="Chunk length in seconds. Defaults to None. For long audio files a good value would be 5.0 seconds.",
+    )
+    parser.add_argument(
+        "--stride_length_s",
+        type=float,
+        default=None,
+        help="Stride of the audio chunks. Defaults to None. For long audio files a good value would be 1.0 seconds.",
+    )
+    parser.add_argument("--log_outputs", action="store_true", help="If defined, write outputs to log file for analysis.")
+    parser.add_argument("--greedy", action="store_true", help="If defined, the LM will be ignored during inference.")
+    parser.add_argument(
+        "--device",
+        type=int,
+        default=None,
+        help="The device to run the pipeline on: -1 for CPU, 0 for the first GPU, and so on. Defaults to the first GPU if available, otherwise CPU.",
+    )
+    args = parser.parse_args()
+
+    main(args)
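To make the reference normalization in `eval.py` concrete, the sketch below reuses `normalize_text` as defined above and builds `invalid_chars_regex` from a hard-coded character set. `VOCAB_CHARS` is an assumption that stands in for the tokenizer's non-special tokens (the script derives them from the tokenizer at run time); everything else mirrors the script.

```python
import re


def normalize_text(text, invalid_chars_regex):
    """Same steps as normalize_text in eval.py."""
    text = text.lower()
    text = re.sub(r"’|´|′|ʼ|‘|ʻ|`", "'", text)     # unify apostrophe variants
    text = re.sub(invalid_chars_regex, " ", text)  # drop out-of-vocabulary chars
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text


# Assumption: stand-in for the model's non-special tokens.
VOCAB_CHARS = "'-abcdefghijklmnopqrstuvwxyzàâäçèéêëîïñôöùûüÿ"
allowed = re.escape("".join(sorted(set(VOCAB_CHARS))))
invalid_chars_regex = rf"[^\s{allowed}]"

print(normalize_text("L’été, c’est fini !", invalid_chars_regex))
print(normalize_text("Il a 3 chats", invalid_chars_regex))
```

Punctuation and digits are not in the vocabulary, so they are replaced by spaces before whitespace is collapsed; this is why a reference that consists only of such characters can end up empty and is filtered out before scoring.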
--- a/language_model/attrs.json
+++ b/language_model/attrs.json
@@ -0,0 +1 @@
+{"alpha": 0.5, "beta": 1.5, "unk_score_offset": -10.0, "score_boundary": true}
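The `alpha` and `beta` values above are the usual knobs of language-model-assisted beam search: `alpha` weights the language model score and `beta` rewards each emitted word. The sketch below shows the common shallow-fusion formulation under that assumption. It is a simplified illustration, not the decoder's actual scoring code, and `fused_score` and the example scores are hypothetical.

```python
ALPHA = 0.5  # "alpha" from attrs.json: language model weight
BETA = 1.5   # "beta" from attrs.json: per-word bonus


def fused_score(acoustic_logprob, lm_logprob, n_words, alpha=ALPHA, beta=BETA):
    """Combine acoustic and language model evidence for one hypothesis.

    Assumption: all scores are log-probabilities on the same log base.
    """
    return acoustic_logprob + alpha * lm_logprob + beta * n_words


# Two hypothetical beam candidates with made-up scores: the second one is
# slightly worse acoustically but much more plausible to the language model.
candidates = [
    ("il et parti", -10.0, -14.0, 3),
    ("il est parti", -10.5, -8.0, 3),
]
best = max(candidates, key=lambda c: fused_score(c[1], c[2], c[3]))
print(best[0])
```

With a larger `alpha` the language model has more influence on the ranking; `beta` counteracts the language model's tendency to prefer hypotheses with fewer words.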
--- a/language_model/lm_5gram_big.bin
+++ b/language_model/lm_5gram_big.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3d240fcf833130720ab9789729b2e510dafa012227a74800254b314a481f764a
+size 999781632
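Large files in this commit are stored as Git LFS pointers: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal parsing sketch, using the pointer above as input; `parse_lfs_pointer` is an illustrative helper, not part of any Git LFS tooling.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3d240fcf833130720ab9789729b2e510dafa012227a74800254b314a481f764a
size 999781632
"""


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


info = parse_lfs_pointer(POINTER)
print(f"{info['hash_algo']} digest, {info['size'] / 1e6:.1f} MB")
```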
--- a/language_model/unigrams.txt
+++ b/language_model/unigrams.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ef044c4666c50ec9169c38a1a5846f85ec2414e64b44b4d1bdc43bb4659756da
+size 1262012432
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@@ -0,0 +1,10 @@
+{
+  "do_normalize": true,
+  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "processor_class": "Wav2Vec2ProcessorWithLM",
+  "return_attention_mask": true,
+  "sampling_rate": 16000
+}
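With `do_normalize` set to `true`, the feature extractor rescales each waveform to zero mean and unit variance before it reaches the model. A plain-Python sketch of that operation follows; the small `eps` added to the variance is an assumption made here for numerical safety, and `zero_mean_unit_var` is an illustrative name.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Rescale a waveform to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n  # population variance
    scale = math.sqrt(variance + eps)                     # eps avoids division by zero
    return [(x - mean) / scale for x in samples]


normalized = zero_mean_unit_var([1.0, 2.0, 3.0])
print([round(x, 4) for x in normalized])
```

Because of this step, the absolute recording level of the input has little effect on the model; only the shape of the waveform matters.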
--- a/pytorch_model.bin
+++ b/pytorch_model.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c3f8ef2fa0ee8cdf063590f19f68dd038b326c78df4acb8bb52f6d9df1107b54
+size 1262103729
--- a/results_african_accented_french/gigant_african_accented_french_fr_test_eval_results.txt
+++ b/results_african_accented_french/gigant_african_accented_french_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.16223776223776223
+CER: 0.030996116879182616
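The result files store WER and CER as fractions, while the README front matter reports them as percentages (for example, 0.1622... here corresponds to 16.22). The sketch below shows a corpus-level word error rate computed from word-level edit distance, plus the conversion to a percentage. It is an independent illustration, not the implementation behind `load_metric("wer")`; all function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        current = [i]
        for j, h in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words (corpus level)."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        total += len(ref_words)
    return errors / total


def as_percent(fraction):
    """Convert a stored fraction to the percentage style used in the README."""
    return round(fraction * 100, 2)


print(word_error_rate(["le chat dort"], ["le chat mort"]))
print(as_percent(0.16223776223776223))
```

A character error rate can be sketched the same way by passing `list(ref)` and `list(hyp)` to `edit_distance` instead of word lists.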
--- a/results_african_accented_french/log_gigant_african_accented_french_fr_test_predictions.txt
+++ b/results_african_accented_french/log_gigant_african_accented_french_fr_test_predictions.txt
--- a/results_african_accented_french/log_gigant_african_accented_french_fr_test_targets.txt
+++ b/results_african_accented_french/log_gigant_african_accented_french_fr_test_targets.txt
--- a/results_african_accented_french_with_lm/gigant_african_accented_french_fr_test_eval_results.txt
+++ b/results_african_accented_french_with_lm/gigant_african_accented_french_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.15391976444608024
+CER: 0.029825672661177055
--- a/results_african_accented_french_with_lm/log_gigant_african_accented_french_fr_test_predictions.txt
+++ b/results_african_accented_french_with_lm/log_gigant_african_accented_french_fr_test_predictions.txt
--- a/results_african_accented_french_with_lm/log_gigant_african_accented_french_fr_test_targets.txt
+++ b/results_african_accented_french_with_lm/log_gigant_african_accented_french_fr_test_targets.txt
--- a/results_facebook_voxpopuli/facebook_voxpopuli_fr_test_eval_results.txt
+++ b/results_facebook_voxpopuli/facebook_voxpopuli_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.0933944140682725
+CER: 0.05192920390245901
--- a/results_facebook_voxpopuli/log_facebook_voxpopuli_fr_test_predictions.txt
+++ b/results_facebook_voxpopuli/log_facebook_voxpopuli_fr_test_predictions.txt
--- a/results_facebook_voxpopuli/log_facebook_voxpopuli_fr_test_targets.txt
+++ b/results_facebook_voxpopuli/log_facebook_voxpopuli_fr_test_targets.txt
--- a/results_facebook_voxpopuli_with_lm/facebook_voxpopuli_fr_test_eval_results.txt
+++ b/results_facebook_voxpopuli_with_lm/facebook_voxpopuli_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.08514007051024931
+CER: 0.051649188467461866
--- a/results_facebook_voxpopuli_with_lm/log_facebook_voxpopuli_fr_test_predictions.txt
+++ b/results_facebook_voxpopuli_with_lm/log_facebook_voxpopuli_fr_test_predictions.txt
--- a/results_facebook_voxpopuli_with_lm/log_facebook_voxpopuli_fr_test_targets.txt
+++ b/results_facebook_voxpopuli_with_lm/log_facebook_voxpopuli_fr_test_targets.txt
--- a/results_google_fleurs/google_fleurs_fr_fr_test_eval_results.txt
+++ b/results_google_fleurs/google_fleurs_fr_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.10104321907600596
+CER: 0.04789153974821727
--- a/results_google_fleurs/log_google_fleurs_fr_fr_test_predictions.txt
+++ b/results_google_fleurs/log_google_fleurs_fr_fr_test_predictions.txt
--- a/results_google_fleurs/log_google_fleurs_fr_fr_test_targets.txt
+++ b/results_google_fleurs/log_google_fleurs_fr_fr_test_targets.txt
--- a/results_google_fleurs_with_lm/google_fleurs_fr_fr_test_eval_results.txt
+++ b/results_google_fleurs_with_lm/google_fleurs_fr_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.08846497764530552
+CER: 0.04616016668133932
--- a/results_google_fleurs_with_lm/log_google_fleurs_fr_fr_test_predictions.txt
+++ b/results_google_fleurs_with_lm/log_google_fleurs_fr_fr_test_predictions.txt
--- a/results_google_fleurs_with_lm/log_google_fleurs_fr_fr_test_targets.txt
+++ b/results_google_fleurs_with_lm/log_google_fleurs_fr_fr_test_targets.txt
--- a/results_mozilla-foundatio_common_voice_11_0/log_mozilla-foundation_common_voice_11_0_fr_test_predictions.txt
+++ b/results_mozilla-foundatio_common_voice_11_0/log_mozilla-foundation_common_voice_11_0_fr_test_predictions.txt
--- a/results_mozilla-foundatio_common_voice_11_0/log_mozilla-foundation_common_voice_11_0_fr_test_targets.txt
+++ b/results_mozilla-foundatio_common_voice_11_0/log_mozilla-foundation_common_voice_11_0_fr_test_targets.txt
--- a/results_mozilla-foundatio_common_voice_11_0/mozilla-foundation_common_voice_11_0_fr_test_eval_results.txt
+++ b/results_mozilla-foundatio_common_voice_11_0/mozilla-foundation_common_voice_11_0_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.11441598191493416
+CER: 0.0338415442043304
--- a/results_mozilla-foundatio_common_voice_11_0_with_lm/log_mozilla-foundation_common_voice_11_0_fr_test_predictions.txt
+++ b/results_mozilla-foundatio_common_voice_11_0_with_lm/log_mozilla-foundation_common_voice_11_0_fr_test_predictions.txt
--- a/results_mozilla-foundatio_common_voice_11_0_with_lm/log_mozilla-foundation_common_voice_11_0_fr_test_targets.txt
+++ b/results_mozilla-foundatio_common_voice_11_0_with_lm/log_mozilla-foundation_common_voice_11_0_fr_test_targets.txt
--- a/results_mozilla-foundatio_common_voice_11_0_with_lm/mozilla-foundation_common_voice_11_0_fr_test_eval_results.txt
+++ b/results_mozilla-foundatio_common_voice_11_0_with_lm/mozilla-foundation_common_voice_11_0_fr_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.09662302035839927
+CER: 0.030784767445014533
--- a/results_multilingual_librispeech/facebook_multilingual_librispeech_french_test_eval_results.txt
+++ b/results_multilingual_librispeech/facebook_multilingual_librispeech_french_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.05938023091119791
+CER: 0.0251748962097513
--- a/results_multilingual_librispeech/log_facebook_multilingual_librispeech_french_test_predictions.txt
+++ b/results_multilingual_librispeech/log_facebook_multilingual_librispeech_french_test_predictions.txt
--- a/results_multilingual_librispeech/log_facebook_multilingual_librispeech_french_test_targets.txt
+++ b/results_multilingual_librispeech/log_facebook_multilingual_librispeech_french_test_targets.txt
--- a/results_multilingual_librispeech_with_lm/facebook_multilingual_librispeech_french_test_eval_results.txt
+++ b/results_multilingual_librispeech_with_lm/facebook_multilingual_librispeech_french_test_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.051321945147860426
+CER: 0.02437166220530059
--- a/results_multilingual_librispeech_with_lm/log_facebook_multilingual_librispeech_french_test_predictions.txt
+++ b/results_multilingual_librispeech_with_lm/log_facebook_multilingual_librispeech_french_test_predictions.txt
--- a/results_multilingual_librispeech_with_lm/log_facebook_multilingual_librispeech_french_test_targets.txt
+++ b/results_multilingual_librispeech_with_lm/log_facebook_multilingual_librispeech_french_test_targets.txt
--- a/results_speech-recognition-community-v2_dev_data/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
+++ b/results_speech-recognition-community-v2_dev_data/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
--- a/results_speech-recognition-community-v2_dev_data/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
+++ b/results_speech-recognition-community-v2_dev_data/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
--- a/results_speech-recognition-community-v2_dev_data/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
+++ b/results_speech-recognition-community-v2_dev_data/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.19954995499549955
+CER: 0.09941896000804636
--- a/results_speech-recognition-community-v2_dev_data_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
+++ b/results_speech-recognition-community-v2_dev_data_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
--- a/results_speech-recognition-community-v2_dev_data_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
+++ b/results_speech-recognition-community-v2_dev_data_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
--- a/results_speech-recognition-community-v2_dev_data_chunk30_stride5/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
+++ b/results_speech-recognition-community-v2_dev_data_chunk30_stride5/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.16566156615661567
+CER: 0.08239781510394503
--- a/results_speech-recognition-community-v2_dev_data_with_lm/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
+++ b/results_speech-recognition-community-v2_dev_data_with_lm/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
--- a/results_speech-recognition-community-v2_dev_data_with_lm/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
+++ b/results_speech-recognition-community-v2_dev_data_with_lm/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
--- a/results_speech-recognition-community-v2_dev_data_with_lm/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
+++ b/results_speech-recognition-community-v2_dev_data_with_lm/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.1477047704770477
+CER: 0.09565883436104942
--- a/results_speech-recognition-community-v2_dev_data_with_lm_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
+++ b/results_speech-recognition-community-v2_dev_data_with_lm_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_predictions.txt
--- a/results_speech-recognition-community-v2_dev_data_with_lm_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
+++ b/results_speech-recognition-community-v2_dev_data_with_lm_chunk30_stride5/log_speech-recognition-community-v2_dev_data_fr_validation_targets.txt
--- a/results_speech-recognition-community-v2_dev_data_with_lm_chunk30_stride5/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
+++ b/results_speech-recognition-community-v2_dev_data_with_lm_chunk30_stride5/speech-recognition-community-v2_dev_data_fr_validation_eval_results.txt
@@ -0,0 +1,2 @@
+WER: 0.12965796579657965
+CER: 0.08016959249831723
--- a/runs/Nov15_10-20-44/1668504308.6179943/events.out.tfevents.1668504308
+++ b/runs/Nov15_10-20-44/1668504308.6179943/events.out.tfevents.1668504308
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:44ebfbaea972eb37c367e6bb88cca55c38d6647a8df4273f30a581d0e8c0b6db
+size 5314
--- a/runs/Nov15_10-20-44/events.out.tfevents.1668504308
+++ b/runs/Nov15_10-20-44/events.out.tfevents.1668504308
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4aa083158f159b353248a4a673c4ee5d5f00104a31100de215a9249d452020b2
+size 233487
--- a/runs/Nov15_10-20-44/events.out.tfevents.1669372995
+++ b/runs/Nov15_10-20-44/events.out.tfevents.1669372995
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c25541a62eb090857c403318382553b9de67775f65faa0fc6a3c7d664f30629c
+size 364
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,232 @@
+{
+  "additional_special_tokens": [
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "pad_token": "[PAD]",
+  "unk_token": "[UNK]"
+}
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,14 @@
+{
+  "bos_token": "<s>",
+  "do_lower_case": false,
+  "eos_token": "</s>",
+  "model_max_length": 1000000000000000019884624838656,
+  "name_or_path": "outputs/big/wav2vec2-FR-7K-large-ft",
+  "pad_token": "[PAD]",
+  "processor_class": "Wav2Vec2ProcessorWithLM",
+  "replace_word_delimiter_char": " ",
+  "special_tokens_map_file": null,
+  "tokenizer_class": "Wav2Vec2CTCTokenizer",
+  "unk_token": "[UNK]",
+  "word_delimiter_token": "|"
+}
--- a/vocab.json
+++ b/vocab.json
@@ -0,0 +1,50 @@
+{
+  "'": 1,
+  "-": 2,
+  "[PAD]": 47,
+  "[UNK]": 46,
+  "a": 3,
+  "b": 4,
+  "c": 5,
+  "d": 6,
+  "e": 7,
+  "f": 8,
+  "g": 9,
+  "h": 10,
+  "i": 11,
+  "j": 12,
+  "k": 13,
+  "l": 14,
+  "m": 15,
+  "n": 16,
+  "o": 17,
+  "p": 18,
+  "q": 19,
+  "r": 20,
+  "s": 21,
+  "t": 22,
+  "u": 23,
+  "v": 24,
+  "w": 25,
+  "x": 26,
+  "y": 27,
+  "z": 28,
+  "|": 0,
+  "à": 29,
+  "â": 30,
+  "ä": 31,
+  "ç": 32,
+  "è": 33,
+  "é": 34,
+  "ê": 35,
+  "ë": 36,
+  "î": 37,
+  "ï": 38,
+  "ñ": 39,
+  "ô": 40,
+  "ö": 41,
+  "ù": 42,
+  "û": 43,
+  "ü": 44,
+  "ÿ": 45
+}
@@ -0,0 +1 @@
+{"labels": [" ", "'", "-", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "\u00e0", "\u00e2", "\u00e4", "\u00e7", "\u00e8", "\u00e9", "\u00ea", "\u00eb", "\u00ee", "\u00ef", "\u00f1", "\u00f4", "\u00f6", "\u00f9", "\u00fb", "\u00fc", "\u00ff", "\u2047", "", "<s>", "</s>"], "is_bpe": false}
@@ -0,0 +1 @@
+{"alpha": 0.5, "beta": 1.5, "unk_score_offset": -10.0, "score_boundary": true}