Initialize project; model provided by the ModelHub XC community

Model: aadel4/omniASR-CTC-1B-v2
Source: Original Platform
ModelHub XC
2026-05-12 05:37:32 +08:00
commit 5213b4ac72
8 changed files with 10526 additions and 0 deletions

35
.gitattributes vendored Normal file

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

68
README.md Normal file

@@ -0,0 +1,68 @@
---
library_name: transformers
tags:
- speech
- audio
- wav2vec2
- automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
---
# omniASR-CTC-1B-v2
Wav2Vec2 CTC ASR model (v2) converted from the [OmniLingual](https://github.com/facebookresearch/omnilingual-asr) fairseq2 checkpoint `omniASR_CTC_1B_v2`.
This model outputs CTC logits over a SentencePiece vocabulary and can transcribe speech in multiple languages.
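
For readers unfamiliar with CTC output, the following is a minimal sketch of greedy CTC collapsing: merge consecutive repeated frame predictions, then drop the blank symbol. It assumes the blank index is `0` (matching `pad_token_id` in this repo's `config.json`); the function name and the toy frame IDs are hypothetical and not part of the conversion code.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate IDs, then remove blank IDs."""
    # groupby yields one key per run of identical consecutive values
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


# Hypothetical per-frame argmax IDs (not real model output)
frames = [0, 5, 5, 0, 5, 7, 7, 0, 0, 3]
print(ctc_greedy_collapse(frames))  # [5, 5, 7, 3]
```

A blank between two identical IDs keeps them as separate tokens, which is why `5` appears twice in the result.
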
## Code base
The code base for the conversion can be found [here](https://github.com/ahmedadelattia/omnilingual_to_hf). I was only able to convert the 300M and 1B models due to GPU limitations. Contributions are welcome.
## Model details
| Property | Value |
|---|---|
| HF class | `Wav2Vec2ForCTC` |
| Encoder layers | 48 |
| Hidden size | 1280 |
| Attention heads | 16 |
| FFN intermediate | 5120 |
| Vocabulary size | 10288 |
| Source framework | fairseq2 |
| Source card | `omniASR_CTC_1B_v2` |
| Parity verification | ✅ Verified |

Numerical parity against the original fairseq2 checkpoint has been confirmed: outputs match to within `atol=1e-4` on a held-out audio sample.

Sample transcriptions on the held-out audio clip:

| Model | Transcript |
|---|---|
| fairseq2 (source) | `concord returned to its place amidst the tents` |
| HuggingFace (this repo) | `concord returned to its place amidst the tents` |
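
The parity criterion mentioned above (agreement within `atol=1e-4`) can be illustrated with a small framework-free sketch. The helper names and the sample values are hypothetical, and this is not necessarily how the conversion script performs its check. With tensors, `torch.allclose(a, b, atol=1e-4)` is a common alternative, though it also applies a relative tolerance by default.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def within_atol(reference, candidate, atol=1e-4):
    """True if every element differs by at most `atol`."""
    return max_abs_diff(reference, candidate) <= atol


# Hypothetical logits from the source and converted models (illustrative only)
source_logits = [0.1, -2.0, 3.5]
converted_logits = [0.10005, -2.00003, 3.5]
print(within_atol(source_logits, converted_logits))
```
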
## Usage
```python
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch, torchaudio

# Load the processor (feature extractor + tokenizer) and the CTC model
processor = AutoProcessor.from_pretrained("aadel4/omniASR-CTC-1B-v2")
model = Wav2Vec2ForCTC.from_pretrained("aadel4/omniASR-CTC-1B-v2")
model.eval()

# Load the audio and resample to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("audio.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

# squeeze() assumes a mono file, i.e. shape (1, num_samples)
inputs = processor(
    waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits  # (1, T, vocab)

# Greedy decoding: per-frame argmax, then the tokenizer collapses the CTC output
pred_ids = torch.argmax(logits, dim=-1)
transcript = processor.decode(pred_ids[0])
print(transcript)
```
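
To make the `T` dimension in the logits shape more concrete, here is a rough sketch that estimates the number of output frames from the `conv_kernel` and `conv_stride` values in this repo's `config.json`. It assumes the usual unpadded 1-D convolution length formula, `floor((L - kernel) / stride) + 1`, applied layer by layer; the function name is hypothetical.

```python
import math

# Values copied from config.json in this repository
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE = 16_000


def num_output_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Estimate the encoder frame count for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Unpadded convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


hop = math.prod(CONV_STRIDE)            # total downsampling factor
print(hop)                              # 320
print(1000 * hop / SAMPLE_RATE)         # 20.0 (milliseconds per frame)
print(num_output_frames(SAMPLE_RATE))   # 49 frames for one second of audio
```

Under these assumptions each logit frame covers a hop of roughly 20 ms. The sketch does not guard against inputs shorter than the convolutional receptive field.
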

106
config.json Normal file

@@ -0,0 +1,106 @@
{
"activation_dropout": 0.1,
"adapter_attn_dim": null,
"adapter_kernel_size": 3,
"adapter_stride": 2,
"add_adapter": false,
"apply_spec_augment": true,
"architectures": [
"Wav2Vec2ForCTC"
],
"attention_dropout": 0.1,
"bos_token_id": 1,
"classifier_proj_size": 256,
"codevector_dim": 256,
"contrastive_logits_temperature": 0.1,
"conv_bias": true,
"conv_dim": [
512,
512,
512,
512,
512,
512,
512
],
"conv_kernel": [
10,
3,
3,
3,
3,
2,
2
],
"conv_stride": [
5,
2,
2,
2,
2,
2,
2
],
"ctc_loss_reduction": "mean",
"ctc_zero_infinity": false,
"diversity_loss_weight": 0.1,
"do_stable_layer_norm": true,
"dtype": "float32",
"eos_token_id": 2,
"feat_extract_activation": "gelu",
"feat_extract_norm": "layer",
"feat_proj_dropout": 0.0,
"feat_quantizer_dropout": 0.0,
"final_dropout": 0.1,
"hidden_act": "gelu",
"hidden_dropout": 0.1,
"hidden_size": 1280,
"initializer_range": 0.02,
"intermediate_size": 5120,
"layer_norm_eps": 1e-05,
"layerdrop": 0.1,
"mask_feature_length": 10,
"mask_feature_min_masks": 0,
"mask_feature_prob": 0.0,
"mask_time_length": 10,
"mask_time_min_masks": 2,
"mask_time_prob": 0.05,
"model_type": "wav2vec2",
"num_adapter_layers": 3,
"num_attention_heads": 16,
"num_codevector_groups": 2,
"num_codevectors_per_group": 320,
"num_conv_pos_embedding_groups": 16,
"num_conv_pos_embeddings": 128,
"num_feat_extract_layers": 7,
"num_hidden_layers": 48,
"num_negatives": 100,
"output_hidden_size": 1280,
"pad_token_id": 0,
"proj_codevector_dim": 256,
"tdnn_dilation": [
1,
2,
3,
1,
1
],
"tdnn_dim": [
512,
512,
512,
512,
1500
],
"tdnn_kernel": [
5,
3,
3,
1,
1
],
"transformers_version": "5.3.0",
"use_weighted_layer_sum": false,
"vocab_size": 10288,
"xvector_output_dim": 512
}

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:157d809f748c1414bcbed7fb0eeb09dd4b04f11994298d1c99b08c6a975f40f0
size 3902806784

8
preprocessor_config.json Normal file

@@ -0,0 +1,8 @@
{
"feature_extractor_type": "Wav2Vec2FeatureExtractor",
"feature_size": 1,
"sampling_rate": 16000,
"padding_value": 0.0,
"do_normalize": true,
"return_attention_mask": false
}

6
special_tokens_map.json Normal file

@@ -0,0 +1,6 @@
{
"bos_token": "<s>",
"eos_token": "</s>",
"unk_token": "<unk>",
"pad_token": "<s>"
}

10
tokenizer_config.json Normal file

@@ -0,0 +1,10 @@
{
"tokenizer_class": "Wav2Vec2CTCTokenizer",
"unk_token": "<unk>",
"bos_token": "<s>",
"eos_token": "</s>",
"pad_token": "<s>",
"word_delimiter_token": " ",
"do_lower_case": false,
"replace_word_delimiter_char": " "
}

10290
vocab.json Normal file

File diff suppressed because it is too large.