Initialize project; model provided by the ModelHub XC community

Model: iic/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-04-13 14:31:06 +08:00
commit 6ad6638ba1
9 changed files with 983 additions and 0 deletions

32
.gitattributes vendored Normal file

@@ -0,0 +1,32 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*.tfevents* filter=lfs diff=lfs merge=lfs -text
*.db* filter=lfs diff=lfs merge=lfs -text
*.ark* filter=lfs diff=lfs merge=lfs -text
**/*ckpt*data* filter=lfs diff=lfs merge=lfs -text
**/*ckpt*.meta filter=lfs diff=lfs merge=lfs -text
**/*ckpt*.index filter=lfs diff=lfs merge=lfs -text

256
README.md Normal file

@@ -0,0 +1,256 @@
---
tasks:
- auto-speech-recognition
domain:
- audio
model-type:
- Non-autoregressive
frameworks:
- pytorch
backbone:
- transformer/conformer
metrics:
- CER
license: Apache License 2.0
language:
- cn
tags:
- FunASR
- Paraformer-Tiny
- Alibaba
- On Edge
datasets:
  train:
  - 300 hours Mandarin command word task
  test:
  - command word test set
indexing:
  results:
  - task:
      name: Automatic Speech Recognition
    dataset:
      name: 300 hours Mandarin command word task
      type: audio # optional
      args: 16k sampling rate, 8404 characters # optional
    metrics:
      - type: CER
        value: 0.24% # float
        description: greedy search, without lm, avg.
        args: default
      - type: RTF
        value: 0.1385 # float
        description: GPU inference on V100
        args: batch_size=1
widgets:
  - task: auto-speech-recognition
    inputs:
      - type: audio
        name: input
        title: Audio
    examples:
      - name: 1
        title: Example 1
        inputs:
          - name: input
            data: git://example/asr_example.wav
    inferencespec:
      cpu: 8 # number of CPUs
      memory: 4096
---
## Highlights
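- Command-word recognition: a small-vocabulary model for recognizing common smart-home voice commands.
- Lightweight: provides a validated Paraformer configuration with about 5M parameters and verifies the effect of share embedding.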
## <strong>[ModelScope-FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong>
<strong>[FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong> aims to build a bridge between academic research and industrial applications in speech recognition. By supporting training and fine-tuning of the industrial-grade speech recognition models released on ModelScope, it lets researchers and developers study and deploy speech recognition models more easily, and helps the speech recognition ecosystem grow.
[**What's New**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
| [**Installation**](https://github.com/alibaba-damo-academy/FunASR#installation)
| [**Documentation**](https://alibaba-damo-academy.github.io/FunASR/en/index.html)
| [**Chinese Tutorial**](https://github.com/alibaba-damo-academy/FunASR/wiki#funasr%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C)
| [**Service Deployment**](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime)
| [**Model Zoo**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md)
| [**Contact**](https://github.com/alibaba-damo-academy/FunASR#contact)
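## Introduction
Paraformer is an efficient non-autoregressive end-to-end speech recognition framework proposed by the DAMO Academy speech team. This project is the Paraformer general-purpose Chinese speech recognition model, trained on tens of thousands of hours of industrial-grade annotated audio to ensure general recognition performance. The model can be used in scenarios such as voice input methods, voice navigation, and intelligent meeting minutes.
Paraformer-Tiny-CommandWord is a small free-form command-word model built on the Paraformer framework. It aims to evaluate Paraformer under the parameter budget of on-device deployment (5M). The modeled vocabulary covers common command words in smart-home voice interaction, 544 Chinese characters and letters in total. Command words include but are not limited to "打开/关闭/调大/调小" (turn on / turn off / turn up / turn down) combined with "音乐/空调/照明" (music / air conditioner / lighting).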
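<p align="center">
<img src="fig/struct.png" alt="Paraformer-Tiny model structure" width="500" />
</p>
As shown in the figure above, Paraformer-Tiny has the same structure as Paraformer and consists of five parts: Encoder, Predictor, Sampler, Decoder, and Loss function. The Encoder uses a Conformer structure with about 3.5M parameters. The Predictor consists of one 1-D convolution layer and a linear layer; it predicts the number of target characters and extracts the acoustic vector for each target character. The Sampler has no learnable parameters; it produces semantically informed feature vectors from the input acoustic vectors and target vectors. The Decoder has the same structure as in Paraformer, with its parameter count limited to 1.6M to 1.8M. To shrink the Decoder further, Paraformer-Tiny enables embedding sharing, reusing the Decoder output-layer weights as the char embedding.
The core components are: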
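- Predictor module: a predictor based on Continuous Integrate-and-Fire (CIF) extracts the acoustic feature vector for each target character and predicts the number of target characters in the speech more accurately.
- Sampler: transforms acoustic feature vectors and target character vectors into feature vectors carrying semantic information through sampling, and works with a bidirectional Decoder to strengthen context modeling.
- Embedding Sharing (es): the Decoder output-layer weights are themselves a model of the text space; reusing them as the char embedding layer saves parameters while improving modeling capability.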
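The integrate-and-fire step behind the Predictor can be illustrated with a minimal pure-Python sketch. It is a simplification and not the FunASR implementation: the function name `cif_fire` is hypothetical, each per-frame weight is assumed to lie in [0, threshold] so that a frame fires at most once, and tail handling (the `tail_threshold` option in config.yaml) is omitted.

```python
def cif_fire(hidden, alphas, threshold=1.0):
    """Simplified continuous integrate-and-fire for one utterance.

    hidden: list of frame vectors (lists of floats), one per encoder frame.
    alphas: list of per-frame weights, each assumed to be in [0, threshold].
    Returns one integrated vector per fired token.
    """
    dim = len(hidden[0])
    acc_weight = 0.0            # weight integrated since the last fire
    acc_vec = [0.0] * dim       # weighted sum of frames since the last fire
    fired = []
    for frame, alpha in zip(hidden, alphas):
        if acc_weight + alpha < threshold:
            # Keep integrating: no token boundary inside this frame.
            acc_weight += alpha
            acc_vec = [v + alpha * x for v, x in zip(acc_vec, frame)]
        else:
            # Fire: use only the part of alpha needed to reach the threshold,
            # and carry the remainder over to the next token.
            used = threshold - acc_weight
            fired.append([v + used * x for v, x in zip(acc_vec, frame)])
            acc_weight = alpha - used
            acc_vec = [acc_weight * x for x in frame]
    return fired


# Four 1-dimensional frames with weight 0.5 each: the weights sum to 2,
# so two token vectors are emitted.
tokens = cif_fire([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.5, 0.5, 0.5])
print(tokens)  # [[1.5], [3.5]]
```

The number of fired vectors tracks the sum of the weights, which is why the same weights can be used to predict the number of target characters.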
For more details, see:
- Paper: [Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition](https://arxiv.org/abs/2206.08317)
- Paper walkthrough: [Paraformer: a single-pass non-autoregressive end-to-end speech recognition model with high accuracy and high computational efficiency](https://mp.weixin.qq.com/s/xQ87isj5_wxWiQs4qUXtVw)
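## How to use and train your own model
The pretrained model in this project is a general-domain recognition model trained on large-scale data. Developers can further customize it for their domain using ModelScope's fine-tuning features or the project's GitHub repository [FunASR](https://github.com/alibaba-damo-academy/FunASR).
### Developing in a Notebook
For users with development needs, we especially recommend using a Notebook for offline processing. Log in to your ModelScope account and click the "Open in Notebook" button at the top right of the model page. A dialog appears; on first use you are prompted to link an Alibaba Cloud account, so follow the prompts. After linking, go to the instance launch page, choose computing resources, and create an instance. Once the instance is created, enter the development environment and call the model.
#### Inference with ModelScope
- Inference supports the following audio input formats:
  - wav file path, e.g. data/test/audios/asr_example.wav
  - wav file URL, e.g. https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh_command.wav
  - wav binary data as bytes, e.g. bytes read directly from a file or recorded from a microphone.
  - already-parsed audio, e.g. `audio, rate = soundfile.read("asr_example_zh.wav")`, of type numpy.ndarray or torch.Tensor.
  - wav.scp file, which must follow this format: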
```sh
cat wav.scp
asr_example1 data/test/audios/asr_example1.wav
asr_example2 data/test/audios/asr_example2.wav
...
```
- If the input is a wav file URL, the API can be called as in the following example:
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch',
    model_revision="v2.0.4")

rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh_command.wav')
print(rec_result)
```
- If the input is a wav.scp file (note: the file name must end with .scp), you can add the output_dir parameter to write the recognition results to files and set the decoding batch_size, as in the following example:
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Write results to ./output_dir and decode with batch_size=32.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch',
    model_revision="v2.0.4",
    output_dir='./output_dir',
    batch_size=32)
inference_pipeline("wav.scp")
```
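The recognition results are written with the following directory structure:
```sh
tree output_dir/
output_dir/
└── 1best_recog
    ├── rtf
    ├── score
    └── text
1 directory, 3 files
```
rtf: timing statistics for the computation
score: recognition path scores
text: speech recognition results
- If the input is already-parsed audio, the API can be called as in the following example: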
```python
import soundfile

# Reuses the inference_pipeline object created in the wav URL example.
waveform, sample_rate = soundfile.read("asr_example_zh.wav")
rec_result = inference_pipeline(audio_in=waveform)
```
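The wav.scp format described above (one utterance id and one wav path per line, separated by a space) can be produced with a few lines of Python. This is a minimal sketch: the helper name `build_wav_scp` is hypothetical, and using the file name without its extension as the utterance id is an assumption that simply mirrors the sample listing.

```python
from pathlib import PurePosixPath


def build_wav_scp(wav_paths):
    """Return wav.scp text with one '<utterance_id> <wav_path>' line per file."""
    lines = []
    for path in wav_paths:
        # Assumption: the utterance id is the file name without its extension.
        utt_id = PurePosixPath(path).stem
        lines.append(f"{utt_id} {path}")
    return "\n".join(lines) + "\n"


scp_text = build_wav_scp([
    "data/test/audios/asr_example1.wav",
    "data/test/audios/asr_example2.wav",
])
print(scp_text, end="")
```

Writing `scp_text` to a file whose name ends in `.scp` gives an input in the shape the wav.scp example expects.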
#### Model performance
| Model                   | Infer Conf     | #Param (Enc/Dec) | CER % (Dev/Test) |
|-------------------------|----------------|------------------|------------------|
| Conformer_Transformer   | Beam=5_CTC=0.3 | 3.47M/1.59M      | 0.21/0.21        |
| Conformer_Paraformer    | Beam=5_CTC=0.3 | 3.47M/1.82M      | 0.27/0.28        |
| Conformer_Paraformer_es | Beam=1_CTC=0.0 | 3.47M/1.69M      | 0.27/0.24        |
| Conformer_Paraformer_es | Beam=5_CTC=0.3 | 3.47M/1.69M      | 0.25/0.24        |
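A rough back-of-the-envelope check of the decoder sizes in the table: sharing the embedding removes one vocabulary-by-dimension weight matrix. The sketch below assumes the decoder embedding width equals the encoder `output_size` of 256 from config.yaml, which the README does not state explicitly, so treat the result as an estimate.

```python
VOCAB_SIZE = 544   # number of entries in tokens.json
EMBED_DIM = 256    # assumption: decoder width equals encoder output_size


def shared_embedding_params(vocab_size, embed_dim):
    """Parameters in one vocab_size x embed_dim embedding matrix."""
    return vocab_size * embed_dim


saved = shared_embedding_params(VOCAB_SIZE, EMBED_DIM)
print(saved)  # 139264, about 0.14M parameters

# Decoder sizes (in millions) reported for Conformer_Paraformer with and
# without embedding sharing.
table_gap = round(1.82 - 1.69, 2)
print(table_gap)  # 0.13
```

Under that assumption the estimate of about 0.14M is roughly in line with the 0.13M gap between the two Paraformer rows.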
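#### Fine-tuning with ModelScope
Under development. A small-parameter Paraformer model trained on DAMO Academy's internal large-scale data will be provided as a base model for users to fine-tune.
### Developing on a local machine
#### Fine-tuning and inference with ModelScope
Custom fine-tuning and inference are supported on both ModelScope datasets and private datasets; usage is the same as in the Notebook workflow.
#### Training and inference with FunASR
The FunASR framework supports training and fine-tuning of the industrial-grade speech recognition models open-sourced on ModelScope, so that researchers and developers can study and deploy speech recognition models more conveniently. It is open source on GitHub: https://github.com/alibaba-damo-academy/FunASR
#### Installing FunASR
- Install FunASR and ModelScope ([details](https://github.com/alibaba-damo-academy/FunASR/wiki)):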
```sh
pip3 install -U modelscope
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```
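#### Inference
The following uses the AISHELL-1 dataset as an example to show how to run inference and fine-tuning with Paraformer-large in the FunASR framework.
```sh
cd egs_modelscope/aishell/paraformer/
# Set the parameters in paraformer_large_infer.sh:
#   ori_data: path to the original AISHELL-1 data
#   data_dir: data processing directory
#   exp_dir: output directory for results
#   model_name: model name
#   use_lm: whether to use an LM
#   beam_size: beam size
#   lm_weight: LM weight
# After editing the configuration, run:
sh ./paraformer_large_infer.sh
```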
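## Usage and scope
Supported platforms
- Runs on Linux-x86_64, Mac, and Windows.
Usage
- Direct inference: decode input audio directly and output the target text.
- Fine-tuning: load the trained model and train it further on private or open-source data.
Scope and target scenarios
- Suited to command-word recognition. If you need a small-parameter model with general ASR capability, we recommend fine-tuning this model on corresponding data.
## Model limitations and possible bias
Differences in feature-extraction pipelines and tools, and in training tools, can cause small differences in CER (<0.1%). Differences in the GPU inference environment can cause differences in the RTF values.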
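For readers who want to reproduce the two metrics on their own data, here is a minimal sketch of how CER and RTF are commonly computed. The function names are illustrative, the evaluation scripts behind the reported numbers may differ in details such as text normalization, and the sample strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: total edit errors / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: decoding time divided by audio duration."""
    return processing_seconds / audio_seconds


# Illustrative inputs only: one wrong character out of eight.
print(cer(["打开空调", "关闭音乐"], ["打开空调", "关闭音月"]))  # 0.125
```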
## Related papers and citation
```BibTeX
@inproceedings{gao2022paraformer,
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
booktitle={INTERSPEECH},
year={2022}
}
```

8
am.mvn Normal file

File diff suppressed because one or more lines are too long

124
config.yaml Normal file

@@ -0,0 +1,124 @@
# network architecture
model: Paraformer
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    predictor_weight: 1.0
    sampling_ratio: 0.4
    share_embedding: true
# encoder
encoder: ConformerEncoder
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 320
    num_blocks: 4
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: linear
    normalize_before: true
    rel_pos_type: latest
    pos_enc_layer_type: rel_pos
    selfattention_layer_type: rel_selfattn
    activation_type: swish
    macaron_style: true
    use_cnn_module: true
    cnn_module_kernel: 13
# decoder
decoder: ParaformerSANDecoder
decoder_conf:
    attention_heads: 4
    linear_units: 256
    num_blocks: 2
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
predictor: CifPredictorV2
predictor_conf:
    idim: 256
    threshold: 1.0
    l_order: 1
    r_order: 1
    tail_threshold: 0.45
    tail_mask: false
# frontend related
frontend: WavFrontend
frontend_conf:
    fs: 16000
    window: hamming
    n_mels: 80
    frame_length: 25
    frame_shift: 10
    lfr_m: 7
    lfr_n: 6
    dither: 0.0
specaug: SpecAugLFR
specaug_conf:
    apply_time_warp: false
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    lfr_rate: 6
    num_freq_mask: 1
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 12
    num_time_mask: 1
train_conf:
    accum_grad: 1
    grad_clip: 5
    max_epoch: 150
    val_scheduler_criterion:
    - valid
    - acc
    best_model_criterion:
    -   - valid
        - acc
        - max
    keep_nbest_models: 10
    log_interval: 50
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 30000
dataset: AudioDataset
dataset_conf:
    index_ds: IndexDSJsonl
    batch_sampler: DynamicBatchLocalShuffleSampler
    batch_type: example # example or length
    batch_size: 1 # if batch_type is example, batch_size is the numbers of samples; if length, batch_size is source_token_len+target_token_len;
    max_token_length: 2048 # filter samples if source_token_len+target_token_len > max_token_length,
    buffer_size: 500
    shuffle: True
    num_workers: 0
tokenizer: CharTokenizer
tokenizer_conf:
    unk_symbol: <unk>
    split_with_space: true
input_size: 560
ctc_conf:
    dropout_rate: 0.0
    ctc_type: builtin
    reduce: true
    ignore_nan_grad: true
normalize: null
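
The frontend settings in this file account for the `input_size` value. With low frame rate (LFR) stacking, `lfr_m` = 7 consecutive 80-dimensional filterbank frames are concatenated and the window advances by `lfr_n` = 6 frames, giving 80 × 7 = 560 features every 6 × 10 ms = 60 ms. The sketch below illustrates that stacking in pure Python. The function name `apply_lfr` and the padding choices (repeating the first frame on the left and the last frame on the right) are assumptions for illustration, not a copy of the FunASR frontend.

```python
import math


def apply_lfr(frames, lfr_m=7, lfr_n=6):
    """Stack lfr_m consecutive frames, advancing lfr_n frames per output."""
    left_pad = (lfr_m - 1) // 2
    # Assumed padding: repeat the first frame so early windows are centred.
    padded = [frames[0]] * left_pad + list(frames)
    num_out = math.ceil(len(frames) / lfr_n)
    stacked = []
    for i in range(num_out):
        window = padded[i * lfr_n : i * lfr_n + lfr_m]
        # Assumed padding: repeat the last frame if the window runs short.
        window += [padded[-1]] * (lfr_m - len(window))
        stacked.append([x for frame in window for x in frame])
    return stacked


# 100 frames of 80-dim features (10 ms shift, so about 1 s of audio).
features = [[float(t)] * 80 for t in range(100)]
lfr_features = apply_lfr(features)
print(len(lfr_features), len(lfr_features[0]))  # 17 560
```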

14
configuration.json Normal file

@@ -0,0 +1,14 @@
{
"framework": "pytorch",
"task" : "auto-speech-recognition",
"model": {"type" : "funasr"},
"pipeline": {"type":"funasr-pipeline"},
"model_name_in_hub": {
"ms":"iic/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch",
"hf":""},
"file_path_metas": {
"init_param":"model.pt",
"config":"config.yaml",
"tokenizer_conf": {"token_list": "tokens.json"},
"frontend_conf":{"cmvn_file": "am.mvn"}}
}

BIN
example/asr_example.wav Normal file

Binary file not shown.

BIN
fig/struct.png Normal file

Binary file not shown.


3
model.pt Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9ebb0cc06eacb5e7930dfcbe5428ebfc3ea585bffd54646d4ccdacb42a0bf21
size 21655987

546
tokens.json Normal file

@@ -0,0 +1,546 @@
[
"<blank>",
"<unk>",
"<s>",
"</s>",
"A",
"一",
"七",
"万",
"三",
"上",
"下",
"不",
"与",
"专",
"世",
"业",
"东",
"两",
"个",
"中",
"为",
"主",
"么",
"义",
"之",
"乐",
"九",
"也",
"书",
"了",
"事",
"二",
"于",
"五",
"些",
"交",
"产",
"京",
"亮",
"人",
"亿",
"什",
"今",
"从",
"他",
"付",
"代",
"以",
"们",
"价",
"企",
"休",
"会",
"传",
"但",
"低",
"体",
"作",
"你",
"使",
"便",
"保",
"信",
"停",
"健",
"儿",
"元",
"光",
"克",
"入",
"全",
"八",
"公",
"六",
"关",
"兴",
"其",
"具",
"内",
"再",
"冰",
"冲",
"况",
"冷",
"净",
"凉",
"减",
"出",
"分",
"切",
"列",
"删",
"利",
"别",
"到",
"制",
"前",
"力",
"功",
"加",
"务",
"动",
"助",
"劲",
"化",
"北",
"区",
"十",
"千",
"升",
"午",
"华",
"单",
"南",
"博",
"卡",
"卧",
"卫",
"厂",
"厅",
"压",
"原",
"去",
"友",
"发",
"取",
"变",
"口",
"句",
"叫",
"可",
"台",
"右",
"号",
"司",
"合",
"吊",
"同",
"名",
"后",
"向",
"吗",
"否",
"吧",
"听",
"启",
"吹",
"员",
"周",
"和",
"咖",
"品",
"哈",
"哪",
"唱",
"商",
"啡",
"啦",
"喷",
"器",
"囊",
"四",
"回",
"团",
"国",
"圈",
"在",
"地",
"场",
"坡",
"城",
"增",
"士",
"声",
"处",
"复",
"外",
"多",
"夜",
"大",
"天",
"太",
"央",
"头",
"女",
"好",
"如",
"妇",
"始",
"子",
"学",
"宁",
"安",
"定",
"宝",
"实",
"客",
"室",
"家",
"对",
"导",
"将",
"小",
"少",
"尔",
"就",
"屏",
"展",
"山",
"州",
"工",
"左",
"已",
"市",
"布",
"帘",
"带",
"帮",
"幕",
"干",
"平",
"年",
"幺",
"广",
"应",
"底",
"店",
"度",
"座",
"康",
"建",
"开",
"式",
"张",
"强",
"当",
"影",
"往",
"得",
"微",
"德",
"心",
"快",
"态",
"怎",
"思",
"性",
"恢",
"息",
"情",
"想",
"意",
"感",
"成",
"我",
"房",
"所",
"扇",
"手",
"打",
"扫",
"找",
"技",
"把",
"报",
"拉",
"换",
"掉",
"接",
"提",
"搜",
"携",
"摆",
"摇",
"播",
"支",
"收",
"改",
"放",
"政",
"故",
"数",
"整",
"文",
"斯",
"新",
"方",
"旅",
"无",
"日",
"早",
"时",
"明",
"星",
"是",
"显",
"晒",
"晚",
"智",
"暂",
"暖",
"暗",
"暴",
"曲",
"更",
"最",
"月",
"有",
"服",
"朝",
"期",
"本",
"机",
"杀",
"杆",
"村",
"来",
"林",
"果",
"架",
"查",
"样",
"档",
"桶",
"模",
"横",
"次",
"歌",
"止",
"正",
"步",
"每",
"毒",
"比",
"民",
"气",
"水",
"江",
"没",
"河",
"法",
"洁",
"洗",
"流",
"济",
"海",
"涂",
"消",
"淘",
"深",
"添",
"清",
"温",
"港",
"湖",
"湿",
"源",
"演",
"火",
"灯",
"点",
"烘",
"烟",
"烦",
"热",
"然",
"照",
"爱",
"片",
"牙",
"物",
"特",
"状",
"猫",
"王",
"现",
"理",
"生",
"用",
"甲",
"电",
"界",
"白",
"百",
"的",
"盖",
"目",
"直",
"相",
"看",
"眠",
"眼",
"着",
"睡",
"石",
"确",
"示",
"福",
"离",
"科",
"移",
"程",
"空",
"童",
"第",
"等",
"管",
"系",
"索",
"红",
"纵",
"线",
"经",
"结",
"给",
"统",
"继",
"续",
"网",
"置",
"美",
"老",
"者",
"而",
"联",
"能",
"臀",
"自",
"至",
"舒",
"航",
"色",
"节",
"花",
"菌",
"蓝",
"行",
"衣",
"表",
"被",
"西",
"要",
"见",
"视",
"觉",
"角",
"解",
"计",
"让",
"记",
"讲",
"设",
"话",
"询",
"语",
"说",
"请",
"读",
"调",
"负",
"购",
"资",
"赛",
"走",
"起",
"超",
"足",
"跑",
"跟",
"路",
"身",
"车",
"转",
"辅",
"辑",
"边",
"达",
"过",
"运",
"近",
"还",
"这",
"进",
"远",
"连",
"退",
"送",
"适",
"通",
"速",
"道",
"避",
"那",
"部",
"都",
"配",
"酒",
"醒",
"醛",
"释",
"里",
"重",
"量",
"金",
"钟",
"银",
"锁",
"镇",
"镜",
"长",
"门",
"闭",
"间",
"闹",
"防",
"阳",
"附",
"陈",
"降",
"限",
"除",
"随",
"集",
"雨",
"雪",
"零",
"雹",
"需",
"霜",
"静",
"面",
"音",
"顶",
"频",
"风",
"飞",
"食",
"餐",
"首",
"马",
"高",
"黄",
"齐",
"龙"
]
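
A minimal sketch of how a character-level token list like this one maps text to ids and back. It does not reproduce FunASR's CharTokenizer (for example, the `split_with_space` option is ignored); the function names are illustrative and the list is shortened to the first seven entries of tokens.json.

```python
# First seven entries of tokens.json; the list index is the token id.
TOKENS = ["<blank>", "<unk>", "<s>", "</s>", "A", "一", "七"]


def build_vocab(tokens):
    """Map each token string to its position in the list."""
    return {token: idx for idx, token in enumerate(tokens)}


def encode(text, token_to_id, unk_symbol="<unk>"):
    """Convert text to ids one character at a time, using <unk> for unknowns."""
    unk_id = token_to_id[unk_symbol]
    return [token_to_id.get(ch, unk_id) for ch in text]


def decode(ids, tokens):
    """Convert ids back to text by concatenating the tokens."""
    return "".join(tokens[i] for i in ids)


vocab = build_vocab(TOKENS)
ids = encode("一七A", vocab)
print(ids)                  # [5, 6, 4]
print(decode(ids, TOKENS))  # 一七A
print(encode("Z", vocab))   # [1], since "Z" is not in the list
```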