* Support using onnxruntime 1.16.0 with CUDA 11.4 on Jetson Orin NX.

The pre-built onnxruntime libs are provided by the community using the following command:

```bash
./build.sh --build_shared_lib --config Release --update \
  --build --parallel --use_cuda \
  --cuda_home /usr/local/cuda \
  --cudnn_home /usr/lib/aarch64-linux-gnu 2>&1 | tee my-log.txt
```

See also https://github.com/microsoft/onnxruntime/discussions/11226

---

Info about the board:

```
Model: NVIDIA Orin NX T801-16GB - Jetpack 5.1.4 [L4T 35.6.0]
```

```
nvidia@nvidia-desktop:~/Downloads$ head -n 1 /etc/nv_tegra_release
# R35 (release), REVISION: 6.0, GCID: 37391689, BOARD: t186ref, EABI: aarch64, DATE: Wed Aug 28 09:12:27 UTC 2024

nvidia@nvidia-desktop:~/Downloads$ uname -r
5.10.216-tegra

nvidia@nvidia-desktop:~/Downloads$ lsb_release -i -r
Distributor ID: Ubuntu
Release: 20.04

nvidia@nvidia-desktop:~/Downloads$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:43:33_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

nvidia@nvidia-desktop:~/Downloads$ dpkg -l libcudnn8
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version              Architecture Description
+++-==============-====================-============-=================================
ii  libcudnn8      8.6.0.166-1+cuda11.4 arm64        cuDNN runtime libraries

nvidia@nvidia-desktop:~/Downloads$ dpkg -l tensorrt
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version            Architecture Description
+++-==============-==================-============-=================================
ii  tensorrt       8.5.2.2-1+cuda11.4 arm64        Meta package for TensorRT
```
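When comparing another board against the one described above, it can help to extract the L4T version and the CUDA tag of the installed packages programmatically. The following is a minimal sketch, assuming the `/etc/nv_tegra_release` header and the `+cudaX.Y` package-version suffix look like the output shown above. The function names `parse_l4t_release` and `cuda_tag` are hypothetical helpers, not part of sherpa-onnx or onnxruntime.

```python
import re
from typing import Optional


def parse_l4t_release(header_line: str) -> str:
    """Return the L4T version, e.g. '35.6.0', from the first line of
    /etc/nv_tegra_release (format assumed from the output shown above)."""
    match = re.search(r"R(\d+) \(release\), REVISION: (\d+(?:\.\d+)*)", header_line)
    if match is None:
        raise ValueError(f"unrecognized release header: {header_line!r}")
    major, revision = match.groups()
    return f"{major}.{revision}"


def cuda_tag(package_version: str) -> Optional[str]:
    """Return the CUDA version embedded in a package version string,
    e.g. '11.4' from '8.6.0.166-1+cuda11.4', or None if there is no such suffix."""
    match = re.search(r"\+cuda(\d+\.\d+)$", package_version)
    return match.group(1) if match else None


# Values copied from the board information above.
header = (
    "# R35 (release), REVISION: 6.0, GCID: 37391689, BOARD: t186ref, "
    "EABI: aarch64, DATE: Wed Aug 28 09:12:27 UTC 2024"
)
print(parse_l4t_release(header))         # L4T version of the board
print(cuda_tag("8.6.0.166-1+cuda11.4"))  # CUDA tag of libcudnn8
print(cuda_tag("8.5.2.2-1+cuda11.4"))    # CUDA tag of tensorrt
```

On a real board, the header line would be read from `/etc/nv_tegra_release` and the version strings from `dpkg -l`; here they are hard-coded so the sketch runs anywhere.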
Supported functions
| Speech recognition | Speech synthesis |
|---|---|
| ✔️ | ✔️ |
| Speaker identification | Speaker diarization | Speaker verification |
|---|---|---|
| ✔️ | ✔️ | ✔️ |
| Spoken Language identification | Audio tagging | Voice activity detection |
|---|---|---|
| ✔️ | ✔️ | ✔️ |
| Keyword spotting | Add punctuation |
|---|---|
| ✔️ | ✔️ |
Supported platforms
| Architecture | Android | iOS | Windows | macOS | Linux | HarmonyOS |
|---|---|---|---|---|---|---|
| x64 | ✔️ | | ✔️ | ✔️ | ✔️ | ✔️ |
| x86 | ✔️ | | ✔️ | | | |
| arm64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| arm32 | ✔️ | | | | ✔️ | ✔️ |
| riscv64 | | | | | ✔️ | |
Supported programming languages
| 1. C++ | 2. C | 3. Python | 4. JavaScript |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
| 5. Java | 6. C# | 7. Kotlin | 8. Swift |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
| 9. Go | 10. Dart | 11. Rust | 12. Pascal |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
For Rust support, please see sherpa-rs.
It also supports WebAssembly.
Introduction
This repository supports running the following functions locally:
- Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
- Text-to-speech (i.e., TTS)
- Speaker diarization
- Speaker identification
- Speaker verification
- Spoken language identification
- Audio tagging
- VAD (e.g., silero-vad)
- Keyword spotting
on the following platforms and operating systems:
- x86, x86_64, 32-bit ARM, 64-bit ARM (arm64, aarch64), RISC-V (riscv64)
- Linux, macOS, Windows, openKylin
- Android, WearOS
- iOS
- HarmonyOS
- NodeJS
- WebAssembly
- Raspberry Pi
- RV1126
- LicheePi4A
- VisionFive 2
- 旭日X3派
- 爱芯派
- etc
with the following APIs
- C++, C, Python, Go, C#
- Java, Kotlin, JavaScript
- Swift, Rust
- Dart, Object Pascal
Links for Huggingface Spaces
You can visit the following Huggingface spaces to try sherpa-onnx without installing anything. All you need is a browser.
| Description | URL |
|---|---|
| Speaker diarization | Click me |
| Speech recognition | Click me |
| Speech recognition with Whisper | Click me |
| Speech synthesis | Click me |
| Generate subtitles | Click me |
| Audio tagging | Click me |
| Spoken language identification with Whisper | Click me |
We also have spaces built using WebAssembly. They are listed below:
| Description | Huggingface space | ModelScope space |
|---|---|---|
| Voice activity detection with silero-vad | Click me | Address |
| Real-time speech recognition (Chinese + English) with Zipformer | Click me | Address |
| Real-time speech recognition (Chinese + English) with Paraformer | Click me | Address |
| Real-time speech recognition (Chinese + English + Cantonese) with Paraformer-large | Click me | Address |
| Real-time speech recognition (English) | Click me | Address |
| VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with SenseVoice | Click me | Address |
| VAD + speech recognition (English) with Whisper tiny.en | Click me | Address |
| VAD + speech recognition (English) with Moonshine tiny | Click me | Address |
| VAD + speech recognition (English) with Zipformer trained with GigaSpeech | Click me | Address |
| VAD + speech recognition (Chinese) with Zipformer trained with WenetSpeech | Click me | Address |
| VAD + speech recognition (Japanese) with Zipformer trained with ReazonSpeech | Click me | Address |
| VAD + speech recognition (Thai) with Zipformer trained with GigaSpeech2 | Click me | Address |
| VAD + speech recognition (Chinese, various dialects) with a TeleSpeech-ASR CTC model | Click me | Address |
| VAD + speech recognition (English + Chinese, plus various Chinese dialects) with Paraformer-large | Click me | Address |
| VAD + speech recognition (English + Chinese, plus various Chinese dialects) with Paraformer-small | Click me | Address |
| Speech synthesis (English) | Click me | Address |
| Speech synthesis (German) | Click me | Address |
| Speaker diarization | Click me | Address |
Links for pre-built Android APKs
You can find pre-built Android APKs for this repository in the following table:
| Description | URL | For users in China |
|---|---|---|
| Speaker diarization | Address | Click here |
| Streaming speech recognition | Address | Click here |
| Text-to-speech | Address | Click here |
| Voice activity detection (VAD) | Address | Click here |
| VAD + non-streaming speech recognition | Address | Click here |
| Two-pass speech recognition | Address | Click here |
| Audio tagging | Address | Click here |
| Audio tagging (WearOS) | Address | Click here |
| Speaker identification | Address | Click here |
| Spoken language identification | Address | Click here |
| Keyword spotting | Address | Click here |
Links for pre-built Flutter APPs
Real-time speech recognition
| Description | URL | For users in China |
|---|---|---|
| Streaming speech recognition | Address | Click here |
Text-to-speech
| Description | URL | For users in China |
|---|---|---|
| Android (arm64-v8a, armeabi-v7a, x86_64) | Address | Click here |
| Linux (x64) | Address | Click here |
| macOS (x64) | Address | Click here |
| macOS (arm64) | Address | Click here |
| Windows (x64) | Address | Click here |
Note: You need to build from source for iOS.
Links for pre-built Lazarus APPs
Links for pre-trained models
| Description | URL |
|---|---|
| Speech recognition (speech to text, ASR) | Address |
| Text-to-speech (TTS) | Address |
| VAD | Address |
| Keyword spotting | Address |
| Audio tagging | Address |
| Speaker identification (Speaker ID) | Address |
| Spoken language identification (Language ID) | See multi-lingual Whisper ASR models from Speech recognition |
| Punctuation | Address |
| Speaker segmentation | Address |
Some pre-trained ASR models (Streaming)
Please see
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/index.html
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-paraformer/index.html
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-ctc/index.html
for more models. The following table lists only SOME of them.
| Name | Supported Languages | Description |
|---|---|---|
| sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 | Chinese | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 | English | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-korean-2024-06-16 | Korean | See also |
| sherpa-onnx-streaming-zipformer-fr-2023-04-14 | French | See also |
Some pre-trained ASR models (Non-Streaming)
Please see
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-transducer/index.html
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-paraformer/index.html
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-ctc/index.html
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/telespeech/index.html
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/whisper/index.html
for more models. The following table lists only SOME of them.
| Name | Supported Languages | Description |
|---|---|---|
| Whisper tiny.en | English | See also |
| Moonshine tiny | English | See also |
| sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 | Chinese, Cantonese, English, Korean, Japanese | Supports various Chinese dialects. See also |
| sherpa-onnx-paraformer-zh-2024-03-09 | Chinese, English | Also supports various Chinese dialects. See also |
| sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 | Japanese | See also |
| sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-zipformer-ru-2024-09-18 | Russian | See also |
| sherpa-onnx-zipformer-korean-2024-06-24 | Korean | See also |
| sherpa-onnx-zipformer-thai-2024-06-20 | Thai | See also |
| sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04 | Chinese | Supports various dialects. See also |
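Most model names in the two tables above end with a release date, and the streaming ones contain the word `streaming`. The sketch below reads those two properties from a name. It assumes this naming pattern, which is inferred only from the names listed here and is not a documented guarantee; `describe_model` and `ModelInfo` are hypothetical names used for illustration.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class ModelInfo(NamedTuple):
    streaming: bool           # True if the name contains "-streaming-"
    released: Optional[date]  # trailing YYYY-MM-DD, if the name has one


def describe_model(name: str) -> ModelInfo:
    """Infer basic properties from a model name (assumed naming pattern)."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})$", name)
    released = date(*map(int, match.groups())) if match else None
    return ModelInfo(streaming="-streaming-" in name, released=released)


# Names copied from the tables above.
for name in (
    "sherpa-onnx-streaming-zipformer-korean-2024-06-16",
    "sherpa-onnx-zipformer-korean-2024-06-24",
    "Whisper tiny.en",
):
    print(name, describe_model(name))
```

Names that do not follow the pattern, such as `Whisper tiny.en`, simply yield no date and are treated as non-streaming.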
Useful links
- Documentation: https://k2-fsa.github.io/sherpa/onnx/
- Bilibili demo videos: https://search.bilibili.com/all?keyword=%E6%96%B0%E4%B8%80%E4%BB%A3Kaldi
How to reach us
Please see https://k2-fsa.github.io/sherpa/social-groups.html for the Next-gen Kaldi (新一代 Kaldi) WeChat and QQ discussion groups.
Projects using sherpa-onnx
Open-LLM-VTuber
Talk to any LLM with hands-free voice interaction, voice interruption, and a Live2D talking face, running locally across platforms.
See also https://github.com/t41372/Open-LLM-VTuber/pull/50
voiceapi
Streaming ASR and TTS based on FastAPI
It shows how to use the ASR and TTS Python APIs with FastAPI.
腾讯会议摸鱼工具 TMSpeech
It uses streaming ASR in C# with a graphical user interface.
Video demo in Chinese: 【开源】Windows实时字幕软件(网课/开会必备)
lol互动助手
It uses the JavaScript API of sherpa-onnx along with Electron.
Video demo in Chinese: 爆了!炫神教你开打字挂!真正影响胜率的英雄联盟工具!英雄联盟的最后一块拼图!和游戏中的每个人无障碍沟通!