If the input audio file is shorter than 10 seconds, it produces only one chunk, so there is no need to compute speaker embeddings or run clustering. The segmentation result from the speaker segmentation model can be used directly.
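The shortcut for short files can be sketched as follows. This is a minimal illustration, not the sherpa-onnx implementation: `diarize`, `segment_chunk`, and `embed_and_cluster` are hypothetical names, the two callables stand in for the segmentation model and the embedding + clustering stage, and non-overlapping chunks are assumed for simplicity (a real pipeline may use overlapping windows).

```python
from typing import Callable, List, Tuple

# (start_sec, end_sec, speaker_index); hypothetical type for this sketch
Segment = Tuple[float, float, int]

CHUNK_SECONDS = 10.0  # chunk length mentioned in the text


def diarize(
    duration: float,
    segment_chunk: Callable[[float, float], List[Segment]],
    embed_and_cluster: Callable[[List[List[Segment]]], List[Segment]],
) -> List[Segment]:
    """Return speaker segments for an audio file of `duration` seconds.

    segment_chunk(start, end) stands in for the speaker segmentation model;
    its speaker indices are only meaningful within that chunk.
    embed_and_cluster(chunks) stands in for the stage that maps the
    per-chunk speaker indices to global ones.
    """
    if duration < CHUNK_SECONDS:
        # Single chunk: local speaker indices are already global,
        # so embeddings and clustering are skipped entirely.
        return segment_chunk(0.0, duration)

    # Longer audio: segment each chunk, then unify speakers across chunks.
    chunks: List[List[Segment]] = []
    start = 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        chunks.append(segment_chunk(start, end))
        start = end
    return embed_and_cluster(chunks)
```

The reason the shortcut is safe is that clustering exists only to reconcile speaker labels between chunks; with one chunk there is nothing to reconcile.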
Supported functions
| Speech recognition | Speech synthesis |
|---|---|
| ✔️ | ✔️ |
| Speaker identification | Speaker diarization | Speaker verification |
|---|---|---|
| ✔️ | ✔️ | ✔️ |
| Spoken Language identification | Audio tagging | Voice activity detection |
|---|---|---|
| ✔️ | ✔️ | ✔️ |
| Keyword spotting | Add punctuation |
|---|---|
| ✔️ | ✔️ |
Supported platforms
| Architecture | Android | iOS | Windows | macOS | Linux |
|---|---|---|---|---|---|
| x64 | ✔️ | | ✔️ | ✔️ | ✔️ |
| x86 | ✔️ | | ✔️ | | |
| arm64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| arm32 | ✔️ | | | | ✔️ |
| riscv64 | | | | | ✔️ |
Supported programming languages
| 1. C++ | 2. C | 3. Python | 4. JavaScript |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
| 5. Java | 6. C# | 7. Kotlin | 8. Swift |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
| 9. Go | 10. Dart | 11. Rust | 12. Pascal |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
For Rust support, please see sherpa-rs.
It also supports WebAssembly.
Introduction
This repository supports running the following functions locally:
- Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
- Text-to-speech (i.e., TTS)
- Speaker diarization
- Speaker identification
- Speaker verification
- Spoken language identification
- Audio tagging
- VAD (e.g., silero-vad)
- Keyword spotting
on the following platforms and operating systems:
- x86, x86_64, 32-bit ARM, 64-bit ARM (arm64, aarch64), RISC-V (riscv64)
- Linux, macOS, Windows, openKylin
- Android, WearOS
- iOS
- NodeJS
- WebAssembly
- Raspberry Pi
- RV1126
- LicheePi4A
- VisionFive 2
- Horizon Sunrise X3 Pi (旭日X3派)
- AXera-Pi (爱芯派)
- etc.
with the following APIs:
- C++, C, Python, Go, C#
- Java, Kotlin, JavaScript
- Swift, Rust
- Dart, Object Pascal
Links for Huggingface Spaces
You can visit the following Huggingface spaces to try sherpa-onnx without
installing anything. All you need is a browser.
| Description | URL |
|---|---|
| Speech recognition | Click me |
| Speech recognition with Whisper | Click me |
| Speech synthesis | Click me |
| Generate subtitles | Click me |
| Audio tagging | Click me |
| Spoken language identification with Whisper | Click me |
We also have spaces built using WebAssembly. They are listed below:
| Description | Huggingface space | ModelScope space |
|---|---|---|
| Voice activity detection with silero-vad | Click me | Address |
| Real-time speech recognition (Chinese + English) with Zipformer | Click me | Address |
| Real-time speech recognition (Chinese + English) with Paraformer | Click me | Address |
| Real-time speech recognition (Chinese + English + Cantonese) with Paraformer-large | Click me | Address |
| Real-time speech recognition (English) | Click me | Address |
| VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with SenseVoice | Click me | Address |
| VAD + speech recognition (English) with Whisper tiny.en | Click me | Address |
| VAD + speech recognition (English) with Zipformer trained with GigaSpeech | Click me | Address |
| VAD + speech recognition (Chinese) with Zipformer trained with WenetSpeech | Click me | Address |
| VAD + speech recognition (Japanese) with Zipformer trained with ReazonSpeech | Click me | Address |
| VAD + speech recognition (Thai) with Zipformer trained with GigaSpeech2 | Click me | Address |
| VAD + speech recognition (Chinese, various dialects) with a TeleSpeech-ASR CTC model | Click me | Address |
| VAD + speech recognition (English + Chinese, and various Chinese dialects) with Paraformer-large | Click me | Address |
| VAD + speech recognition (English + Chinese, and various Chinese dialects) with Paraformer-small | Click me | Address |
| Speech synthesis (English) | Click me | Address |
| Speech synthesis (German) | Click me | Address |
| Speaker diarization | Click me | Address |
Links for pre-built Android APKs
| Description | URL | For users in China |
|---|---|---|
| Streaming speech recognition | Address | Click here |
| Text-to-speech | Address | Click here |
| Voice activity detection (VAD) | Address | Click here |
| VAD + non-streaming speech recognition | Address | Click here |
| Two-pass speech recognition | Address | Click here |
| Audio tagging | Address | Click here |
| Audio tagging (WearOS) | Address | Click here |
| Speaker identification | Address | Click here |
| Spoken language identification | Address | Click here |
| Keyword spotting | Address | Click here |
Links for pre-built Flutter APPs
Real-time speech recognition
| Description | URL | For users in China |
|---|---|---|
| Streaming speech recognition | Address | Click here |
Text-to-speech
| Description | URL | For users in China |
|---|---|---|
| Android (arm64-v8a, armeabi-v7a, x86_64) | Address | Click here |
| Linux (x64) | Address | Click here |
| macOS (x64) | Address | Click here |
| macOS (arm64) | Address | Click here |
| Windows (x64) | Address | Click here |
Note: You need to build from source for iOS.
Links for pre-built Lazarus APPs
Generating subtitles
| Description | URL | For users in China |
|---|---|---|
| Generate subtitles | Address | Click here |
Links for pre-trained models
| Description | URL |
|---|---|
| Speech recognition (speech to text, ASR) | Address |
| Text-to-speech (TTS) | Address |
| VAD | Address |
| Keyword spotting | Address |
| Audio tagging | Address |
| Speaker identification (Speaker ID) | Address |
| Spoken language identification (Language ID) | See multi-lingual Whisper ASR models from Speech recognition |
| Punctuation | Address |
| Speaker segmentation | Address |
Useful links
- Documentation: https://k2-fsa.github.io/sherpa/onnx/
- Bilibili demo videos: https://search.bilibili.com/all?keyword=%E6%96%B0%E4%B8%80%E4%BB%A3Kaldi
How to reach us
Please see https://k2-fsa.github.io/sherpa/social-groups.html for the Next-gen Kaldi (新一代 Kaldi) WeChat and QQ discussion groups.
Projects using sherpa-onnx
voiceapi
Streaming ASR and TTS based on FastAPI
It shows how to use the ASR and TTS Python APIs with FastAPI.
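TMSpeech (腾讯会议摸鱼工具, a tool for slacking off during Tencent Meeting calls)
It uses streaming ASR in C# with a graphical user interface.
Video demo in Chinese: 【开源】Windows实时字幕软件(网课/开会必备) ("[Open source] Real-time subtitle software for Windows, a must-have for online classes and meetings")
lol互动助手 (LoL interaction assistant)
It uses the JavaScript API of sherpa-onnx along with Electron.
Video demo in Chinese: 爆了!炫神教你开打字挂!真正影响胜率的英雄联盟工具!英雄联盟的最后一块拼图!和游戏中的每个人无障碍沟通! (roughly: "Xuanshen shows you a 'typing cheat'! A League of Legends tool that really affects your win rate! The last piece of the League of Legends puzzle! Communicate with everyone in the game without barriers!")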