|
|
--- |
|
|
datasets: |
|
|
- openslr/librispeech_asr |
|
|
- facebook/multilingual_librispeech |
|
|
language: |
|
|
- en |
|
|
- fr |
|
|
- de |
|
|
- pt |
|
|
- es |
|
|
metrics: |
|
|
- wer |
|
|
base_model: |
|
|
- openai/whisper-large-v2 |
|
|
- openai/whisper-small |
|
|
- openai/whisper-base |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
tags: |
|
|
- streaming |
|
|
- asr |
|
|
- Transformer |
|
|
- encoder-decoder |
|
|
- pytorch |
|
|
- audio |
|
|
- speech |
|
|
- Whisper |
|
|
model-index: |
|
|
- name: CarelessWhisper-large-v2 |
|
|
results: |
|
|
- task: |
|
|
type: streaming-transcription-chunk-300msec |
|
|
dataset: |
|
|
name: test-clean |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 5.29 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 6 |
|
|
- task: |
|
|
type: streaming-transcription-chunk-300msec |
|
|
dataset: |
|
|
name: test-other |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 10.74 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 11.38 |
|
|
- task: |
|
|
type: streaming-transcription-chunk-200msec |
|
|
dataset: |
|
|
name: test-clean |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 5.92 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 6.63 |
|
|
- task: |
|
|
type: streaming-transcription-chunk-200msec |
|
|
dataset: |
|
|
name: test-other |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 11.41 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 12.6 |
|
|
- task: |
|
|
type: streaming-transcription-chunk-100msec |
|
|
dataset: |
|
|
name: test-clean |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 6.33 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 7.76 |
|
|
- task: |
|
|
type: streaming-transcription-chunk-100msec |
|
|
dataset: |
|
|
name: test-other |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 13.06 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 14.99 |
|
|
- task: |
|
|
type: streaming-transcription-chunk-40msec |
|
|
dataset: |
|
|
name: test-clean |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 7.76 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 9.94 |
|
|
- task: |
|
|
type: streaming-transcription-chunk-40msec |
|
|
dataset: |
|
|
name: test-other |
|
|
type: LibriSpeech |
|
|
metrics: |
|
|
- name: Word Error Rate (WER) [%] |
|
|
type: Word Error Rate (WER) [%] |
|
|
value: 16.73 |
|
|
- name: Aligned-Relative Word Error Rate (ARWER) [%] |
|
|
type: Aligned-Relative Word Error Rate (ARWER) [%]
|
|
value: 19.28 |
|
|
--- |
|
|
# CarelessWhisper - Causal Whisper Streaming Model |
|
|
CarelessWhisper (Causal Whisper Streaming) is a fine-tuned version of OpenAI Whisper that handles causal (streaming) audio and performs real-time transcription.
|
|
|
|
|
[Paper](https://arxiv.org/abs/2508.12301) · [Demo (Hugging Face Space)](https://huggingface.co/spaces/MLSpeech/CarelessWhisper-causal-streaming)
|
|
|
|
|
## Paper
|
|
|
|
|
For more details, see our [paper](https://arxiv.org/abs/2508.12301). |
|
|
|
|
|
## Setup
|
|
We used Python 3.9.16, PyTorch 2.6.0, and PyTorch-Lightning 2.5.0 to train and test our models. |
|
|
Portions of this code are adapted from [OpenAI's Whisper](https://github.com/openai/whisper). |
|
|
|
|
|
To set up the project environment using `conda`, follow these steps: |
|
|
|
|
|
1. **Clone the repository** |
|
|
```bash |
|
|
git clone https://github.com/tomer9080/CarelessWhisper-streaming |
|
|
cd CarelessWhisper-streaming |
|
|
``` |
|
|
|
|
|
> **Note:** Make sure you have [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/products/distribution) installed before proceeding.
|
|
|
|
|
2. **Create the conda environment** |
|
|
```bash |
|
|
conda env create -f environment.yml |
|
|
``` |
|
|
|
|
|
3. **Activate the environment**
|
|
```bash |
|
|
conda activate careless_whisper |
|
|
``` |
|
|
|
|
|
4. **Install the appropriate PyTorch version** |
|
|
Depending on your hardware and CUDA version, install PyTorch by following the instructions at [https://pytorch.org/get-started/locally](https://pytorch.org/get-started/locally). |
|
|
This project was tested with CUDA 12.4, but it should also work with compatible earlier or later versions. |
|
|
|
|
|
After installing all of the dependencies, you can run inference.
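As a quick sanity check after installation (a minimal sketch assuming only that PyTorch is installed), you can verify that PyTorch sees your GPU:

```python
import torch

# Report the installed PyTorch version and whether a CUDA device is visible.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```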
|
|
|
|
|
## Available Models
|
|
We fine-tuned three different sizes of Whisper, all of which support English-only transcription.
|
|
A `large-v2` model fine-tuned on multilingual data is also available; it supports English, French, Spanish, German, and Portuguese with a chunk size of 300 milliseconds.
|
|
|
|
|
| Size | Chunk Size [msec] | Multilingual Chunk Size [msec] |
|:----:|:-----------------:|:------------------------------:|
| base | 40, 100, 200, 300 | N/A |
| small | 40, 100, 200, 300, 1000 | N/A |
| large-v2 | 40, 100, 200, 300, 1000 | 300 |
|
|
|
|
|
|
|
|
## Running Inference
|
|
To run inference, download the repository content and run the commands from the repository root, as described in the following sections.
|
|
|
|
|
> **Note:** The models are hosted on the [Hugging Face Hub](https://huggingface.co/), which requires an access token. |
|
|
> Make sure you are logged in with your token to access the models. |
|
|
|
|
|
### How to Apply Your Hugging Face 🤗 Access Token
|
|
|
|
|
1. **Create a Hugging Face account** (if you don't have one) at [https://huggingface.co/join](https://huggingface.co/join).
|
|
|
|
|
2. **Generate an access token:** |
|
|
- Go to your Hugging Face account settings: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) |
|
|
- Click on **"New token"**, give it a name, select the appropriate scopes (usually `read` is enough), and create it. |
|
|
|
|
|
3. **Login using the Hugging Face CLI:** |
|
|
Install the CLI if you don't have it:
|
|
```bash |
|
|
pip install huggingface_hub |
|
|
``` |
|
|
Then login: |
|
|
```bash |
|
|
huggingface-cli login |
|
|
``` |
|
|
Paste your token when prompted. |
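If you prefer to authenticate from Python instead of the CLI (for example, inside a notebook), the `huggingface_hub` library provides a `login` helper; a minimal sketch:

```python
from huggingface_hub import login

# Prompts for your access token; alternatively pass it directly, e.g. login(token="hf_...").
login()
```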
|
|
|
|
|
|
|
|
### CLI Usage
|
|
The transcription model can be run with the following command:
|
|
```bash |
|
|
# Using a local microphone for streaming transcription, dumping the recording to out.wav
python transcribe.py \
    --output_filename out.wav \
    --channels 2 \
    --model small \
    --chunk_size 300 \
    --device cuda \
    --beam_size 5 \
    --ca_kv_cache
|
|
``` |
|
|
|
|
|
You can also simulate a stream over a WAV file:
|
|
```bash |
|
|
# Simulating a stream on a wav file
python transcribe.py \
    --model small \
    --chunk_size 300 \
    --device cuda \
    --beam_size 5 \
    --ca_kv_cache \
    --wav_file /path/to/audio.wav \
    --simulate_stream \
    --use_latency
|
|
``` |
|
|
|
|
|
### Python Usage
|
|
If you prefer using Python, a code snippet using either a microphone or a WAV file is provided below:
|
|
|
|
|
```python |
|
|
import torch
import careless_whisper_stream

model_size = "small"  # model size
chunk_size = 300      # chunk size in milliseconds
multilingual = False  # currently only large-v2 with a 300 msec chunk size supports languages other than English
device = "cuda" if torch.cuda.is_available() else "cpu"

model = careless_whisper_stream.load_streaming_model(name=model_size,
                                                     gran=chunk_size,
                                                     multilingual=multilingual,
                                                     device=device)

# Using a local microphone recording
texts_microphone = model.transcribe(output_filename="/path/to/dump/file.wav",
                                    channels=2,
                                    beam_size=5,
                                    ca_kv_cache=True)

# Simulating a stream on a wav file
texts_wav_simulation = model.transcribe(simulate_stream=True,
                                        wav_file="/path/to/file/you/want/to/transcribe.wav",
                                        beam_size=5,
                                        ca_kv_cache=True)
|
|
``` |
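The snippets above store the result in `texts_microphone` and `texts_wav_simulation`. Their exact type is not documented in this card; assuming they are iterables of transcribed text segments (an assumption to verify against the repository), you could assemble a full transcript as follows:

```python
# Assumption: texts_wav_simulation is an iterable of transcribed text segments.
full_transcript = " ".join(str(segment) for segment in texts_wav_simulation)
print(full_transcript)
```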
|
|
|
|
|
## Training
|
|
To train using LoRA, you can use our existing training code. Make sure all of the requirements are installed.
|
|
|
|
|
### Dataset Structure
|
|
|
|
|
Before starting model training using the command-line interface provided below, you must first configure your dataset dictionary file located at `training_code/ds_dict.py`. |
|
|
|
|
|
This file defines a Python dictionary named `ds_paths`, where you should specify paths to the `train`, `val`, and `test` partitions of your dataset. Each partition should be a CSV file with the following three columns: |
|
|
|
|
|
1. `wav_path` – Path to the WAV audio file.
|
|
2. `tg_path` – Path to the corresponding `.TextGrid` file containing forced alignment.
|
|
3. `raw_text` – Ground truth transcription.
|
|
|
|
|
> **Note:** The dictionary key (i.e., the name of the dataset) will be used by the training script to identify and load the dataset correctly. |
|
|
|
|
|
You can find an example entry in `training_code/ds_dict.py`. |
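For illustration only, an entry in `ds_paths` might look like the sketch below; the key name matches the `--dataset` flag used in the training command further down, but the file paths (and the nested layout) are placeholders, so rely on the example shipped in `training_code/ds_dict.py` rather than this snippet:

```python
# training_code/ds_dict.py (illustrative sketch; the CSV paths are placeholders)
ds_paths = {
    "LIBRI-960-ALIGNED": {
        "train": "/data/librispeech/train_960_aligned.csv",  # CSV with wav_path, tg_path, raw_text columns
        "val": "/data/librispeech/dev_clean_aligned.csv",
        "test": "/data/librispeech/test_clean_aligned.csv",
    },
}
```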
|
|
|
|
|
### CLI Interface
|
|
```bash |
|
|
python training_code/train.py \
    --lora \
    --streaming_train \
    --simulate_stream \
    --dataset LIBRI-960-ALIGNED \
    --name example_training_base_model \
    --size base \
    --batch_size 32 \
    --epochs 10 \
    --learning_rate 1e-5 \
    --rank 32 \
    --gran 15 \
    --extra_gran_blocks 1 \
    --streaming_fraction 0.25 \
    --top_k 5
|
|
``` |
|
|
|
|
|
For more options and training configurations, run: |
|
|
```bash |
|
|
python training_code/train.py --help |
|
|
``` |
|
|
|
|
|
## License
|
|
|
|
|
This repository uses a dual license: |
|
|
|
|
|
[MIT License](https://opensource.org/licenses/MIT)
|
|
Portions derived from [OpenAI Whisper](https://github.com/openai/whisper) are licensed under the **MIT License**. |
|
|
|
|
|
[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
|
|
All other original code in this repository is licensed under the **Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)**. |
|
|
|
|
|
See the [LICENSE](./LICENSE) file for full details. |