---
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
language:
- en
- fr
- de
- pt
- es
metrics:
- wer
base_model:
- openai/whisper-large-v2
- openai/whisper-small
- openai/whisper-base
pipeline_tag: automatic-speech-recognition
tags:
- streaming
- asr
- Transformer
- encoder-decoder
- pytorch
- audio
- speech
- Whisper
model-index:
- name: CarelessWhisper-large-v2
  results:
  - task:
      type: streaming-transcription-chunk-300msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 5.29
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 6
  - task:
      type: streaming-transcription-chunk-300msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 10.74
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 11.38
  - task:
      type: streaming-transcription-chunk-200msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 5.92
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 6.63
  - task:
      type: streaming-transcription-chunk-200msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 11.41
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 12.6
  - task:
      type: streaming-transcription-chunk-100msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 6.33
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 7.76
  - task:
      type: streaming-transcription-chunk-100msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 13.06
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 14.99
  - task:
      type: streaming-transcription-chunk-40msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 7.76
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 9.94
  - task:
      type: streaming-transcription-chunk-40msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 16.73
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 19.28
---
# CarelessWhisper - Causal Whisper Streaming Model
Causal Whisper Streaming is a fine-tuned version of OpenAI Whisper that handles causal (streaming) audio and performs real-time transcription.
[Paper](https://arxiv.org/abs/2508.12301) | [Demo](https://huggingface.co/spaces/MLSpeech/CarelessWhisper-causal-streaming)
## Paper
For more details, see our [paper](https://arxiv.org/abs/2508.12301).
## Setup
We used Python 3.9.16, PyTorch 2.6.0, and PyTorch-Lightning 2.5.0 to train and test our models.
Portions of this code are adapted from [OpenAI's Whisper](https://github.com/openai/whisper).
To set up the project environment using `conda`, follow these steps:
1. **Clone the repository**
```bash
git clone https://github.com/tomer9080/CarelessWhisper-streaming
cd CarelessWhisper-streaming
```
> Make sure you have [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/products/distribution) installed before proceeding.
2. **Create the conda environment**
```bash
conda env create -f environment.yml
```
3. **Activate the environment**
```bash
conda activate careless_whisper
```
4. **Install the appropriate PyTorch version**
Depending on your hardware and CUDA version, install PyTorch by following the instructions at [https://pytorch.org/get-started/locally](https://pytorch.org/get-started/locally).
This project was tested with CUDA 12.4, but it should also work with compatible earlier or later versions.
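For reference, a CUDA 12.4 install of the PyTorch version used here could look like the example below (illustrative only; follow the official PyTorch instructions for your exact platform and CUDA version):
```bash
# Example only: PyTorch 2.6.0 wheels built for CUDA 12.4 (adjust to your setup)
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```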
Once all dependencies are installed, you can run inference.
## Available Models
We fine-tuned three sizes of Whisper, all supporting English-only transcription.
A `large-v2` model fine-tuned on multilingual data is also available; it supports English, French, Spanish, German, and Portuguese with a chunk size of 300 milliseconds.
| Size     | Chunk Sizes [msec]       | Multilingual Chunk Size [msec] |
|:--------:|:------------------------:|:------------------------------:|
| base     | 40, 100, 200, 300        | N/A                            |
| small    | 40, 100, 200, 300, 1000  | N/A                            |
| large-v2 | 40, 100, 200, 300, 1000  | 300                            |
## Running Inference
To run inference, download the repository content and run the commands below from the repository root, as described in the following sections.
> **Note:** The models are hosted on the [Hugging Face Hub](https://huggingface.co/), which requires an access token.
> Make sure you are logged in with your token to access the models.
### How to Apply Your Hugging Face Access Token
1. **Create a Hugging Face account** (if you don't have one) at [https://huggingface.co/join](https://huggingface.co/join).
2. **Generate an access token:**
- Go to your Hugging Face account settings: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
- Click on **"New token"**, give it a name, select the appropriate scopes (usually `read` is enough), and create it.
3. **Log in using the Hugging Face CLI:**
Install the CLI if you don't have it:
```bash
pip install huggingface_hub
```
Then login:
```bash
huggingface-cli login
```
Paste your token when prompted.
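If you prefer to authenticate from Python rather than the CLI, a minimal sketch using the `huggingface_hub` API looks like this (the token string below is a placeholder):
```python
from huggingface_hub import login

# Authenticate with your personal access token
# (or set the HF_TOKEN environment variable instead).
login(token="hf_xxx")  # placeholder token, not a real value
```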
### CLI Usage
The transcription model can be launched with the following command:
```bash
# Using a local microphone for streaming transcription, dumping the recording to out.wav
python transcribe.py \
--output_filename out.wav \
--channels 2 \
--model small \
--chunk_size 300 \
--device cuda \
--beam_size 5 \
--ca_kv_cache \
```
You can also simulate a stream over a WAV file:
```bash
# Simulating a stream on a wav file
python transcribe.py \
--model small \
--chunk_size 300 \
--device cuda \
--beam_size 5 \
--ca_kv_cache \
--wav_file /path/to/audio.wav \
--simulate_stream \
--use_latency
```
### Python Usage
If you prefer Python, code snippets for transcribing from a microphone or a WAV file are provided below:
```python
import torch
import careless_whisper_stream
model_size = "small"  # model size: "base", "small", or "large-v2"
chunk_size = 300  # chunk size in milliseconds
multilingual = False  # currently only large-v2 with a 300 msec chunk size supports languages other than English
device = "cuda" if torch.cuda.is_available() else "cpu"
model = careless_whisper_stream.load_streaming_model(name=model_size,
                                                     gran=chunk_size,
                                                     multilingual=multilingual,
                                                     device=device)
# Using a local microphone recording
texts_microphone = model.transcribe(output_filename="/path/to/dump/file.wav",
                                    channels=2,
                                    beam_size=5,
                                    ca_kv_cache=True)
# Simulating a stream on a wav file
texts_wav_simulation = model.transcribe(simulate_stream=True,
                                        wav_file="/path/to/file/you/want/to/transcribe.wav",
                                        beam_size=5,
                                        ca_kv_cache=True)
```
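The examples above assume `transcribe` returns the transcribed text chunks; under that assumption, you can inspect the streaming output like this:
```python
# Assuming the result is an iterable of transcribed text chunks (see the examples above).
for i, text in enumerate(texts_wav_simulation):
    print(f"chunk {i}: {text}")
```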
## Training
To train with LoRA, you can use the provided training code. Make sure all the requirements are installed.
### Dataset Structure
Before starting model training using the command-line interface provided below, you must first configure your dataset dictionary file located at `training_code/ds_dict.py`.
This file defines a Python dictionary named `ds_paths`, where you should specify paths to the `train`, `val`, and `test` partitions of your dataset. Each partition should be a CSV file with the following three columns:
1. `wav_path` β Path to the WAV audio file.
2. `tg_path` β Path to the corresponding `.TextGrid` file containing forced alignment.
3. `raw_text` β Ground truth transcription.
> **Note:** The dictionary key (i.e., the name of the dataset) will be used by the training script to identify and load the dataset correctly.
You can find an example entry in `training_code/ds_dict.py`.
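For illustration, a hypothetical `ds_paths` entry could look like the sketch below; the CSV paths are placeholders, the nested-dictionary layout is an assumption, and only the dataset key `LIBRI-960-ALIGNED` (used in the CLI example below) is taken from this README:
```python
# training_code/ds_dict.py -- illustrative sketch, not the shipped file
ds_paths = {
    "LIBRI-960-ALIGNED": {
        "train": "/data/librispeech/train.csv",  # CSV with wav_path, tg_path, raw_text columns
        "val": "/data/librispeech/val.csv",
        "test": "/data/librispeech/test.csv",
    },
}
```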
### CLI Interface
```bash
python training_code/train.py \
--lora \
--streaming_train \
--simulate_stream \
--dataset LIBRI-960-ALIGNED \
--name example_training_base_model \
--size base \
--batch_size 32 \
--epochs 10 \
--learning_rate 1e-5 \
--rank 32 \
--gran 15 \
--extra_gran_blocks 1 \
--streaming_fraction 0.25 \
--top_k 5 \
```
For more options and training configurations, run:
```bash
python training_code/train.py --help
```
## License
This repository uses a dual license:
[MIT License](https://opensource.org/licenses/MIT)
Portions derived from [OpenAI Whisper](https://github.com/openai/whisper) are licensed under the **MIT License**.
[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
All other original code in this repository is licensed under the **Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)**.
See the [LICENSE](./LICENSE) file for full details.