---
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
language:
- en
- fr
- de
- pt
- es
metrics:
- wer
base_model:
- openai/whisper-large-v2
- openai/whisper-small
- openai/whisper-base
pipeline_tag: automatic-speech-recognition
tags:
- streaming
- asr
- Transformer
- encoder-decoder
- pytorch
- audio
- speech
- Whisper
model-index:
- name: CarelessWhisper-large-v2
  results:
  - task:
      type: streaming-transcription-chunk-300msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 5.29
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 6
  - task:
      type: streaming-transcription-chunk-300msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 10.74
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 11.38
  - task:
      type: streaming-transcription-chunk-200msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 5.92
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 6.63
  - task:
      type: streaming-transcription-chunk-200msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 11.41
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 12.6
  - task:
      type: streaming-transcription-chunk-100msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 6.33
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 7.76
  - task:
      type: streaming-transcription-chunk-100msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 13.06
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 14.99
  - task:
      type: streaming-transcription-chunk-40msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 7.76
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 9.94
  - task:
      type: streaming-transcription-chunk-40msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 16.73
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 19.28
---
# CarelessWhisper - Causal Whisper Streaming Model
Causal Whisper Streaming is a fine-tuned version of OpenAI Whisper that can handle causal (streaming) input and perform real-time transcription.

[![arXiv](https://img.shields.io/badge/arXiv-2508.12301-b31b1b.svg)](https://arxiv.org/abs/2508.12301)  [![Demo on Hugging Face](https://img.shields.io/badge/πŸ€—%20Demo-Hugging%20Face-blueviolet?logo=huggingface&logoColor=white)](https://huggingface.co/spaces/MLSpeech/CarelessWhisper-causal-streaming)

## πŸ“„ Paper

For more details, see our [paper](https://arxiv.org/abs/2508.12301).

## πŸ”§ Setup
We used Python 3.9.16, PyTorch 2.6.0, and PyTorch-Lightning 2.5.0 to train and test our models.
Portions of this code are adapted from [OpenAI's Whisper](https://github.com/openai/whisper).

To set up the project environment using `conda`, follow these steps:

1. **Clone the repository**  
   ```bash
   git clone https://github.com/tomer9080/CarelessWhisper-streaming
   cd CarelessWhisper-streaming
   ```

> πŸ’‘ Make sure you have [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/products/distribution) installed before proceeding.

2. **Create the conda environment**
    ```bash
    conda env create -f environment.yml
    ```

3. **Activate the environment**
    ```bash
    conda activate careless_whisper
    ```

4. **Install the appropriate PyTorch version**  
   Depending on your hardware and CUDA version, install PyTorch by following the instructions at [https://pytorch.org/get-started/locally](https://pytorch.org/get-started/locally).  
   This project was tested with CUDA 12.4, but it should also work with compatible earlier or later versions.
 
After installing all of the dependencies, you are ready to run inference.
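
A quick, optional sanity check (standard PyTorch calls, nothing project-specific) confirms that PyTorch sees your GPU before you run inference:

```python
import torch

# Print the installed PyTorch version and check whether a CUDA device is visible.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```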

## πŸ€– Available Models
We fine-tuned three different sizes of Whisper, all of which support English-only transcription.
A `large-v2` model fine-tuned on multilingual data is also available; it supports English, French, Spanish, German, and Portuguese with a chunk size of 300 milliseconds.

| Size | English Chunk Sizes [msec] | Multilingual Chunk Sizes [msec] |
|:----:|:--------------------------:|:-------------------------------:|
| base | 40, 100, 200, 300 | N/A |
| small | 40, 100, 200, 300, 1000 | N/A |
| large-v2 | 40, 100, 200, 300, 1000 | 300 |
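
For example, using the loader shown in the Python usage section below, the multilingual `large-v2` checkpoint with a 300 msec chunk size can be loaded as follows (a minimal sketch; argument names are taken from that section):

```python
import careless_whisper_stream

# Multilingual transcription is currently available only for large-v2 with 300 msec chunks.
model = careless_whisper_stream.load_streaming_model(
    name="large-v2", gran=300, multilingual=True, device="cuda"
)
```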


## 🎀 Running Inference
To run inference, download the repository contents and run the commands in the following sections from the repository root.

> **Note:** The models are hosted on the [Hugging Face Hub](https://huggingface.co/), which requires an access token.  
> Make sure you are logged in with your token to access the models.

### How to Apply Your Hugging Face πŸ€— Access Token

1. **Create a Hugging Face account** (if you don’t have one) at [https://huggingface.co/join](https://huggingface.co/join).

2. **Generate an access token:**
   - Go to your Hugging Face account settings: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
   - Click on **"New token"**, give it a name, select the appropriate scopes (usually `read` is enough), and create it.

3. **Login using the Hugging Face CLI:**  
   Install the CLI if you don’t have it:
   ```bash
   pip install huggingface_hub
   ```
   Then login:
   ```bash
   huggingface-cli login
   ```
   Paste your token when prompted.
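
Alternatively, you can log in from Python using the `huggingface_hub` library (the token string below is a placeholder, not a real token):

```python
from huggingface_hub import login

# Authenticate with your Hugging Face access token (placeholder value shown).
login(token="hf_xxxxxxxxxxxxxxxxxxxx")
```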


### πŸ–₯️ CLI Usage
The transcription model can be run with the following command:
```bash
# Using a local microphone for streaming transcription, dumping the recording to out.wav
python transcribe.py \
--output_filename out.wav \
--channels 2 \
--model small \
--chunk_size 300 \
--device cuda \
--beam_size 5 \
--ca_kv_cache
```

A simulation of a stream on a wav file is also available:
```bash
# Simulating a stream on a wav file
python transcribe.py \
--model small \
--chunk_size 300 \
--device cuda \
--beam_size 5 \
--ca_kv_cache \
--wav_file /path/to/audio.wav \
--simulate_stream \
--use_latency
```

### 🐍 Python Usage
If you prefer using Python, the code snippet below shows transcription from a microphone or from a WAV file:

```python
import torch
import careless_whisper_stream

model_size = "small" # model size
chunk_size = 300 # chunk size in milliseconds
multilingual = False # currently, only large-v2 with a 300 msec chunk size supports languages other than English.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = careless_whisper_stream.load_streaming_model(name=model_size,
                                                   gran=chunk_size,
                                                   multilingual=multilingual,
                                                   device=device)

# using a local microphone recording 
texts_microphone = model.transcribe(output_filename="/path/to/dump/file.wav",
                         channels=2,
                         beam_size=5,
                         ca_kv_cache=True)

# Simulating on a wav file
texts_wav_simulation = model.transcribe(simulate_stream=True,
                                        wav_file="/path/to/file/you/want/to/transcribe.wav",
                                        beam_size=5,
                                        ca_kv_cache=True)
```

## 🦾 Training
To train with LoRA, you can use our existing training code. Make sure all the requirements are installed.

### πŸ“‚ Dataset Structure

Before starting model training using the command-line interface provided below, you must first configure your dataset dictionary file located at `training_code/ds_dict.py`.

This file defines a Python dictionary named `ds_paths`, where you should specify paths to the `train`, `val`, and `test` partitions of your dataset. Each partition should be a CSV file with the following three columns:

1. `wav_path` β€” Path to the WAV audio file.  
2. `tg_path` β€” Path to the corresponding `.TextGrid` file containing forced alignment.  
3. `raw_text` β€” Ground truth transcription.

> **Note:** The dictionary key (i.e., the name of the dataset) will be used by the training script to identify and load the dataset correctly.

You can find an example entry in `training_code/ds_dict.py`.
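
As an illustration only (the exact structure depends on your data; the paths here are hypothetical, and the dataset name matches the one used in the training command below), an entry in `ds_paths` could look like this:

```python
# training_code/ds_dict.py -- illustrative entry; adjust the paths to your own CSV files.
ds_paths = {
    "LIBRI-960-ALIGNED": {
        "train": "/path/to/train.csv",
        "val": "/path/to/val.csv",
        "test": "/path/to/test.csv",
    },
}
```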

### πŸ–₯️ CLI Interface
```bash
python training_code/train.py \
--lora \
--streaming_train \
--simulate_stream \
--dataset LIBRI-960-ALIGNED \
--name example_training_base_model \
--size base \
--batch_size 32 \
--epochs 10 \
--learning_rate 1e-5 \
--rank 32 \
--gran 15 \
--extra_gran_blocks 1 \
--streaming_fraction 0.25 \
--top_k 5
```

For more options and training configurations, run:
```bash
python training_code/train.py --help
```

## πŸ“œ License

This repository uses a dual license:

[![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)  
Portions derived from [OpenAI Whisper](https://github.com/openai/whisper) are licensed under the **MIT License**.  

[![CC BY-NC 4.0 License](https://img.shields.io/badge/License-CC--BY--NC%204.0-blue.svg)](https://creativecommons.org/licenses/by-nc/4.0/)  
All other original code in this repository is licensed under the **Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)**.  

See the [LICENSE](./LICENSE) file for full details.