Whisper-Small Portuguese - Mid-High Quality Filtered Synthetic Data
This model is a fine-tuned version of openai/whisper-small for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with WAVe-filtered synthetic speech data using a balanced quality threshold (q ≥ 0.5), including both high-quality and medium-quality samples.
Purpose
This model tests whether increasing synthetic data volume with moderate quality filtering can benefit smaller model architectures. The results continue to support the finding that Small models cannot effectively leverage synthetic data:
Key Finding: Even with the balanced threshold (q ≥ 0.5) that works best for Large-v3 models, the Small model shows no improvement over the CV-only baseline, reinforcing that model capacity is the fundamental limiting factor.
| Metric | CV-Only Baseline | This Model (Mid-High) | Large-v3 (Same Threshold) |
|---|---|---|---|
| Test WER (CV) | 13.87% | 14.08% (-1.5%) | 8.33% (+29.3%) |
| Test WER (MLS) | 30.69% | 30.54% (+0.5%) | 10.27% (+32.9%) |

Parenthesized values are relative WER improvements versus the corresponding CV-only baseline; positive means better.
The contrast with Large-v3 is stark: the same data configuration that produces dramatic improvements for large models provides negligible benefit for small models.
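The relative deltas shown in parentheses can be reproduced with a one-line helper (a sketch; the WER figures are taken from the table above):

```python
def rel_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER improvement over a baseline, in percent (positive = better)."""
    return (baseline_wer - new_wer) / baseline_wer * 100

print(round(rel_improvement(13.87, 14.08), 1))  # Small, Common Voice: -1.5
print(round(rel_improvement(30.69, 30.54), 1))  # Small, MLS: 0.5
```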
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 + Mid-High Quality Synthetic (q ≥ 0.5) |
| Total Training Samples | 41,047 |
| Sampling Rate | 16kHz |
Evaluation Results
This Model (whisper-small-mixed-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.2100 |
| Validation WER | 12.97% |
| Test WER (Common Voice) | 14.08% |
| Test WER (MLS) | 30.54% |
| Best Checkpoint | Step 300 |
| Max Training Steps | 805 |
Comparison with Other Training Configurations (Whisper-Small Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.2000 | 12.68% | 13.87% | 30.69% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.2100 | 12.98% | 14.28% | 30.40% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.2100 | 12.97% | 14.08% | 30.54% |
| All Synthetic + CV | 860 | 0.2100 | 12.94% | 14.22% | 30.85% |
Key Performance Characteristics
- Best in-domain among augmented: 14.08% Test WER (best among synthetic-augmented Small models)
- Still worse than baseline: 1.5% worse than CV-only (13.87%)
- Marginal cross-domain: 30.54% MLS (only 0.5% better than baseline)
- Largest synthetic dataset: 19,181 synthetic samples used
Small vs Large: Same Data, Different Results
The mid-high quality threshold (q ≥ 0.5) is optimal for Large-v3 but ineffective for Small:
| Model | Same Configuration | Test WER (CV) | Test WER (MLS) | Benefit vs Baseline |
|---|---|---|---|---|
| Whisper-Small | Mid-High + CV | 14.08% | 30.54% | -1.5% / +0.5% |
| Whisper-Large-v3 | Mid-High + CV | 8.33% | 10.27% | +29.3% / +32.9% |
This ~30 percentage point difference in effectiveness demonstrates that synthetic data benefits are architecture-dependent.
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (q ≥ 0.5) | 19,181 | WAVe-filtered TTS audio (high + medium quality) |
| Total | 41,047 | |
WAVe Quality Distribution (Portuguese Synthetic Data)
| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✓ |
| Low (q < 0.5) | 2,787 | 12.7% | ✗ |
This threshold retains 87.3% of the synthetic dataset, matching the configuration that achieves the best cross-domain performance for Large-v3.
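The threshold logic amounts to a simple filter over WAVe scores (the `q` field name is an assumption for illustration; bucket counts come from the table above):

```python
def filter_by_quality(samples, threshold=0.5):
    """Keep synthetic samples whose WAVe quality score meets the threshold."""
    return [s for s in samples if s["q"] >= threshold]

# Sanity-check the reported bucket counts and retention rate.
high, medium, low = 7312, 11869, 2787
retained = high + medium                 # q >= 0.5 keeps high + medium quality
total_synthetic = high + medium + low

print(retained)                                        # 19181
print(round(retained / total_synthetic * 100, 1))      # 87.3
```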
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
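A minimal sketch of how this table might map onto Hugging Face `Seq2SeqTrainingArguments`. The output directory, save cadence, and the single-device batch split are assumptions (the card only reports a global batch size of 256); this is illustrative, not the authors' actual script:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-mixed-pt",  # assumed path
    learning_rate=1e-5,
    per_device_train_batch_size=256,        # global 256, assuming one GPU
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,                              # BF16 precision
    optim="adamw_torch_fused",              # fused AdamW
    eval_strategy="steps",                  # older releases: evaluation_strategy
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,                # lower eval_loss is better
)
```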
Training Infrastructure
- GPU: NVIDIA H200 (141GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-mixed-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-mixed-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-mixed-pt")
model.to("cuda")

audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
```python
# Pin decoding to Portuguese transcription (bypasses language auto-detection)
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
When to Use This Model
This model is primarily useful for:
- Research purposes: Understanding how model capacity affects synthetic data utilization
- Best among augmented Small: If synthetic data must be used, this is the best configuration
- Comparing architectures: Demonstrating the Small/Large effectiveness gap
For production use, consider:
- whisper-small-cv-only-pt: Best Small model (13.87% WER)
- whisper-large-v3-mixed-pt: Best cross-domain (10.27% MLS)
Research Implications
This model reinforces key findings:
- Model capacity is fundamental: The same high-quality synthetic data that dramatically improves Large-v3 provides no benefit for Small
- Diminishing returns don't apply: It's not that Small models benefit less—they don't benefit at all
- Architecture selection matters: Choose model size first, then decide on synthetic augmentation
Recommendation: For resource-constrained deployments, invest in model optimization (quantization, distillation) rather than synthetic data augmentation.
Limitations
- Lower accuracy than baseline: 14.08% vs 13.87% (worse than CV-only)
- Wasted compute: 87% more training steps for no improvement
- Architecture limitation: Cannot leverage synthetic data effectively
- Domain specificity: Optimized for general Portuguese
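The "wasted compute" figure follows directly from the step counts in the comparison table:

```python
baseline_steps = 430  # Common Voice only run
mixed_steps = 805     # this model's run (Mid-High + CV)

extra = (mixed_steps - baseline_steps) / baseline_steps * 100
print(round(extra))   # -> 87 (percent more optimizer steps, no WER gain)
```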
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-small
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Training Data (Synthetic): yuriyvnv/synthetic_transcript_pt
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0