Whisper-Small Portuguese - Full Synthetic Data (Unfiltered)
This model is a fine-tuned version of openai/whisper-small for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.
Purpose
This model completes the evaluation of synthetic data augmentation strategies for Whisper-Small Portuguese. It uses all available synthetic data (100%) without any WAVe filtering to test whether maximum data volume can compensate for the architectural limitations of smaller models.
Key Finding: Using all synthetic data (unfiltered) results in the worst performance among all Small Portuguese configurations, confirming that:
- Quality filtering provides no benefit for Small models
- Adding low-quality synthetic data actively hurts performance
- Model capacity, not data volume or quality, is the fundamental constraint
| Metric | CV-Only Baseline | This Model (Unfiltered) | Change |
|---|---|---|---|
| Test WER (CV) | 13.87% | 14.22% | +2.5% relative (worse) |
| Test WER (MLS) | 30.69% | 30.85% | +0.5% relative (worse) |
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 + ALL Synthetic (Unfiltered) |
| Total Training Samples | 43,834 |
| Sampling Rate | 16kHz |
Evaluation Results
This Model (whisper-small-cv-full-synthetic-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.2100 |
| Validation WER | 12.94% |
| Test WER (Common Voice) | 14.22% |
| Test WER (MLS) | 30.85% |
| Best Checkpoint | Step 350 |
| Max Training Steps | 860 |
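The WER figures above are reported by the authors; for reference, a minimal sketch of how word error rate is typically computed with the `evaluate` library (the sample transcripts below are illustrative, not from the test sets):

```python
# Minimal WER computation sketch; the transcripts shown are illustrative only.
import evaluate

wer_metric = evaluate.load("wer")

references = ["o gato está no telhado", "bom dia a todos"]   # ground-truth transcripts
predictions = ["o gato esta no telhado", "bom dia a todos"]  # model outputs

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```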
Comparison with Other Training Configurations (Whisper-Small Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.2000 | 12.68% | 13.87% | 30.69% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.2100 | 12.98% | 14.28% | 30.40% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.2100 | 12.97% | 14.08% | 30.54% |
| All Synthetic + CV (Unfiltered) | 860 | 0.2100 | 12.94% | 14.22% | 30.85% |
Key Performance Characteristics
- Worst overall performance: both in-domain (Common Voice) and cross-domain (MLS) WER are worse than the baseline
- Most training steps: 860 steps (100% more than the 430-step baseline) spent to reach a worse result
- Largest dataset: 43,834 samples (double the baseline), yet worse performance
- Clear evidence: more data ≠ better performance for Small models
Complete Portuguese Small Model Rankings
| Rank | Configuration | Test WER (CV) | Test WER (MLS) | Recommendation |
|---|---|---|---|---|
| 1 | CV Only | 13.87% | 30.69% | Best choice |
| 2 | Mid-High (q≥0.5) | 14.08% | 30.54% | Research only |
| 3 | Unfiltered (this) | 14.22% | 30.85% | Not recommended |
| 4 | High-Quality (q≥0.8) | 14.28% | 30.40% | Research only |
Conclusion: For Whisper-Small Portuguese, do not use synthetic data augmentation. The CV-only baseline provides the best performance.
Small vs Large: Maximum Data Impact
Using all synthetic data produces opposite effects depending on model size:
| Model | Unfiltered Synthetic | Test WER (CV) | Test WER (MLS) | vs Baseline |
|---|---|---|---|---|
| Whisper-Small | 21,968 samples | 14.22% | 30.85% | Both worse |
| Whisper-Large-v3 | 21,968 samples | 8.33% | 13.43% | Both better |
For Large-v3, unfiltered synthetic data yields roughly a 30% relative WER improvement over its CV-only baseline; for Small, the same data degrades performance. This confirms that the benefit of synthetic data is fundamentally tied to model capacity.
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (all) | 21,968 | Complete TTS audio without filtering |
| Total | 43,834 | Combined training set |
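For illustration, a minimal sketch of how the two sources could be combined into one training set, assuming both load as Hugging Face datasets and share `audio`/`sentence` columns (the column layout of the synthetic set is an assumption here):

```python
from datasets import Audio, concatenate_datasets, load_dataset

# Real speech: Common Voice 17.0 Portuguese (requires accepting the dataset terms)
cv = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="train")

# Synthetic TTS speech, used in this configuration without any WAVe filtering
synthetic = load_dataset("yuriyvnv/synthetic_transcript_pt", split="train")

# Assumption: both sets expose "audio" and "sentence" columns;
# cast audio to Whisper's expected 16 kHz so the features match before concatenation
cv = cv.select_columns(["audio", "sentence"]).cast_column("audio", Audio(sampling_rate=16_000))
synthetic = synthetic.select_columns(["audio", "sentence"]).cast_column("audio", Audio(sampling_rate=16_000))

combined = concatenate_datasets([cv, synthetic])
print(len(combined))  # ~43,834 samples in the unfiltered configuration
```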
WAVe Quality Distribution (For Reference)
While this model uses all data, the quality distribution shows what was included:
| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✓ |
| Low (q < 0.5) | 2,787 | 12.7% | ✓ |
| Total | 21,968 | 100% | All used |
Including the 12.7% low-quality samples (2,787 samples) appears to actively hurt Small model performance.
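For comparison with the filtered configurations, the threshold-based selection can be sketched as below, assuming the synthetic dataset carries a per-sample WAVe quality score in a column (named `wave_score` here purely for illustration; adjust to the actual schema):

```python
from datasets import load_dataset

synthetic = load_dataset("yuriyvnv/synthetic_transcript_pt", split="train")

# Column name "wave_score" is an assumption, not a confirmed field of the dataset.
high_quality = synthetic.filter(lambda ex: ex["wave_score"] >= 0.8)  # ~7,312 samples
mid_high     = synthetic.filter(lambda ex: ex["wave_score"] >= 0.5)  # ~19,181 samples
unfiltered   = synthetic                                             # 21,968 samples (this model)
```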
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
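A sketch of these settings expressed as `Seq2SeqTrainingArguments`; the per-device batch size and gradient accumulation shown are assumptions chosen to reach the reported global batch of 256, not the authors' exact launch configuration:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-cv-full-synthetic-pt",
    learning_rate=1e-5,
    per_device_train_batch_size=64,     # assumption: 64 x 4 accumulation steps = 256 global batch
    gradient_accumulation_steps=4,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```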
Training Infrastructure
- GPU: NVIDIA H200 (140GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-full-synthetic-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-cv-full-synthetic-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-cv-full-synthetic-pt")
model.to("cuda")

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)

input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
```python
# Pin the language and task before calling generate()
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
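Alternatively, in recent `transformers` versions the same hints can be passed per call instead of mutating the generation config (exact keyword support depends on the library version):

```python
# Per-call form with the low-level API
predicted_ids = model.generate(input_features, language="pt", task="transcribe")

# Per-call form with the pipeline API
result = transcriber(
    "path/to/portuguese_audio.wav",
    generate_kwargs={"language": "pt", "task": "transcribe"},
)
```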
When to Use This Model
Not recommended for production use.
This model is useful for:
- Research purposes: Understanding the negative impact of unfiltered synthetic data on small models
- Ablation studies: Complete picture of synthetic data effects across filtering thresholds
- Comparison baseline: Demonstrating worst-case synthetic augmentation
For production use:
- whisper-small-cv-only-pt: Best Small model (13.87% WER)
- whisper-large-v3-mixed-pt: Best overall (8.33% WER, 10.27% MLS)
Research Conclusions
This model completes our analysis of synthetic data augmentation for Portuguese ASR:
Key Findings:
- Model capacity is the primary factor: Small models cannot leverage synthetic data regardless of quality or volume
- More data can hurt: Doubling the dataset size (43k vs 22k) results in worse performance for Small models
- Quality filtering is insufficient: Even strict filtering (q ≥ 0.8) doesn't help Small models
- Architecture-first decisions: Choose model size based on deployment constraints, then decide on augmentation
Practical Recommendations:
| Deployment | Recommendation |
|---|---|
| Resource-constrained | Use Whisper-Small with CV-only data |
| Quality-focused | Use Whisper-Large-v3 with quality-filtered synthetic |
| Cross-domain robustness | Use Whisper-Large-v3 with mid-high quality synthetic |
Limitations
- Worst Small model performance: 14.22% WER on Common Voice, 2.5% relative worse than the CV-only baseline
- Wasted compute: 100% more training steps than the baseline with no accuracy gain
- Architecture limitation: Cannot leverage synthetic data effectively
- Domain specificity: Optimized for general Portuguese
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-small
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Training Data (Synthetic): yuriyvnv/synthetic_transcript_pt
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0