Whisper-Small Portuguese - Full Synthetic Data (Unfiltered)
This model is a fine-tuned version of openai/whisper-small for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.
Purpose
This model completes the evaluation of synthetic data augmentation strategies for Whisper-Small Portuguese. It uses all available synthetic data (100%) without any WAVe filtering to test whether maximum data volume can compensate for the architectural limitations of smaller models.
Key Finding: Using all synthetic data (unfiltered) results in the worst performance among all Small Portuguese configurations, confirming that:
- Quality filtering provides no benefit for Small models
- Adding low-quality synthetic data actively hurts performance
- Model capacity, not data volume or quality, is the fundamental constraint
| Metric | CV-Only Baseline | This Model (Unfiltered) | Change |
|---|---|---|---|
| Test WER (CV) | 13.87% | 14.22% | +2.5% relative (worse) |
| Test WER (MLS) | 30.69% | 30.85% | +0.5% relative (worse) |
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 + ALL Synthetic (Unfiltered) |
| Total Training Samples | 43,834 |
| Sampling Rate | 16kHz |
Evaluation Results
This Model (whisper-small-cv-full-synthetic-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.2100 |
| Validation WER | 12.94% |
| Test WER (Common Voice) | 14.22% |
| Test WER (MLS) | 30.85% |
| Best Checkpoint | Step 350 |
| Max Training Steps | 860 |
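The WER figures above are reported by the authors; for reference, a minimal sketch of how word error rate is typically computed with the `evaluate` library (the sample transcripts below are illustrative, not from the test sets):

```python
# Minimal WER computation sketch; the transcripts shown are illustrative only.
import evaluate

wer_metric = evaluate.load("wer")

references = ["o gato está no telhado", "bom dia a todos"]   # ground-truth transcripts
predictions = ["o gato esta no telhado", "bom dia a todos"]  # model outputs

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```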
Comparison with Other Training Configurations (Whisper-Small Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.2000 | 12.68% | 13.87% | 30.69% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.2100 | 12.98% | 14.28% | 30.40% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.2100 | 12.97% | 14.08% | 30.54% |
| All Synthetic + CV (Unfiltered) | 860 | 0.2100 | 12.94% | 14.22% | 30.85% |
Key Performance Characteristics
- Worst overall performance: both in-domain (Common Voice) and cross-domain (MLS) WER are worse than the baseline
- Most training steps: 860 steps (100% more than the 430-step baseline) spent to reach a worse result
- Largest dataset: 43,834 samples (double the baseline), yet worse performance
- Clear evidence: more data ≠ better performance for Small models
Complete Portuguese Small Model Rankings
| Rank | Configuration | Test WER (CV) | Test WER (MLS) | Recommendation |
|---|---|---|---|---|
| 1 | CV Only | 13.87% | 30.69% | Best choice |
| 2 | Mid-High (q≥0.5) | 14.08% | 30.54% | Research only |
| 3 | Unfiltered (this) | 14.22% | 30.85% | Not recommended |
| 4 | High-Quality (q≥0.8) | 14.28% | 30.40% | Research only |
Conclusion: For Whisper-Small Portuguese, do not use synthetic data augmentation. The CV-only baseline provides the best performance.
Small vs Large: Maximum Data Impact
Using all synthetic data produces opposite effects depending on model size:
| Model | Unfiltered Synthetic | Test WER (CV) | Test WER (MLS) | vs Baseline |
|---|---|---|---|---|
| Whisper-Small | 21,968 samples | 14.22% | 30.85% | Both worse |
| Whisper-Large-v3 | 21,968 samples | 8.33% | 13.43% | Both better |
For Large-v3, unfiltered synthetic data yields roughly a 30% relative WER improvement over its CV-only baseline; for Small, the same data degrades performance. This confirms that the benefit of synthetic data is fundamentally tied to model capacity.
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (all) | 21,968 | Complete TTS audio without filtering |
| Total | 43,834 | Combined training set |
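For illustration, a minimal sketch of how the two sources could be combined into one training set, assuming both load as Hugging Face datasets and share `audio`/`sentence` columns (the column layout of the synthetic set is an assumption here):

```python
from datasets import Audio, concatenate_datasets, load_dataset

# Real speech: Common Voice 17.0 Portuguese (requires accepting the dataset terms)
cv = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="train")

# Synthetic TTS speech, used in this configuration without any WAVe filtering
synthetic = load_dataset("yuriyvnv/synthetic_transcript_pt", split="train")

# Assumption: both sets expose "audio" and "sentence" columns;
# cast audio to Whisper's expected 16 kHz so the features match before concatenation
cv = cv.select_columns(["audio", "sentence"]).cast_column("audio", Audio(sampling_rate=16_000))
synthetic = synthetic.select_columns(["audio", "sentence"]).cast_column("audio", Audio(sampling_rate=16_000))

combined = concatenate_datasets([cv, synthetic])
print(len(combined))  # ~43,834 samples in the unfiltered configuration
```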
WAVe Quality Distribution (For Reference)
While this model uses all data, the quality distribution shows what was included:
| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✓ |
| Low (q < 0.5) | 2,787 | 12.7% | ✓ |
| Total | 21,968 | 100% | All used |
Including the 12.7% low-quality samples (2,787 samples) appears to actively hurt Small model performance.
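For comparison with the filtered configurations, the threshold-based selection can be sketched as below, assuming the synthetic dataset carries a per-sample WAVe quality score in a column (named `wave_score` here purely for illustration; adjust to the actual schema):

```python
from datasets import load_dataset

synthetic = load_dataset("yuriyvnv/synthetic_transcript_pt", split="train")

# Column name "wave_score" is an assumption, not a confirmed field of the dataset.
high_quality = synthetic.filter(lambda ex: ex["wave_score"] >= 0.8)  # ~7,312 samples
mid_high     = synthetic.filter(lambda ex: ex["wave_score"] >= 0.5)  # ~19,181 samples
unfiltered   = synthetic                                             # 21,968 samples (this model)
```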
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
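A sketch of these settings expressed as `Seq2SeqTrainingArguments`; the per-device batch size and gradient accumulation shown are assumptions chosen to reach the reported global batch of 256, not the authors' exact launch configuration:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-cv-full-synthetic-pt",
    learning_rate=1e-5,
    per_device_train_batch_size=64,     # assumption: 64 x 4 accumulation steps = 256 global batch
    gradient_accumulation_steps=4,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```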
Training Infrastructure
- GPU: NVIDIA H200 (140GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-full-synthetic-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-cv-full-synthetic-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-cv-full-synthetic-pt")
model.to("cuda")

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)

input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
```python
# Pin the language and task before calling generate()
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
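Alternatively, in recent `transformers` versions the same hints can be passed per call instead of mutating the generation config (exact keyword support depends on the library version):

```python
# Per-call form with the low-level API
predicted_ids = model.generate(input_features, language="pt", task="transcribe")

# Per-call form with the pipeline API
result = transcriber(
    "path/to/portuguese_audio.wav",
    generate_kwargs={"language": "pt", "task": "transcribe"},
)
```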
When to Use This Model
Not recommended for production use.
This model is useful for:
- Research purposes: Understanding the negative impact of unfiltered synthetic data on small models
- Ablation studies: Complete picture of synthetic data effects across filtering thresholds
- Comparison baseline: Demonstrating worst-case synthetic augmentation
For production use:
- whisper-small-cv-only-pt: Best Small model (13.87% WER)
- whisper-large-v3-mixed-pt: Best overall (8.33% WER, 10.27% MLS)
Research Conclusions
This model completes our analysis of synthetic data augmentation for Portuguese ASR:
Key Findings:
- Model capacity is the primary factor: Small models cannot leverage synthetic data regardless of quality or volume
- More data can hurt: Doubling the dataset size (43k vs 22k) results in worse performance for Small models
- Quality filtering is insufficient: Even strict filtering (q ≥ 0.8) doesn't help Small models
- Architecture-first decisions: Choose model size based on deployment constraints, then decide on augmentation
Practical Recommendations:
| Deployment | Recommendation |
|---|---|
| Resource-constrained | Use Whisper-Small with CV-only data |
| Quality-focused | Use Whisper-Large-v3 with quality-filtered synthetic |
| Cross-domain robustness | Use Whisper-Large-v3 with mid-high quality synthetic |
Limitations
- Worst Small model performance: 14.22% WER on Common Voice, 2.5% relative worse than the CV-only baseline
- Wasted compute: 100% more training steps than the baseline with no accuracy gain
- Architecture limitation: Cannot leverage synthetic data effectively
- Domain specificity: Optimized for general Portuguese
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-small
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Training Data (Synthetic): yuriyvnv/synthetic_transcript_pt
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0